Most introductions to Apache Kafka stop at producer and consumer basics. They show you how to publish a message and read it back, declare victory, and leave out everything that matters when you’re running Kafka in production: partition design, consumer group rebalancing, exactly-once semantics, Kafka Connect for integrations, and the operational questions that will hit you at 2am. This article covers the patterns that experienced Kafka teams use — not the hello-world version.
Why Event-Driven Architecture Matters in 2026
The push toward event-driven systems is not just architectural fashion. Three concrete pressures are driving it:
- Service decoupling: Synchronous REST calls between services create cascading failures. If the inventory service is slow, the checkout service slows with it. An event-driven approach breaks that coupling — the checkout service publishes an event and moves on; the inventory service processes it when ready.
- Audit and replay: An event log is a complete history of what happened in your system. You can replay events to rebuild state, debug issues by replaying historical sequences, and audit every change without additional instrumentation.
- Real-time data pipelines: Moving data between databases, search indexes, caches, and analytics systems through Kafka is dramatically simpler than maintaining point-to-point ETL jobs.
Kafka is not always the right tool — we’ll cover that — but for systems where these properties matter, it remains the most mature and battle-tested option in 2026.
Kafka Architecture: The Concepts That Actually Matter
Topics, Partitions, and the Ordering Guarantee
Kafka provides ordering guarantees within a partition, not across an entire topic. This is one of the most common sources of bugs in Kafka implementations. If you have a topic with 12 partitions and you need events for a given customer to be processed in order, all events for that customer must go to the same partition.
The partition key is how you control this:
// Java producer — ensure all events for a customer_id go to the same partition
ProducerRecord<String, String> record = new ProducerRecord<>(
    "order-events",
    customerId,    // This is the partition key — Kafka hashes it to select the partition
    eventPayload
);
producer.send(record);
When you use a partition key, Kafka applies a consistent hash to route records with the same key to the same partition. This gives you ordered processing per key while still parallelizing across many keys.
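The routing can be sketched in a few lines. Note that Kafka's default partitioner hashes the key bytes with murmur2; the sketch below substitutes an MD5-based hash purely for illustration, so the mapping is deterministic but the partition numbers will not match what the Java client produces.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministically map a key to a partition.

    Illustration only: the real Kafka client hashes key bytes with
    murmur2, so actual partition assignments will differ.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event for the same customer lands on the same partition,
# so per-customer ordering is preserved across produces.
assert partition_for("customer-42", 12) == partition_for("customer-42", 12)
```

Different keys spread across partitions, which is what gives you parallelism across customers while keeping each customer's stream ordered.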
Partition count is a decision you should make carefully upfront because it’s difficult to change later. More partitions mean more parallelism (more consumers can work concurrently) but also more overhead (more TCP connections, more file handles on the broker). A practical starting point for most topics: start with the expected peak consumer count multiplied by 2, with a minimum of 6 and a maximum of 100 for most use cases.
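That heuristic can be written down directly. The clamp bounds (6 and 100) come from the guidance above, not from Kafka itself, so treat the function as a starting point rather than a rule.

```python
def suggested_partition_count(peak_consumers: int) -> int:
    """Heuristic from the text: 2x the expected peak consumer count,
    clamped to [6, 100]. A starting point, not a Kafka rule."""
    return max(6, min(100, peak_consumers * 2))

print(suggested_partition_count(2))   # 6  — small service hits the floor
print(suggested_partition_count(24))  # 48
print(suggested_partition_count(80))  # 100 — capped
```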
Consumer Groups and the Rebalancing Problem
A consumer group is the mechanism for parallel processing. Each partition is assigned to exactly one consumer in a group at a time. Add consumers and partitions are redistributed. Remove consumers (or they crash) and partitions are reassigned.
Rebalancing is where most Kafka performance problems originate. During a rebalance, all consumers in the group stop processing while partitions are reassigned. With the default eager rebalancing protocol, this pause can be seconds to tens of seconds for large consumer groups.
The solution is the cooperative sticky assignor, available since Kafka 2.4:
# Consumer configuration for cooperative rebalancing
# application.properties (Spring Kafka) or direct config
spring.kafka.consumer.properties.partition.assignment.strategy=\
    org.apache.kafka.clients.consumer.CooperativeStickyAssignor

# Or in Python with confluent-kafka
consumer_config = {
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'order-processor',
    'partition.assignment.strategy': 'cooperative-sticky',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,  # Always manage offsets manually in production
}
Cooperative rebalancing only reassigns the partitions that need to move, rather than revoking all assignments and starting over. For a 50-consumer group processing high-throughput topics, this is the difference between 200ms pauses and 30-second pauses during deployments.
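The difference is easy to see with a toy computation. This is not the assignor's actual algorithm; it just counts how many partitions change hands under each protocol when a third consumer joins a two-consumer group owning 12 partitions.

```python
def partitions_moved_eager(old: dict, new: dict) -> int:
    # Eager protocol: every partition is revoked before reassignment,
    # so processing pauses for all of them regardless of the new layout.
    return sum(len(parts) for parts in old.values())

def partitions_moved_cooperative(old: dict, new: dict) -> int:
    # Cooperative protocol: only partitions whose owner actually
    # changed are revoked and reassigned; the rest keep processing.
    old_owner = {p: c for c, parts in old.items() for p in parts}
    new_owner = {p: c for c, parts in new.items() for p in parts}
    return sum(1 for p, c in new_owner.items() if old_owner.get(p) != c)

# 12 partitions; consumer "C" joins a group of two:
old = {"A": [0, 1, 2, 3, 4, 5], "B": [6, 7, 8, 9, 10, 11]}
new = {"A": [0, 1, 2, 3], "B": [6, 7, 8, 9], "C": [4, 5, 10, 11]}
print(partitions_moved_eager(old, new))        # 12 — the whole group pauses
print(partitions_moved_cooperative(old, new))  # 4  — only the handoffs pause
```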
Exactly-Once Semantics: When You Actually Need Them
Kafka offers three delivery semantics: at-most-once (messages can be lost), at-least-once (messages can be duplicated), and exactly-once (idempotent, transactional). Exactly-once is expensive — it requires a transactional producer and consumers that read with isolation.level=read_committed, and it carries a measurable throughput cost.
Before enabling exactly-once, ask honestly: can your consumers be made idempotent instead? An idempotent consumer handles duplicate messages correctly by design, and it’s often easier than the transactional overhead.
# Idempotent consumer pattern — track processed event IDs
def process_order_event(event: dict, db_session) -> None:
    event_id = event["event_id"]

    # Check if already processed (use a database unique constraint or Redis SET NX)
    if db_session.query(ProcessedEvent).filter_by(event_id=event_id).first():
        logger.info(f"Skipping duplicate event {event_id}")
        return

    # Process the event
    process_order(event["payload"], db_session)

    # Record as processed atomically with the business operation
    db_session.add(ProcessedEvent(event_id=event_id))
    db_session.commit()
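A check-then-insert has a small race window if two consumers handle the same duplicate concurrently; a database unique constraint closes it, because the second insert fails atomically. A minimal sketch with sqlite3 — the table and helper names are illustrative, not from the article:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")

def process_once(event_id: str, payload: dict) -> bool:
    """Return True if the event was processed, False if it was a duplicate.

    The PRIMARY KEY constraint rejects the second insert, and the
    `with db:` block commits the marker and the business write together
    (or rolls both back on failure).
    """
    try:
        with db:  # one transaction
            db.execute("INSERT INTO processed_events VALUES (?)", (event_id,))
            # ... business operation goes here, inside the same transaction ...
    except sqlite3.IntegrityError:
        return False  # already processed — safe to skip
    return True

print(process_once("evt-1", {}))  # True  — first delivery
print(process_once("evt-1", {}))  # False — duplicate suppressed
```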
When you do need transactional exactly-once (Kafka Streams processing with guaranteed output exactly once to another topic), configure it explicitly:
# Transactional producer configuration
producer_config = {
    'bootstrap.servers': 'kafka:9092',
    'transactional.id': 'order-processor-instance-1',  # Unique per producer instance
    'enable.idempotence': True,
    'acks': 'all',
    'max.in.flight.requests.per.connection': 5,
}

# Transaction pattern: produce and commit consumer offsets atomically
producer.init_transactions()
try:
    producer.begin_transaction()
    producer.produce('output-topic', key=key, value=value)
    producer.send_offsets_to_transaction(offsets, consumer_group_metadata)
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()
    raise
Kafka Connect: The Pattern That Eliminates Custom ETL
Kafka Connect is consistently underutilized by teams who discover Kafka as a messaging system and never explore its integration capabilities. Connect is a framework for building and running connectors that stream data between Kafka and external systems — databases, object stores, search indexes, SaaS APIs — without writing consumer/producer code.
Debezium is the most important Kafka Connect source connector. It implements Change Data Capture (CDC) by reading the database’s replication log (the write-ahead log, in Postgres’s case) and publishing every row change as a Kafka event. This is transformative for data integration:
# Deploy the Debezium Postgres connector via the Kafka Connect REST API
curl -X POST http://connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "postgres-orders-cdc",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "database.hostname": "postgres",
      "database.port": "5432",
      "database.user": "replicator",
      "database.password": "secret",
      "database.dbname": "orders_db",
      "table.include.list": "public.orders,public.order_items",
      "topic.prefix": "cdc",
      "plugin.name": "pgoutput",
      "publication.autocreate.mode": "filtered",
      "decimal.handling.mode": "double",
      "slot.name": "debezium_orders"
    }
  }'

# Results in topics: cdc.public.orders, cdc.public.order_items
# Each message contains: before state, after state, operation type, timestamp
Now every change to your orders table is a Kafka event. You can build downstream consumers that update Elasticsearch for search, materialize into Redis for caching, stream into your data warehouse, and power real-time analytics dashboards — all from the same CDC stream, without polling the database or maintaining triggers.
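A downstream consumer sees each change as an envelope with before, after, and op fields. A minimal dispatcher is sketched below; the op codes and envelope fields follow Debezium's documented format, while the return strings stand in for hypothetical Elasticsearch index/delete calls:

```python
def handle_cdc_event(envelope: dict) -> str:
    """Route a Debezium change event by operation type.

    op codes: 'c' = insert, 'u' = update, 'd' = delete,
              'r' = snapshot read of an existing row.
    """
    op = envelope["op"]
    if op in ("c", "r"):
        return f"index {envelope['after']['order_id']}"    # new or snapshot row
    if op == "u":
        return f"reindex {envelope['after']['order_id']}"  # row changed
    if op == "d":
        return f"delete {envelope['before']['order_id']}"  # row removed
    raise ValueError(f"unknown op {op!r}")

event = {"op": "u",
         "before": {"order_id": "o-1", "status": "pending"},
         "after":  {"order_id": "o-1", "status": "shipped"},
         "ts_ms": 1760000000000}
print(handle_cdc_event(event))  # reindex o-1
```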
The sink side is equally powerful. The Elasticsearch Sink connector, S3 Sink connector, and JDBC Sink connector let you write data to these systems from Kafka topics without writing consumer code:
# Elasticsearch sink connector — automatically indexes events from the topic
{
  "name": "elasticsearch-orders-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "3",
    "topics": "cdc.public.orders",
    "connection.url": "http://elasticsearch:9200",
    "type.name": "_doc",
    "key.ignore": "false",
    "schema.ignore": "true",
    "behavior.on.malformed.documents": "warn"
  }
}
Schema Registry: Preventing the Contract Breaking Problem
As your event-driven system grows, schema evolution becomes a serious problem. A producer changes an event schema and breaks consumers that expect the old format. Without a schema registry, these breaks are discovered at runtime.
Confluent Schema Registry (open-source) enforces schema compatibility rules at produce time. The serializer registers the producer’s schema; the registry checks it against the existing versions for that subject; if the new schema violates the subject’s compatibility rule, registration fails and the produce call errors out before any consumer sees the bad message.
# Avro schema for an order event
{
  "type": "record",
  "name": "OrderCreated",
  "namespace": "com.example.orders",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "total_cents", "type": "long"},
    {"name": "status", "type": "string"},
    {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "shipping_address", "type": ["null", "string"], "default": null}
  ]
}

# Register the schema (fails if it violates the subject's compatibility rule)
curl -X POST http://schema-registry:8081/subjects/order-events-value/versions \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "<schema-json>"}'

# Check compatibility before deploying a schema change
curl -X POST http://schema-registry:8081/compatibility/subjects/order-events-value/versions/latest \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "<new-schema-json>"}'
Kafka vs. Alternatives: When Not to Use Kafka
Kafka is not the right tool for every use case. Be honest about the operational overhead:
- Small scale, simple queuing: If you have fewer than 10 services and moderate throughput, Redis Streams or RabbitMQ are dramatically simpler to operate. Kafka’s Zookeeper/KRaft requirements, replication configuration, and consumer group management are significant overhead.
- Request/response patterns: If your consumers need to return results to producers synchronously, Kafka is the wrong primitive. Use gRPC or REST, and reach for a message queue only when you specifically need asynchronous decoupling.
- Serverless/very low volume: Kafka is designed for high throughput. At very low message rates, the cost-per-message is high compared to SQS, Pub/Sub, or even a database queue.
Kafka shines at high throughput, ordered event streams, long retention, replay capability, and fan-out to many consumer groups. If those properties aren’t requirements, simpler tools are usually better.
Conclusion
Event-driven architecture with Kafka provides genuine benefits for systems that need service decoupling, audit logs, real-time data integration, and high-throughput processing. The patterns that matter in production — partition key design for ordering, cooperative rebalancing, idempotent consumers, Debezium CDC, and Schema Registry — are what separate functional Kafka implementations from brittle ones.
The operational investment is real. Kafka is not a managed service you can ignore after deployment. But for systems where its capabilities align with your requirements, it remains the most capable and battle-tested event streaming platform available.
Start with a clear understanding of your ordering requirements, design your partition keys around them, implement idempotent consumers, and add Kafka Connect CDC before writing any custom integration code. These four decisions will determine the long-term health of your Kafka implementation more than anything else.
