Understanding Communication Patterns and Kafka in a Distributed System
Table of contents
- Async vs Sync
- Example to understand the concepts
- Role of Message Queues:
- Non-Distributed Message Queue vs Distributed Message Queue
- Producer/Consumer vs Publish/Subscribe
- Kafka
- How a Consumer Consumes and Keeps Track
- Replication of Partitions
- What is Zookeeper and why is it no longer used in Kafka?
- Kafka vs RabbitMQ
- Quiz
Async vs Sync
- Synchronous tasks are high-priority tasks that require immediate execution and user feedback. They are generally associated with user actions that need immediate system response.
- Asynchronous tasks are tasks that can be processed in the background and are not time-sensitive. They don't need immediate user feedback and often involve long-running operations that can be offloaded to background systems.
The scenarios below show a clear distinction between tasks that require immediate, synchronous execution and those that can be handled in the background, asynchronously.
| Scenario | Synchronous Task (High Priority) | Asynchronous Task (Lower Priority) |
| --- | --- | --- |
| Banking Sector | Transactions (funds transfer, balance check, etc.) | Bank statement generation (6-month / 1-year history) |
| Uber | Ride search / booking | Invoice generation (past 2 months, mailed) |
| YouTube | Watch, like, comment (real-time user interaction) | Video upload & processing (can take time in the background) |
| Amazon | Placing order / payment (real-time feedback needed) | Order delivery (can be slow; tracking updates in the background) |
| General | Quick notifications (order placed, payment successful) | Delayed notifications (discount offers, order restocked) |
Example to understand the concepts
Let's say Amazon has two services:
1) Order Management Service (runs on multiple servers) 2) Delivery Service
Multiple instances of the Order Management Service send data to a single Delivery Service, so the Delivery Service receives far more requests than it can handle.
The user places an order with the Order Management Service (which does little work itself), and each of its servers forwards the order to the Delivery Service (which has to do the actual processing). With so many requests coming in, the Delivery Service can get overloaded and bottleneck the system.
To overcome this issue we place a queue (FIFO) between the two services.
Note: the request received by the Order Management Service should be placed in the queue first, and only then should the response be sent back to the user. (That way, even if the Order Management Service fails, the system doesn't lose the order.)
- What if the queue fails? Keeping a replica of the queue ensures that no messages are lost. This is part of the high-availability principle; distributed message queues provide it by replicating data across different nodes.
- What if the Delivery Service fails? A retry mechanism can be implemented: the message is returned to the queue and retried after a specific interval, or once the service is back online.
- What if the queue is full? If a queue is full or getting overloaded, additional queues can be added to handle more messages.
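Here is a minimal in-process sketch of this queue-in-the-middle idea, using Python's standard library as a stand-in for a real message queue (the service names, queue size, and retry policy are illustrative assumptions):

```python
import queue
import threading
import time

# In-process stand-in for a durable message queue between the two services.
order_queue = queue.Queue(maxsize=100)    # bounded: applies back-pressure when full

def order_management(order_id: int) -> str:
    order_queue.put({"order_id": order_id})   # enqueue first...
    return f"order {order_id} accepted"       # ...then respond to the user

def delivery_service() -> None:
    while True:
        order = order_queue.get()             # drain at the consumer's own pace
        try:
            time.sleep(0.1)                   # simulate slow processing
            print("delivered order", order["order_id"])
        except Exception:
            order_queue.put(order)            # naive retry: put the message back
        finally:
            order_queue.task_done()

threading.Thread(target=delivery_service, daemon=True).start()
for i in range(5):
    print(order_management(i))
order_queue.join()                            # wait until all orders are processed
```

A real deployment would use a durable, replicated queue rather than an in-memory one, but the flow is the same: acknowledge the user as soon as the order is safely enqueued, and let the slower service catch up asynchronously.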
Role of Message Queues:
- Asynchronous Communication: Message queues enable asynchronous communication between different services, meaning one service doesn't need to wait for another service to complete its task before moving on to its next task.
- Load Balancing: They can also help distribute the load evenly among different services or instances of a service.
- Controlling Throughput: By adjusting the rate at which messages are sent or received, you can control the throughput of the system.
- Decoupling Components: Message queues decouple the services, meaning the services do not need to interact with each other directly.
- Scaling: As the load increases, more queues or services reading from the queues can be added to scale the system.
- Buffering & Throttling: Queues can act as a buffer, holding messages when the processing service is not ready. Throttling can be implemented to control the rate of message processing based on the current load on the system.
Non-Distributed Message Queue vs Distributed Message Queue
| Criteria | Non-Distributed Message Queue | Distributed Message Queue |
| --- | --- | --- |
| Availability | Lower: since the system isn't distributed, a single point of failure can cause the entire service to be unavailable. | Higher: distributed queues are designed to avoid single points of failure. If one node fails, the system can still continue to operate. |
| Message Persistence | Depends on the specific queue technology and its configuration. Some may support persistent messaging, but it may not be as robust as in distributed systems. | More robust: messages in a distributed queue can be replicated across multiple nodes, ensuring that no data is lost in case of a node failure. |
| Scalability | Limited: the capacity is limited by the resources of the single machine where it operates. | Higher: since the system is distributed, it can be easily scaled up by adding more nodes. |
| Throughput | Lower: being limited to a single machine's resources, the throughput is limited compared to distributed systems. | Higher: as the load can be distributed across multiple machines, much higher throughput can be achieved. |
| Geographical Distribution | Limited: all the data resides on a single machine, which might be located in one geographic location. | Enabled: nodes can be spread across different geographical locations, which can help reduce latency and enhance data locality. |
| Reliability | Lower: since there's only a single machine, if it fails, the service becomes unavailable. | Higher: the distributed nature of these systems allows for built-in redundancy. If one node fails, others can take over its load. |
| Resilience | Lower: a single machine's failure can disrupt the whole system. | Higher: even when individual nodes fail, the system as a whole can continue functioning, making it highly resilient to faults. |
Producer/Consumer vs Publish/Subscribe
Producer/Consumer (One-to-one communication): In this pattern, a producer (Order Management Service) sends messages to a queue, and a consumer (Delivery Service) reads from that queue. The consumer and producer are decoupled. The producer can continue to add messages to the queue without knowing about the consumer's state. On the other hand, the consumer can consume messages from the queue at its own pace.
Example: Imagine a scenario where customers are placing orders on an e-commerce platform (Amazon, for instance). Each order can be seen as a message produced by the Order Management Service. The Delivery Service, which is responsible for processing these orders, acts as the consumer. It takes orders from the queue and processes them for delivery.
Tools: RabbitMQ, Apache Kafka, Amazon SQS.
Publish/Subscribe (One-to-many communication): In this pattern, a publisher (Order Management Service) sends messages to a topic, and multiple subscribers (like Delivery Services, Receipt Services) can receive those messages. It's a one-to-many relationship where one publisher sends messages to multiple subscribers who have expressed interest in receiving those messages.
Example: Consider the same e-commerce platform. When a customer places an order, the Order Management Service publishes a message (order details). Multiple services like Delivery Service (to process delivery) and Receipt Service (to generate a receipt) are interested in this message. They subscribe to this topic and receive the message.
Tools: Google Pub/Sub, Apache Kafka, RabbitMQ, AWS SNS.
These two communication patterns serve different purposes and the choice between them depends on the specific use case. The Producer/Consumer pattern is used when you need to distribute tasks among different workers (like processing orders). The Publish/Subscribe pattern is used when you want to broadcast messages to multiple receivers (like notifying different services about a new order).
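As a toy sketch of the difference in plain Python (all names here are made up for illustration): the shared queue hands each message to exactly one consumer, while publish delivers every message to all subscribers of a topic.

```python
import queue
from collections import defaultdict
from typing import Callable

# Producer/Consumer: each message is taken by exactly ONE consumer.
work_queue: "queue.Queue[dict]" = queue.Queue()
work_queue.put({"order_id": 42})
print("Delivery Service got", work_queue.get())    # one-to-one delivery

# Publish/Subscribe: each message goes to EVERY subscriber of the topic.
subscribers: defaultdict = defaultdict(list)

def subscribe(topic: str, handler: Callable[[dict], None]) -> None:
    subscribers[topic].append(handler)

def publish(topic: str, message: dict) -> None:
    for handler in subscribers[topic]:
        handler(message)                           # one-to-many broadcast

subscribe("orders", lambda m: print("Delivery Service got", m))
subscribe("orders", lambda m: print("Receipt Service got", m))
publish("orders", {"order_id": 42})
```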
Kafka
Reading Materials:
The Apache Kafka Handbook – How to Get Started Using Kafka
SoftwareMill Kafka Visualization
Apache Kafka is a distributed streaming platform designed for high-throughput, real-time data streaming and processing. Kafka is used in cases where technologies such as JMS, RabbitMQ, and AMQP may not be able to cope with the required volume and responsiveness.
Let's understand key Kafka concepts using an e-commerce example:
- Producer: Producers are the source of data in Kafka. They push records (messages) into topics. In our e-commerce example, the Order Management Service can be considered as a Producer, creating an order message every time a user places an order.
- Consumer: Consumers read data from Kafka topics. In our e-commerce example, the Delivery Service can be considered a Consumer, reading the order messages and processing them for delivery.
- Topic: A topic is a category or a feed name to which records are published. Topics in Kafka are always multi-subscriber, meaning a topic can have zero, one, or many consumers that subscribe to the data written to it. In our case, there could be a topic named "Orders" to which the Order Management Service publishes and from which the Delivery Service subscribes.
- Broker: A Kafka cluster consists of one or more servers known as brokers. Each broker can handle terabytes of messages. The messages are written to topics which reside on the brokers.
- Cluster: A Kafka cluster is a set of one or more brokers. This set can be scaled easily without any downtime.
- Partition: Topics can be divided into partitions for better organization and scalability. Each partition can be hosted on a different server, which means a single topic can be scaled horizontally across multiple servers to increase throughput. Each message within a partition gets an incremental id, called an offset.
- Offset: In Kafka, an offset is a unique identifier of a record within a partition. It denotes the position of the consumer in the partition.
- Replica: Replication is used in Kafka for fault-tolerance. Each partition can be replicated across multiple brokers so that even if a broker goes down, a copy of the data is maintained.
- Consumer Group: Kafka allows each consumer to be part of a consumer group. Consumer groups allow a pool of consumers to divide the processing of data over the topics.
In essence, in a running Kafka cluster, producers send records to topics. These records are then divided and stored across different partitions which can be on different brokers. Consumers, which can be organized in consumer groups for better efficiency and scalability, then consume these records from the topics.
Apache Kafka's capability to handle real-time data makes it a great choice for use-cases like real-time analytics, log aggregation, stream processing, and event sourcing.
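To make these roles concrete, here is a minimal sketch using the kafka-python client. The broker address (localhost:9092) and the "orders" topic are assumptions for illustration:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side (e.g. the Order Management Service)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", value={"order_id": 42, "item": "book"})
producer.flush()    # block until the broker has acknowledged the record

# Consumer side (e.g. the Delivery Service)
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="delivery-service",       # consumer group
    auto_offset_reset="earliest",      # start from the oldest available record
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    print(record.topic, record.partition, record.offset, record.value)
```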
What does a message consist of?
- a key - usually a string or an integer. It can be null; when present, it determines which partition of the topic the message is written to.
- a value - details about the event that happened (string / object)
- a timestamp
- a compression type
- headers for metadata (optional)
- partition and offset id (once the message is written to a topic)
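As a sketch of how these fields map onto a client API (again kafka-python, with the broker address and topic name assumed):

```python
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    compression_type="gzip",                   # compression type is set on the producer
)
future = producer.send(
    "orders",
    key=b"user-42",                            # key: decides the partition
    value=b'{"event": "order_placed"}',        # value: the event payload
    headers=[("source", b"web")],              # optional metadata headers
    timestamp_ms=int(time.time() * 1000),      # explicit timestamp
)
meta = future.get(timeout=10)                  # wait for the broker's ack
# partition and offset exist only once the message is written to the topic:
print("written to partition", meta.partition, "at offset", meta.offset)
```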
How a Consumer Consumes and Keeps Track
In Apache Kafka, a consumer keeps track of which records it has consumed by saving its "offset", i.e. its position in the stream. The consumer periodically commits its latest offset back to Kafka. This mechanism allows a consumer to stop and restart without losing its place.
However, Kafka itself does not track whether a message has been consumed by all consumers. Instead, each consumer (or more accurately, each consumer group) is responsible for tracking its own progress through the logs. The model Kafka follows is more of a "fire and forget" where messages are written to a topic, and consumers read from the topic at their own pace.
When we talk about multiple consumers, they are usually organized in consumer groups. A consumer group is a set of consumers which cooperate to consume data from Kafka topics. Within a consumer group, each consumer is assigned a set of partitions from topics it has subscribed to, so that each message is delivered to one consumer in the group. This is how Kafka provides message delivery semantics at the level of the consumer group.
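As a sketch (assuming a local broker and the "orders" topic), two kafka-python consumers started with the same group_id will have the topic's partitions split between them, so each record is handled by only one of the two:

```python
from kafka import KafkaConsumer

def run_worker(worker_id: int) -> None:
    # Both workers join the same group, so Kafka assigns each of them
    # a disjoint subset of the partitions of "orders".
    consumer = KafkaConsumer(
        "orders",
        bootstrap_servers="localhost:9092",
        group_id="delivery-service",
    )
    for record in consumer:
        print(f"worker {worker_id} got partition {record.partition}, offset {record.offset}")

# Run run_worker(1) and run_worker(2) in separate processes; Kafka
# rebalances the partitions of "orders" between the two group members.
```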
If you have multiple consumer groups, and you want to make sure that all consumer groups have read a particular message, you would need to monitor the offsets committed by each group and cross-check them with the latest offset in each partition. There isn't a built-in way in Kafka to verify that every consumer group has consumed a particular message, because Kafka's design expects the consumer to handle message tracking.
It's worth noting that the data in Kafka is kept for a configurable retention period, irrespective of whether the data has been consumed or not. So, the data will remain available for all consumers to read during this retention period, regardless of the consumption rate of other consumers.
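Here is a sketch of the offset bookkeeping described above: manual commits after processing, plus a cross-check of the group's committed offset against the latest offset in the partition. The broker address, topic, and group name are assumptions:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="delivery-service",        # the group whose progress we track
    enable_auto_commit=False,           # commit only after the work is done
)
tp = TopicPartition("orders", 0)        # partition 0 of the "orders" topic
consumer.assign([tp])

batch = consumer.poll(timeout_ms=1000)  # fetch a batch of records, if any
for record in batch.get(tp, []):
    print("processing offset", record.offset)
consumer.commit()                       # persist the group's new position

# Cross-check: how far behind is this group on this partition?
latest = consumer.end_offsets([tp])[tp]
committed = consumer.committed(tp) or 0
print("consumer lag:", latest - committed)
```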
Replication of Partitions
In Apache Kafka, each partition of data is replicated across multiple servers (brokers) for data reliability and system fault tolerance. This means that if one broker fails, the data is still available on another broker. For each partition, one broker acts as the "leader," handling all data requests, while the others are "followers" that duplicate the leader's data. The replicas for a given partition are spread across different brokers to ensure data availability even in the event of a broker failure, which is a cornerstone of Kafka's fault-tolerance design.
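A short sketch of creating such a replicated topic with kafka-python's admin client (the broker address, topic name, and factor of 3 are illustrative assumptions, and the cluster needs at least 3 brokers for this to succeed):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(
        name="orders",
        num_partitions=3,        # parallelism across brokers
        replication_factor=3,    # each partition is kept on 3 brokers
    )
])
# For each partition, one replica acts as leader and serves all requests;
# the other two follow it, ready to take over if the leader's broker fails.
```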
What is Zookeeper and why is it no longer used in Kafka?
Zookeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. It is used by many distributed systems for coordinating tasks, such as electing a master server, managing group membership, and managing metadata.
Zookeeper used to be crucial for Kafka. It helped Kafka keep track of the status of Kafka clusters, topics, partitions, etc. Basically, it's like a manager for Kafka, helping it keep everything organized and run smoothly.
However, maintaining Zookeeper along with Kafka was a bit complex and could slow things down. Also, having Zookeeper as a separate system could potentially create a single point of failure, making the whole system less reliable.
Kafka 2.8.0 introduced a big change: Kafka began to incorporate its own internal metadata management system, called KRaft mode (KRaft stands for Kafka Raft metadata mode), which allows it to operate without Zookeeper. This was a big step forward in simplifying Kafka's architecture, making it easier to manage, and improving its overall performance and reliability.
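For illustration, a minimal single-node KRaft configuration might look like the following server.properties sketch (the values are assumptions, not a production setup):

```properties
# The node acts as both broker and controller (no Zookeeper involved).
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
controller.listener.names=CONTROLLER
```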
So, while Zookeeper was once necessary for Kafka, advancements in Kafka's own capabilities have made it possible for Kafka to operate independently. This is good news for anyone working with Kafka because it simplifies the overall system and makes it more robust and efficient.
Kafka vs RabbitMQ
| Criteria | Apache Kafka | RabbitMQ |
| --- | --- | --- |
| Messaging Model | Log-based publish-subscribe (pub/sub) model optimized for real-time data feeds. Kafka retains all messages for a set period, allowing consumers to replay the stream. | Supports multiple messaging models such as pub/sub, request/reply, and point-to-point. RabbitMQ's focus is more on message routing, delivery, and guarantees. |
| Performance | High throughput, handling millions of messages per second, which makes it ideal for heavy-load scenarios. | Good performance for many use cases, but typically doesn't match Kafka's extremely high throughput. |
| Durability | Stores data on disk and provides intra-cluster replication, ensuring message durability. | Also provides message durability by storing data on disk, and supports replication between nodes. |
| Use Cases | Best for real-time streaming data analysis, log aggregation, and event sourcing. | More suitable for traditional messaging, task distribution, and situations where complex routing to multiple consumers is needed. |
| Ease of Use | More complex to set up and manage, due to its distributed nature and larger configuration surface. | Easier to set up and manage, with a user-friendly web-based management interface. |
| Message Delivery Semantics | At-least-once delivery is standard; exactly-once delivery is also supported with more complex configuration. | Supports at-most-once (messages can be lost), at-least-once (messages can be duplicated), and exactly-once (delivery is assured, with considerable performance implications) semantics. |
| Language Support | Provides producer and consumer APIs in multiple languages, including Java, Python, .NET, and Go. | Wide language support, with client libraries available for many modern programming languages. |
Both Kafka and RabbitMQ communicate over TCP: Kafka uses its own binary protocol on top of TCP, while RabbitMQ primarily uses AMQP.
Quiz
| Question | Answer Options | Correct Answer |
| --- | --- | --- |
| Which communication pattern is best suited for message queues in a distributed system? | Synchronous communication<br>Asynchronous communication<br>Direct peer-to-peer communication<br>Centralized client-server communication | Asynchronous communication |
| What is the advantage of the pub/sub pattern over point-to-point messaging? | Reduced network latency for message delivery<br>Simpler architecture with fewer components<br>Scalability by allowing multiple subscribers to receive the same message<br>Elimination of message queues for faster communication | Scalability by allowing multiple subscribers to receive the same message |
| How does Kafka ensure high throughput and scalability? | By using a centralized message broker for all communication<br>By minimizing the number of partitions in a topic<br>By utilizing a distributed architecture with multiple brokers<br>By enforcing strict consistency across all consumers | By utilizing a distributed architecture with multiple brokers |
| What is Apache Kafka primarily used for? | Database management<br>Real-time data processing and streaming<br>Front-end web development<br>Network security monitoring | Real-time data processing and streaming |
| What is a topic in Kafka? | A category or feed name to which records are published<br>A unique identifier for each Kafka consumer<br>The primary component responsible for data processing<br>A centralized data store for storing Kafka messages | A category or feed name to which records are published |
| What does "replication" mean in Kafka? | The process of copying messages from one topic to another<br>The process of copying partitions across multiple brokers for fault tolerance<br>The process of copying data from Kafka to an external database<br>The process of copying data from producers to consumers | The process of copying partitions across multiple brokers for fault tolerance |
| Which of the following is NOT a typical use case for Kafka? | Real-time stream processing of data<br>Log aggregation and collection from various sources<br>Managing distributed databases<br>Activity tracking and monitoring | Managing distributed databases |
| In Kafka, what is a broker? | A message sender in the pub/sub pattern<br>A consumer that receives messages from a topic<br>A node in the Kafka cluster responsible for message storage and handling<br>A topic administrator who manages access permissions | A node in the Kafka cluster responsible for message storage and handling |
| What is the role of a "producer" in Kafka? | To consume messages from topics and process them<br>To manage the distribution of messages across topics<br>To store and manage message logs<br>To publish messages to Kafka topics | To publish messages to Kafka topics |
| What is the level of decoupling between publishers and subscribers in the pub/sub pattern? | Tight coupling, as publishers need to be aware of all subscribers<br>Loose coupling, as publishers are unaware of subscribers' identities<br>No coupling, as there is no direct interaction between publishers and subscribers<br>Bidirectional coupling, as both publishers and subscribers communicate directly | Loose coupling, as publishers are unaware of subscribers' identities |