Tuesday, January 2, 2024

Kafka in Brief

Topic - a logical channel or feed name to which messages are published. In Kafka, messages are organized into topics.

Partition - an ordered, immutable sequence of messages within a topic; it acts as the queue to which messages are appended.

Broker (leader or follower) - a server instance where read and write operations are performed and where the published data is maintained.

Leader - the node responsible for all reads and writes for a given partition.

Follower - acts as a normal consumer: it pulls messages from the leader and updates its own data store.

Kafka Cluster - group of brokers

Producer - Producers publish messages to one or more Kafka topics by sending data to Kafka brokers. Every time a producer publishes a message to a broker, the broker simply appends the message to the last segment file; in effect, the message is appended to a partition. A producer can also send messages to a partition of its choice.
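
A minimal sketch of this with the Java producer client (the topic name Hello-Kafka matches the setup commands further down; the key "user-42" and the target partition 0 are purely illustrative choices):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // By default the partition is derived from the key (or assigned round-robin when there is no key).
            producer.send(new ProducerRecord<>("Hello-Kafka", "user-42", "first message"));
            // The producer can also target a partition of its choice (partition 0 here).
            producer.send(new ProducerRecord<>("Hello-Kafka", 0, "user-42", "pinned message"));
            producer.flush();
        }
    }
}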

Consumer - Consumers read data from brokers. A consumer subscribes to one or more topics and consumes published messages by pulling data from the brokers.

Replication factor - the number of copies of each partition kept across the brokers in the cluster. For each partition in a topic, one broker is assigned as the leader, where all the read and write operations are performed, while the other brokers hold replicas used for replication.


Kafka Setup

# Start ZooKeeper (standalone install; the Kafka distribution also ships
# bin/zookeeper-server-start.sh config/zookeeper.properties)
bin/zkServer.sh start

# Start the Kafka broker
bin/kafka-server-start.sh config/server.properties

# Create a topic
# localhost:2181 - default ZooKeeper address
# localhost:9092 - default Kafka broker address
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic Hello-Kafka

# Start a console producer (newer Kafka versions use --bootstrap-server instead of --broker-list)
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic Hello-Kafka

# Start a console consumer
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic Hello-Kafka --from-beginning

# Alter and describe a topic
bin/kafka-topics.sh --alter --bootstrap-server localhost:9092 --partitions 2 --topic Hello-Kafka
bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic Hello-Kafka


Consumer Group

When a consumer group is first created, you can tell it whether to start with the oldest or the newest message that Kafka has stored using the auto.offset.reset property.

Consumers in a consumer group share ownership of the partitions in the topics they subscribe to. When we add a new consumer to the group, it starts consuming messages from partitions previously consumed by another consumer. The same thing happens when a consumer shuts down or crashes: it leaves the group, and the partitions it used to consume will be consumed by one of the remaining consumers. Reassignment of partitions to consumers also happens when the topics the consumer group is consuming are modified (e.g., if an administrator adds new partitions). Moving partition ownership from one consumer to another is called a rebalance.
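
A minimal sketch of joining a consumer group with the Java client; the group id my-group is a hypothetical name, the topic is the Hello-Kafka topic from the setup section, and auto.offset.reset=earliest makes a brand-new group start from the oldest stored messages ("latest" would start from the newest):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.util.List;
import java.util.Properties;

public class GroupMember {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");             // consumers sharing this id split the partitions
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");    // where a brand-new group starts reading
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("Hello-Kafka"));  // partitions are assigned (and rebalanced) by the group coordinator
        // ... poll loop goes here; see the sketch in the next section
        consumer.close();
    }
}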


Kafka Consumers

The way consumers maintain membership in a consumer group and ownership of the partitions assigned to them is by sending heartbeats to a Kafka broker designated as the group coordinator. Heartbeats are sent when the consumer polls (i.e., retrieves records) and when it commits records it has consumed.

The first time poll() is called with a new consumer, it is responsible for finding the GroupCoordinator, joining the consumer group, and receiving a partition assignment. If a rebalance is triggered, it will be handled inside the poll loop as well. The heartbeats that keep consumers alive are sent from within the poll loop.
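
A minimal sketch of such a poll loop (Java client), assuming a consumer configured and subscribed as in the previous sketch:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;

public class PollLoop {
    static void run(KafkaConsumer<String, String> consumer) {
        try {
            while (true) {
                // The first poll() finds the GroupCoordinator, joins the group and receives a partition
                // assignment; subsequent polls fetch records and keep the consumer's membership alive.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                consumer.commitSync();  // commit the offsets of the records just processed
            }
        } finally {
            consumer.close();
        }
    }
}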


Configuring Consumers

fetch.min.bytes - This property allows a consumer to specify the minimum amount of data that it wants to receive from the broker when fetching records.

fetch.max.wait.ms - Lets you control how long the broker will wait before responding to the consumer's fetch request when there is not yet enough data to satisfy fetch.min.bytes.

If you set fetch.max.wait.ms to 100 ms and fetch.min.bytes to 1 MB, Kafka will receive a fetch request from the consumer and will respond with data either when it has 1 MB of data to return or after 100 ms, whichever happens first.

max.poll.records - This controls the maximum number of records that a single call to poll() will return. This is useful to help control the amount of data your application will need to process in the polling loop.

session.timeout.ms - The amount of time a consumer can be out of contact with the brokers while still considered alive defaults to 10 seconds.
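
Putting these properties together, an illustrative (not prescriptive) way to set them on the Java consumer, echoing the 1 MB / 100 ms scenario above:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import java.util.Properties;

public class ConsumerTuning {
    static Properties tunedProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1_048_576);   // respond once ~1 MB is available...
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 100);       // ...or after 100 ms, whichever comes first
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);        // cap on records returned by a single poll()
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 10_000);   // 10 s out of contact => consumer considered dead
        return props;
    }
}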


Apache Kafka is a distributed streaming platform for building real-time data pipelines and streaming applications at massive scale. Originally developed at LinkedIn, Kafka was created to solve the problem of ingesting high volumes of event data with low latency. It was open sourced in 2011 through the Apache Software Foundation and has since become one of the most popular event streaming platforms. Event streams are organized into topics that are distributed across multiple servers called brokers. This ensures data is easily accessible and resilient to system crashes. Applications that feed data into Kafka are called producers, while those that consume data are called consumers. Kafka's strength lies in its ability to handle massive amounts of data, its flexibility to work with diverse applications, and its fault tolerance. This sets it apart from simple messaging systems.

Kafka has become a critical component of modern system architecture due to its ability to enable real-time, scalable data streaming.

Kafka serves as a highly reliable, scalable messaging queue. It decouples data producers from data consumers, which allows them to operate independently and efficiently at scale. A major use case is activity tracking: Kafka is ideal for ingesting and storing real-time events like clicks, views, and purchases from high-traffic websites and applications. Companies like Uber and Netflix use Kafka for real-time analytics of user activity. For gathering data from many sources, Kafka can consolidate disparate streams into unified real-time pipelines for analytics and storage. This is extremely useful for aggregating Internet of Things and sensor data.

In a microservices architecture, Kafka serves as a real-time data bus that allows different services to talk to each other. Kafka is also great for monitoring and observability when integrated with the ELK stack. It collects metrics, application logs, and network data in real time, which can then be aggregated and analyzed to monitor overall system health and performance. Kafka enables scalable stream processing of big data through its distributed architecture and can handle massive volumes of real-time data streams: for example, processing user click streams for product recommendations, detecting anomalies in IoT sensor data, or analyzing financial market data. Its core queuing and messaging features power an array of critical applications and workloads.

Kafka has some limitations, though:

It is quite complicated and has a steep learning curve, requiring some expertise for setup, scaling, and maintenance. It can be quite resource intensive, requiring substantial hardware and operational investment, which might not be ideal for small startups. It is also not suitable for ultra-low-latency applications like high-frequency trading, where microseconds matter.

Kafka is fast, optimized for high throughput: it is designed to move a large number of records in a short amount of time. Think of it as a very large pipe moving liquid; the bigger the diameter of the pipe, the larger the volume of liquid that can move through it. "Kafka is fast" refers to its ability to move a lot of data efficiently.

Efficient Design

The first design choice is Kafka's reliance on sequential I/O. There are two types of disk access patterns: random and sequential. For hard drives, it takes time to physically move the arm to a different location on the magnetic disk, and this is what makes random access slow. For sequential access, since the arm doesn't need to jump around, it is much faster to read and write blocks of data one after the other. Kafka takes advantage of this by using an append-only log as its primary data structure. An append-only log adds new data to the end of the file, so the access pattern is sequential.
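
As a toy illustration (not Kafka's actual implementation), an append-only log can be sketched as a file that is only ever written at its end, with the position at which each record starts acting as its offset:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Toy append-only log: every write goes to the end of the file (sequential I/O),
// and the byte position where a record starts serves as its "offset".
public class AppendOnlyLog {
    private final FileChannel channel;

    public AppendOnlyLog(Path file) throws IOException {
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Appends a record and returns the offset at which it was written.
    public long append(String record) throws IOException {
        long offset = channel.size();
        channel.write(ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8)));
        return offset;
    }
}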

On modern hardware with an array of these hard disks, sequential writes reach hundreds of megabytes per second, while random writes are measured in hundreds of kilobytes per second. Using hard disks also has a cost advantage compared to SSDs: hard disks come at about one third of the price with roughly three times the capacity. Giving Kafka a large pool of cheap disk space without a performance penalty means that Kafka can retain messages for a long period of time, a feature that was uncommon in messaging systems before Kafka.

The second design choice that gives Kafka its performance advantage is its focus on efficiency. Kafka moves a lot of data from network to disk and from disk to network, so it is critically important to eliminate excess copying when moving pages and pages of data between the disk and the network. This is where the zero-copy principle comes into the picture. Modern Unix operating systems are highly optimized to transfer data from disk to network without copying data excessively. Let's dig deeper to see how it's done.

First, look at how Kafka sends a page of data on disk to the consumer when zero copy is not used at all:

1. The data is loaded from the disk into the OS cache.
2. The data is copied from the OS cache into the Kafka application.
3. The data is copied from Kafka into the socket buffer.
4. The data is copied from the socket buffer into the network interface card (NIC) buffer, and finally the data is sent over the network to the consumer.

This is clearly inefficient: there are four copies and two system calls.

Now let's compare this to zero copy.

The first step is the same: the data page is loaded from disk into the OS cache. With zero copy, the Kafka application uses a system call called sendfile() to tell the OS to copy the data directly from the OS cache to the network interface card buffer. In this optimized path, the only copy is from the OS cache to the network card. With a modern network card, this copying is done with DMA (direct memory access). When DMA is used, the CPU is not involved, making it even more efficient.
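
In Java, zero-copy transfer is exposed through FileChannel.transferTo(), which can delegate to sendfile() on operating systems that support it. A minimal sketch, with a hypothetical segment file name and destination address:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        Path segment = Path.of("00000000000000000000.log");  // hypothetical log segment file
        try (FileChannel file = FileChannel.open(segment, StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))) {
            long position = 0;
            long remaining = file.size();
            // transferTo() asks the OS to move bytes from the file straight to the socket;
            // on Linux this typically uses sendfile(), avoiding copies through user space.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}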

To recap, sequential I/O and the zero-copy principle are the cornerstones of Kafka's high performance. Kafka uses other techniques to squeeze every ounce of performance out of modern hardware, but these two are the most important.


What are the elements of Kafka?

The most important elements of Kafka are as follows:

Topic: a category or feed of similar kinds of messages.

Producer: used to publish messages to a topic.

Consumer: subscribes to a variety of topics and pulls data from brokers.

Broker: the server where the published messages are stored.


What role does ZooKeeper play in a cluster of Kafka?

Apache ZooKeeper acts as a distributed, open-source configuration and synchronization service, along with being a naming registry for distributed applications. It keeps track of the status of the Kafka cluster nodes, as well as of Kafka topics, partitions, etc.

Since the data is divided across collections of nodes within ZooKeeper, it exhibits high availability and consistency. When a node fails, ZooKeeper performs an instant failover migration.

ZooKeeper is used in Kafka for managing service discovery for Kafka brokers, which form the cluster. ZooKeeper communicates with Kafka when a new broker joins, when a broker dies, when a topic gets removed, or when a topic is added so that each node in the cluster knows about these changes. Thus, it provides an in-sync view of the Kafka cluster configuration.

Can Kafka be utilized without ZooKeeper?

Traditionally, it was not possible to use Kafka without ZooKeeper: a client cannot bypass ZooKeeper and connect directly to the Kafka server, and if ZooKeeper is down for any reason, we will not be able to serve clients' requests. (Newer Kafka versions can run without ZooKeeper using KRaft mode, in which the brokers manage cluster metadata themselves.)

Elaborate on the architecture of Kafka.

In Kafka, a cluster contains multiple brokers since it is a distributed system. A topic is divided into multiple partitions, and each broker stores one or more of those partitions so that multiple producers and consumers can publish and retrieve messages at the same time.

Why is Kafka technology significant to use?

Kafka, being a distributed publish–subscribe system, has the following advantages:

Fast: Kafka comprises a broker, and a single broker can serve thousands of clients by handling megabytes of reads and writes per second.

Scalable: Data is partitioned and distributed over a cluster of machines, enabling volumes of data larger than a single machine can handle.

Durable: Messages are persistent and are replicated within the cluster to prevent data loss.

Distributed by design: It provides fault-tolerance and robustness.




Kafka Features and Use Cases

Apache Kafka is an open-source distributed event streaming platform used for building real-time data pipelines and streaming applications. It is designed to handle high-throughput, low-latency data processing and provides features for pub/sub messaging, data storage, and stream processing. Here are some key features and use cases of Kafka:


1. **Pub/Sub Messaging**:

   - Kafka acts as a distributed publish-subscribe messaging system where producers publish messages to topics, and consumers subscribe to topics to receive messages.

   - Use cases: Kafka can be used for real-time data ingestion, event-driven architectures, log aggregation, and decoupling of producers and consumers in distributed systems.


2. **Data Integration and ETL**:

   - Kafka provides reliable data transport and storage capabilities, making it suitable for building data integration and ETL (Extract, Transform, Load) pipelines.

   - Use cases: Kafka can be used to stream data from various sources (e.g., databases, IoT devices, logs) into data lakes, data warehouses, or analytics platforms for processing and analysis.


3. **Real-time Stream Processing**:

   - Kafka Streams, a lightweight stream processing library built on top of Kafka, allows developers to perform real-time processing of data streams (a minimal sketch appears after this list).

   - Use cases: Kafka Streams can be used for real-time analytics, anomaly detection, fraud detection, monitoring, and alerting applications.


4. **Event Sourcing and CQRS** (Command Query Responsibility Segregation):

   - Kafka's append-only log structure makes it suitable for implementing event sourcing patterns, where changes to application state are captured as a series of immutable events.

   - Use cases: Kafka can be used for implementing event-driven architectures, event sourcing, and CQRS patterns in microservices and distributed systems.


5. **Log Aggregation and Monitoring**:

   - Kafka can be used to collect, aggregate, and store logs and metrics from various sources, making it a central hub for monitoring and observability.

   - Use cases: Kafka can be integrated with logging and monitoring systems (e.g., ELK stack, Prometheus) to collect, analyze, and visualize logs and metrics in real-time.


6. **IoT Data Ingestion**:

   - Kafka's high-throughput, low-latency messaging capabilities make it suitable for handling large volumes of data generated by IoT devices.

   - Use cases: Kafka can be used to ingest, process, and analyze sensor data, telemetry data, and event streams from IoT devices in real-time.


7. **Data Replication and High Availability**:

   - Kafka supports data replication and fault tolerance mechanisms to ensure high availability and data durability.

   - Use cases: Kafka can be deployed in distributed and multi-datacenter environments to replicate data across multiple brokers, ensuring fault tolerance and disaster recovery.
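
As a minimal Kafka Streams sketch for point 3 above (the application id and the clicks/purchases topic names are hypothetical), a topology that filters one stream into another:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "clickstream-filter");  // hypothetical application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> clicks = builder.stream("clicks");             // hypothetical input topic
        clicks.filter((key, value) -> value.contains("purchase"))              // keep only purchase events
              .to("purchases");                                                // hypothetical output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}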


Overall, Kafka is a versatile and scalable platform that can be used for a wide range of use cases, including real-time data processing, event-driven architectures, log aggregation, monitoring, and IoT data ingestion. Its distributed and fault-tolerant architecture makes it well-suited for building mission-critical, high-performance streaming applications.
