Apache Kafka is a distributed streaming platform used to build real-time streaming data pipelines that reliably move data between systems and applications. Kafka runs as a cluster on one or more servers. Kafka has four core APIs: the Producer API, Consumer API, Streams API, and Connector API.
Topic – A particular stream of data
- Similar to a table in a database
- You can create as many topics as required
- A topic is identified by its name.
- A topic should have a replication factor greater than 1. This helps when one of the brokers goes down: the same partition of the topic is replicated on another broker, so it is still available.
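To make the replication point concrete, here is a toy check (an illustrative sketch, not Kafka's actual placement algorithm) showing that with replication factor 2 across 3 brokers, every partition survives the loss of any single broker, whereas replication factor 1 does not. Broker IDs and the round-robin placement are assumptions for the example.

```python
# Toy model: with replication factor (rf) 2 across 3 brokers, every
# partition survives the loss of any single broker. Round-robin
# placement and broker IDs are illustrative assumptions.

def place_replicas(num_partitions, brokers, rf):
    """Return {partition: set of broker ids holding a copy}."""
    return {p: {brokers[(p + i) % len(brokers)] for i in range(rf)}
            for p in range(num_partitions)}

def survives(placement, dead_broker):
    """True if every partition still has at least one live copy."""
    return all(replicas - {dead_broker} for replicas in placement.values())

placement = place_replicas(num_partitions=3, brokers=[101, 102, 103], rf=2)
assert all(survives(placement, b) for b in [101, 102, 103])  # no partition lost
```

With `rf=1` the same check fails: whichever broker dies takes its partitions with it.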
Partitions –
- Topics are split into partitions.
- Each partition is ordered, and its numbering starts at 0.
- Each message within a partition gets an incremental id called an offset.
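The partition/offset model above can be sketched as a minimal in-memory structure (illustrative only, not Kafka's real storage format): a topic is a set of append-only logs, and each appended message gets the next incremental offset in its partition. The topic name and messages are made up for the example.

```python
# Minimal in-memory sketch: a topic is a set of append-only partitions,
# and each appended message gets the next incremental offset.
# (Illustrative only -- not the real Kafka storage format.)

class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        # each partition is an append-only list; the list index is the offset
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Append a message to a partition and return its offset."""
        self.partitions[partition].append(message)
        return len(self.partitions[partition]) - 1

topic = Topic("trucks_gps", num_partitions=3)
topic.append(0, "truck-1 at (48.8, 2.3)")    # offset 0 in partition 0
topic.append(0, "truck-1 at (48.9, 2.4)")    # offset 1 in partition 0
topic.append(1, "truck-2 at (40.7, -74.0)")  # offset 0 -- offsets are per partition
```

Note that the third append returns offset 0 again: offsets count up independently within each partition.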
Offset –
- An offset only has meaning within a specific partition.
- Offset 5 in partition 1 does not represent the same data as offset 5 in partition 6.
- Order is guaranteed only within one partition, not across partitions.
- Data is kept only for a limited time (the default retention period is one week).
- Once data is written to a partition, it cannot be changed (immutability).
- Data is assigned to a partition in round-robin fashion unless a key is provided.
Brokers –
- A Kafka cluster is composed of multiple brokers; each broker is a server.
- Each broker is identified by an ID.
- Each broker contains only certain topic partitions, because the data is distributed.
- After connecting to any broker (called a bootstrap broker), you are connected to the entire cluster.
- All the partitions of a topic are distributed across the brokers.
- At any time, only one broker acts as the leader for a given partition, and only that leader can receive and serve data for it. The other brokers synchronize the data. Therefore, each partition has one leader and multiple in-sync replicas (ISR).
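A toy sketch of how a topic's partitions might be spread across brokers, each partition with one leader and the remaining replicas in sync. The broker IDs and the round-robin placement are illustrative assumptions, not Kafka's actual assignment algorithm.

```python
# Toy sketch: spread a topic's partitions across brokers, giving each
# partition one leader and the remaining replicas as in-sync replicas (ISR).
# Round-robin placement and broker IDs are illustrative assumptions.

def assign_partitions(brokers, num_partitions, replication_factor):
    """Return {partition: {"leader": broker_id, "isr": [broker_ids]}}."""
    assignment = {}
    for p in range(num_partitions):
        # pick `replication_factor` consecutive brokers, rotating the start
        replicas = [brokers[(p + i) % len(brokers)]
                    for i in range(replication_factor)]
        assignment[p] = {"leader": replicas[0], "isr": replicas[1:]}
    return assignment

layout = assign_partitions(brokers=[101, 102, 103],
                           num_partitions=3, replication_factor=2)
# each partition gets a different leader, so load is spread across brokers
```

If a leader's broker dies, one of the in-sync replicas can be promoted to leader, which is why replication factor > 1 matters.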
Producers –
- Producers write data to topics which are made of partitions.
- Producers automatically know to which broker and partition to write.
- In case of broker failures, producers will automatically recover.
- Producers can choose to receive an acknowledgement of data writes:
- acks=0: the producer does not wait for acknowledgement (possible data loss).
- acks=1: the producer waits for the leader's acknowledgement (limited data loss).
- acks=all: the producer waits for acknowledgement from the leader and all in-sync replicas (no data loss).
- Producers can choose to send a key with a message. If the key is null, data is sent round-robin across partitions. If a key is sent, all messages for that key always go to the same partition. Send a key when you need ordering for messages with a specific field.
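The key-to-partition rule above can be sketched as follows. This is a simplified stand-in: real Kafka producers use murmur2 hashing, whereas the sketch uses CRC32; the function name and message keys are made up for the example. The important property is the same: a null key spreads messages round-robin, and a non-null key always maps to the same partition.

```python
# Sketch of producer-side partition selection: null key -> round-robin,
# non-null key -> hash(key) % num_partitions, so the same key always
# lands on the same partition. Real Kafka uses murmur2 hashing; CRC32
# here is an illustrative stand-in.

import zlib
from itertools import count

_round_robin = count()  # shared counter for keyless messages

def choose_partition(key, num_partitions):
    if key is None:
        # spread keyless messages evenly across partitions
        return next(_round_robin) % num_partitions
    # deterministic hash: identical keys always map to the same partition
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# the same key always yields the same partition, preserving per-key ordering
assert choose_partition("truck-1", 3) == choose_partition("truck-1", 3)
```

This is why keying by, say, a truck ID guarantees that all messages for that truck are read in order: they all land in one partition.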
Consumers –
- Consumers read data from a topic, which is identified by its name.
- Consumers automatically know which broker to read from.
- In case of broker failures, consumers know how to recover and consume messages from the replicas.
- Data is read in order within each partition of the topic.
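A small sketch of that last point: a consumer reading two partitions sees each partition's messages in offset order, while the interleaving between partitions is arbitrary. The partition contents and round-robin read order are made up for the example.

```python
# Sketch: order is guaranteed within a partition but not across partitions.
# A consumer reading two partitions sees each partition's messages in
# offset order; the interleaving between partitions is arbitrary.

partitions = {
    0: ["a0", "a1", "a2"],  # partition 0, offsets 0..2
    1: ["b0", "b1"],        # partition 1, offsets 0..1
}

def read_all(partitions):
    """Read round-robin over partitions; per-partition order is preserved."""
    iters = {p: iter(msgs) for p, msgs in partitions.items()}
    out = []
    while iters:
        for p in list(iters):
            try:
                out.append(next(iters[p]))
            except StopIteration:
                del iters[p]  # this partition is exhausted
    return out

msgs = read_all(partitions)
# "a0" always precedes "a1" and "a2"; "b0" always precedes "b1"
```

Messages from the two partitions interleave, but within each partition the offset order is never violated.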
Consumer Groups –
- Consumers read data in consumer groups.
- Each consumer within a group reads from an exclusive set of partitions.
- If there are more consumers than partitions, some consumers will be idle.
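A sketch of consumer-group assignment (a simplified round-robin version; Kafka's actual assignors are configurable): partitions are divided among the consumers of a group, and any consumer beyond the partition count is left with nothing to read. The consumer names are made up for the example.

```python
# Sketch of consumer-group partition assignment: partitions are divided
# among the group's consumers; extra consumers get no partitions (idle).
# Round-robin assignment assumed; Kafka's actual assignors are configurable.

def assign(consumers, partitions):
    """Return {consumer: [partitions]}; extra consumers get an empty list."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 4 consumers -> one consumer is idle
groups = assign(["c1", "c2", "c3", "c4"], [0, 1, 2])
# groups == {"c1": [0], "c2": [1], "c3": [2], "c4": []}
```

With fewer consumers than partitions, the same function gives some consumers several partitions each, which is how a group scales reads.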
Consumer Offsets –
- Kafka stores the offsets at which a consumer group has been reading.
- The committed offsets live in a Kafka topic named __consumer_offsets.
- When a consumer in a group has processed the data received from Kafka, it commits the offsets.
- If a consumer dies, it can resume reading from where it left off, thanks to the committed offsets.
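The commit-and-resume behaviour can be sketched as follows. The dict stands in for the __consumer_offsets topic; the group name, log contents, and helper names are assumptions for the example.

```python
# Sketch of offset commits: the "broker" remembers the last committed
# offset per (group, partition); a restarted consumer resumes from there.
# The dict below stands in for the __consumer_offsets topic.

committed = {}  # (group, partition) -> next offset to read

def commit(group, partition, offset):
    committed[(group, partition)] = offset

def resume_position(group, partition):
    """Where a (possibly restarted) consumer should start reading."""
    return committed.get((group, partition), 0)

log = ["m0", "m1", "m2", "m3"]

# the consumer processes the first two messages, commits, then "dies"
for off in range(2):
    _ = log[off]
commit("group-a", 0, 2)

# after a restart, it picks up at offset 2, not at the beginning
assert resume_position("group-a", 0) == 2
assert log[resume_position("group-a", 0)] == "m2"
```

A group that has never committed starts at the default position (offset 0 in this sketch; in real Kafka that default is configurable).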