Kafka

Apache Kafka is a distributed streaming platform used to build real-time streaming data pipelines that reliably move data between systems and applications. Kafka runs as a cluster on one or more servers. Kafka has four core APIs: the Producer API, Consumer API, Streams API, and Connector API.

Topic – A particular stream of data

  1. Similar to a table in a database.
  2. You can create as many topics as required.
  3. A topic is identified by its name.
  4. A topic should have a replication factor greater than 1, so that when one of the brokers goes down, another broker holding a replica of the same partition can serve the data (see the sketch after this list).
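
As a minimal sketch, a topic with a chosen partition count and replication factor can be created programmatically with the Java AdminClient. The topic name `orders` and the `localhost:9092` bootstrap address are placeholder assumptions:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder address: any reachable broker works, since connecting
        // to one broker connects you to the whole cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2 (requires at least 2 brokers).
            NewTopic topic = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```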

Partitions –

  1. Topics are split into partitions.
  2. Partitions are numbered starting from 0, and messages within each partition are ordered.
  3. Each message within a partition gets an incremental id called an offset.

Offset –

  1. An offset only has meaning within a specific partition; the sketch after this list prints the partition and offset each produced message receives.
  2. Offset 5 in partition 1 does not represent the same data as offset 5 in partition 6.
  3. Order is guaranteed only within one partition, not across partitions.
  4. Data is kept only for a limited retention period (one week by default).
  5. Once data is written to a partition, it cannot be changed (the log is immutable).
  6. Data is assigned to a partition in round-robin fashion unless a key is provided (see Producers below).
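
To make offsets concrete, here is a minimal sketch: the producer's `send()` returns metadata reporting which partition a record landed in and the offset it received there. The topic name `orders` and the bootstrap address are assumptions carried over from the earlier example:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OffsetMetadataExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 5; i++) {
                // No key, so records are spread across partitions;
                // each record gets the next offset within its partition.
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("orders", "message-" + i);
                RecordMetadata meta = producer.send(record).get();
                System.out.printf("partition=%d offset=%d%n",
                        meta.partition(), meta.offset());
            }
        }
    }
}
```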

Brokers –

  1. A Kafka cluster is composed of multiple brokers; each broker is a server.
  2. Each broker is identified by an integer ID.
  3. Each broker holds only certain partitions of a topic.
  4. Connecting to any one broker (a bootstrap broker) connects you to the entire cluster.
  5. The partitions of a topic are distributed across the brokers.
  6. At any time, only one broker acts as the leader for a given partition, and only that leader receives and serves the data; the other brokers holding that partition synchronize the data from the leader. Each partition therefore has one leader and multiple in-sync replicas (see the sketch after this list).
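
As a sketch of this leader/replica layout, the AdminClient can describe a topic and report, for each partition, its leader broker, its replicas, and its in-sync replicas. The topic name `orders` is again an assumption:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.List;
import java.util.Properties;

public class DescribeTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("orders"))
                    .allTopicNames().get().get("orders");
            for (TopicPartitionInfo p : desc.partitions()) {
                // One leader per partition; the other replicas stay in sync.
                System.out.printf("partition=%d leader=%d replicas=%s isr=%s%n",
                        p.partition(), p.leader().id(), p.replicas(), p.isr());
            }
        }
    }
}
```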

Producers –

  1. Producers write data to topics, which are made of partitions.
  2. Producers automatically know which broker and partition to write to.
  3. In case of broker failures, producers automatically recover.
  4. Producers can choose to receive an acknowledgement of data writes:
    1. acks=0: the producer does not wait for an acknowledgement, so data can be lost.
    2. acks=1: the producer waits for the leader's acknowledgement, which limits data loss.
    3. acks=all: the leader and the in-sync replicas all acknowledge the write, so there is no data loss.
  5. Producers can choose to send a key with a message. If the key is null, data is sent to partitions in round-robin fashion; if a key is provided, all messages with that key go to the same partition. Send a key when you need ordering for the messages sharing a specific field (see the sketch after this list).
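
A minimal producer sketch, assuming the `orders` topic from above, configured with `acks=all` and a key so that all messages for one key (here a hypothetical `customerId`) land in the same partition in order:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Wait for the leader and all in-sync replicas to acknowledge.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String customerId = "customer-42"; // hypothetical key
            for (int i = 0; i < 3; i++) {
                // Same key => same partition => ordering preserved for this customer.
                producer.send(new ProducerRecord<>("orders", customerId, "order-" + i));
            }
            producer.flush();
        }
    }
}
```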

Consumers –

  1. Consumers read data from a topic, which is identified by its name.
  2. Consumers automatically know which broker to read from.
  3. In case of broker failures, consumers know how to fail over to the replicas and keep consuming.
  4. Within each partition of the topic, data is read in order (see the sketch after this list).
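
A minimal consumer sketch, assuming the `orders` topic and a hypothetical group id `order-processors`; records from the same partition arrive in offset order:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class SimpleConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors"); // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Offsets increase monotonically within each partition.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```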

Consumer Groups –

  1. Consumers read data as part of consumer groups.
  2. Each consumer in a group reads from an exclusive subset of the topic's partitions.
  3. If there are more consumers than partitions, the extra consumers sit idle (see the sketch after this list).
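
To watch partitions being divided across a group, a `ConsumerRebalanceListener` can be registered on `subscribe()`. Starting a second consumer with the same (hypothetical) group id triggers a rebalance, and each instance prints the partitions it now owns:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;

public class RebalanceExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    System.out.println("Revoked: " + partitions);
                }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // With one consumer this is every partition; with more
                    // consumers in the group, each gets a disjoint subset.
                    System.out.println("Assigned: " + partitions);
                }
            });
            while (true) {
                consumer.poll(Duration.ofMillis(500)); // poll drives the rebalance protocol
            }
        }
    }
}
```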

Consumer Offsets –

  1. Kafka stores the offsets at which a consumer group has been reading.
  2. The committed offsets live in a Kafka topic named __consumer_offsets.
  3. Whenever a consumer group has processed the data received from Kafka, it commits the offsets.
  4. If a consumer dies, it can resume reading from where it left off (see the sketch after this list).
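
A sketch of this commit cycle, under the same assumed topic and group id: auto-commit is disabled and the consumer commits offsets only after it has processed a batch, so a restarted consumer resumes from the last committed offset:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ManualCommitExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Commit offsets only after processing, not on a background timer.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // application logic goes here
                }
                // Synchronously write the new offsets to __consumer_offsets.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("processing offset %d from partition %d%n",
                record.offset(), record.partition());
    }
}
```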