Kafka Topics, Partitions and Offsets Explained

Stephane Maarek

Overview

This video introduces the fundamental concepts of Kafka: topics, partitions, and offsets. Topics serve as categories for data streams, analogous to tables in a database. Each topic is divided into partitions, which are ordered, append-only logs. Within each partition, messages are assigned a unique, incremental ID called an offset. The video explains how these components work together to manage and organize data streams, emphasizing that order and offset meaning are guaranteed only within a partition, not across them. It also touches upon data retention, immutability, and how messages are distributed to partitions.

Chapters

A Kafka topic is the primary way to categorize a stream of data, similar to a table in a relational database.
Topics are identified by a name, and you can have multiple topics within a Kafka system.
Topics are the fundamental unit for organizing data streams in Kafka.

Understanding topics is crucial because they are the main organizational structure for all data flowing through Kafka, acting as the central point for producers and consumers.

A topic named 'trucks_gps' to store location data from all trucks.

Topics are split into partitions, which are concrete, ordered logs.
When creating a topic, you must specify the number of partitions. The count can be increased later, but it cannot be decreased.
Each partition is assigned a sequential number starting from zero.

Partitions allow Kafka to scale horizontally by distributing a topic's data across multiple machines, enabling higher throughput and parallel consumption.

A topic can be created with three partitions, numbered 0, 1, and 2.
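
As a rough illustration of the topic and partition layout described in this chapter, here is a minimal in-memory sketch. `ToyTopic` is a hypothetical class invented for this example and is not part of any Kafka client; it only models a named topic holding a fixed number of empty logs numbered from zero.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTopic:
    """Hypothetical in-memory stand-in for a topic; not a Kafka client class."""
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        if self.num_partitions < 1:
            raise ValueError("a topic needs at least one partition")
        # One empty, ordered log per partition; the list index is the partition number.
        self.partitions = [[] for _ in range(self.num_partitions)]


topic = ToyTopic(name="trucks_gps", num_partitions=3)
partition_ids = list(range(len(topic.partitions)))
print(topic.name, partition_ids)
```

A real topic is created on the cluster rather than in application memory, so treat this only as a picture of the numbering scheme.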

Within each partition, messages are assigned an incremental, ordered ID called an offset.
Offsets start at 0 for the first message in a partition and increase sequentially.
An offset only has meaning within the context of a specific partition; offset 0 in partition 0 is different from offset 0 in partition 1.

Offsets provide a unique identifier for each message within a partition, enabling consumers to track their progress and re-read messages if necessary.

The first message in partition 0 has offset 0, the next has offset 1, and so on.
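
The offset rules above can be sketched with a toy append-only log. `ToyPartition` is a hypothetical name used only for this illustration, under the assumption that a partition behaves like a list whose index is the offset. It shows that each partition counts from 0 independently, so the same offset refers to different messages in different partitions.

```python
class ToyPartition:
    """Hypothetical append-only log; the list index plays the role of the offset."""

    def __init__(self):
        self._log = []

    def append(self, message):
        # The next offset is simply the current length of the log.
        offset = len(self._log)
        self._log.append(message)
        return offset

    def read(self, offset):
        return self._log[offset]


p0, p1 = ToyPartition(), ToyPartition()
offsets_p0 = [p0.append(m) for m in ["a", "b", "c"]]
offsets_p1 = [p1.append(m) for m in ["x"]]

print(offsets_p0, offsets_p1)
print(p0.read(0), p1.read(0))  # same offset, different partitions, different messages
```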

Order is guaranteed only within a single partition, not across different partitions of the same topic.
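Without a 'key' specified for a message, the producer chooses the partition itself and spreads messages across the available partitions, so the application cannot predict which partition a given message lands in.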
The offset value of a message is only meaningful in conjunction with its partition number.

This understanding is vital for designing consumers that can correctly process messages, especially when dealing with distributed systems where cross-partition ordering is not guaranteed.

You can guarantee that offset 8 in partition 0 was written after offset 7 in partition 0, but offsets alone say nothing about whether a message in partition 0 was written before or after a message in partition 1.
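
To make the role of the key concrete, here is a small simulation of partition selection. Everything in it is an assumption made for illustration: `choose_partition` and `send` are hypothetical helpers, `zlib.crc32` stands in for whatever hash a real producer uses, and unkeyed messages are simply rotated across partitions, which is a simplification. The point it shows is that all messages with the same key land in one partition and therefore keep their relative order.

```python
import itertools
import zlib

NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]
_rotation = itertools.cycle(range(NUM_PARTITIONS))


def choose_partition(key):
    """Hypothetical partitioner: hash the key, or rotate when there is no key."""
    if key is None:
        return next(_rotation)
    # crc32 is a stand-in hash chosen for this sketch only.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def send(key, value):
    """Append to the chosen partition and return (partition, offset)."""
    pid = choose_partition(key)
    partitions[pid].append((key, value))
    return pid, len(partitions[pid]) - 1


keyed = [send("truck_1", f"pos_{i}") for i in range(4)]
truck_partitions = {pid for pid, _ in keyed}
truck_offsets = [offset for _, offset in keyed]

unkeyed_partitions = [send(None, f"event_{i}")[0] for i in range(3)]

print(truck_partitions, truck_offsets, unkeyed_partitions)
```

Because every `truck_1` message goes to the same partition, a consumer reading that partition sees the positions in the order they were sent. The unkeyed events are scattered, so their order relative to each other is not preserved across partitions.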

Data in Kafka topics is immutable; once a message is written to a partition, it cannot be changed.
Data is retained for a limited time, with a default retention period of one week, after which it is deleted.
Offsets continue to increment even after the associated data has been deleted.

These characteristics define how data is stored and accessed, determining how long data remains available and ensuring that messages are never altered once published.

Even if messages from a week ago are deleted, their offsets are not reused; new messages continue from the next offset in the sequence.
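
The interaction between retention and offsets can be sketched as follows. `ToyRetentionPartition` and its methods are hypothetical, and the sketch drops expired messages one by one based on an explicit timestamp, which is a simplification of how a real broker cleans up its storage. It only illustrates that deleting old data does not reset the offset counter.

```python
ONE_WEEK_SECONDS = 7 * 24 * 3600


class ToyRetentionPartition:
    """Hypothetical log that expires old records but never reuses offsets."""

    def __init__(self):
        self._records = []      # list of (offset, timestamp, value)
        self._next_offset = 0   # keeps growing even when records are deleted

    def append(self, value, timestamp):
        offset = self._next_offset
        self._records.append((offset, timestamp, value))
        self._next_offset += 1
        return offset

    def enforce_retention(self, now, retention_seconds=ONE_WEEK_SECONDS):
        # Keep only records written within the retention window.
        cutoff = now - retention_seconds
        self._records = [r for r in self._records if r[1] >= cutoff]

    def earliest_offset(self):
        # With nothing retained, the earliest readable offset is the next one to be written.
        return self._records[0][0] if self._records else self._next_offset


partition = ToyRetentionPartition()
partition.append("old-0", timestamp=0)
partition.append("old-1", timestamp=10)
partition.append("recent", timestamp=ONE_WEEK_SECONDS + 50)

partition.enforce_retention(now=ONE_WEEK_SECONDS + 100)
earliest_after_cleanup = partition.earliest_offset()
new_offset = partition.append("new", timestamp=ONE_WEEK_SECONDS + 100)

print(earliest_after_cleanup, new_offset)
```

After cleanup the two oldest messages are gone, the earliest readable offset has moved forward, and the next message still receives the next number in the sequence.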

Key takeaways

1. Kafka topics act as named streams of data, analogous to database tables, for organizing information.
2. Topics are divided into partitions to enable parallel processing and scalability.
3. Offsets are sequential, incremental IDs assigned to messages within each partition, serving as unique identifiers within that partition.
4. Message order and offset meaning are guaranteed only within a partition, not across partitions.
5. Data in Kafka is immutable and has a configurable retention period, meaning it's eventually deleted but never modified.
6. Producers can control message distribution to partitions by using keys; if no key is provided, the producer spreads messages across partitions on its own.

Key terms

Topic, Partition, Offset, Stream of data, Immutability, Data retention, Producer, Consumer

Test your understanding

1. What is the primary function of a Kafka topic?
2. How do partitions contribute to the scalability of Kafka?
3. Why is an offset's meaning specific to a partition?
4. What does it mean for data in Kafka to be immutable?
5. How does Kafka handle message ordering across different partitions of the same topic?