Kafka is a distributed system that runs on a cluster with many computers. Producers are the programs that feeds kafka brokers. A broker is a kafka server which stores/keeps/maintains incoming messages in files with offsets. Each message is stored in a file with an index , actually this index is an offset. Consumers are the programs which consumes the given data with offsets. Kafka does not deletes consumed messages with its default settings. Messages stays still persistent for 7 days with default configuration of Kafka.
Topics are the contracts between the producer and consumers. Producer pushes messages to a specific topic and consumers consumes messages from that specific topic. There can be many different topics and Consumer Groups which are consuming data from a specific topic according to your business requirements.
If topic is very large (larger than the storage capacity of a computer) then kafka can use other computers as a cluster node to store that large topic. Those nodes are called as partition in kafka terminology. So each topic may be divided in many partitions (computers) in the Kafka Cluster.
So how kafka make decisions on storing and dividing large topics in to the partitions ? Answer : It does not !
So you should make calculations according to your producer/consumer load, message size and storage capacity. This concept is called as partitioning. Kafka expect you to implement Partitioner interface. Of course there is a default partitioner (SimplePartitioner implements Partitioner).
As a result; A topic is divided into many partitions and each partition has its own OffSet and each consumer reads from different partitions but many consumers can not read from same partition to prevent duplicated reads.
If you need to collect specific messages on the specific partition you should use message keys. Same keys are always collected in the same partition. By the way those messages are consumed by the same Consumer.
Topic partitioning is the key for parallelism in Kafka. On both the producer and the broker side, writes to different partitions can be done fully in parallel. So it is important to have much partitions for a better Performance !
Please see the below use case which shows semantics of a kafka server for a game application's events for some specifics business requirements. All the messages related to a specific game player is collected on the particular partition. UserIds are used as message keys here.
I tried to briefly summarize it... I am planning to add more post about data pipelining with:
Hadoop->Kafka->Spark | Storm etc..