Just finished reading “I Heart Logs: Event Data, Stream Processing, and Data Integration” by Jay Kreps (co-creator of Apache Kafka).
Summary:
- One of the key challenges in distributed systems is reasoning about a global order of events (these events could be anything: clicks on a webpage, transactions, price updates to a stock, and so on). Keep in mind that in a distributed system there is no global physical clock (for a good take on this, see ACM Queue)
- Logs can serve as a key abstraction in distributed systems because they provide a journal of what happened and in what order (a toy sketch of this idea follows this list)
- Logs have long been valuable as a mechanism for replicating data across databases
- Traditional ETL processing using hourly or daily data loads is less effective nowadays due to the volume and velocity of data changes
- Enterprises can build a Log-based data pipeline by publishing all of their data to a central Log system (see the producer sketch after this list)
- Each application is then responsible for publishing its data to the central Log in a canonical data model
- Downstream systems can subscribe to data feeds from the Log and consume data at their own pace (see the consumer sketch after this list)
- Log-based data processing decouples producers and consumers, much like a traditional message queue does
- A system like Kafka, apart from providing a publish/subscribe model, can also serve as a very useful buffer between producers and consumers
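
To make the log abstraction concrete, here is a toy sketch in Java (my own illustration, not code from the book or from Kafka's internals): an append-only sequence where each entry gets a monotonically increasing offset, and that offset is the order everyone agrees on.

```java
import java.util.ArrayList;
import java.util.List;

// A toy append-only log: entries are immutable once written, and the
// offset (position in the sequence) gives a total order of events.
// Illustrative only, not how Kafka is actually implemented.
public class ToyLog<T> {
    private final List<T> entries = new ArrayList<>();

    // Append an entry and return its offset, i.e. its place in the order.
    public synchronized long append(T entry) {
        entries.add(entry);
        return entries.size() - 1;
    }

    // Readers keep their own offset and read forward at their own pace.
    public synchronized T read(long offset) {
        return entries.get((int) offset);
    }

    // The next offset to be written; readers catch up toward this.
    public synchronized long endOffset() {
        return entries.size();
    }
}
```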
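
Publishing to such a central log with Kafka's Java producer client might look roughly like this; the broker address, the topic name user-events, and the plain-string payload are placeholders I made up for illustration (a real pipeline would typically use Avro or JSON for the canonical model):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event to the central log; keyed records with the
            // same key land in the same partition, preserving their order.
            producer.send(new ProducerRecord<>("user-events", "user-42", "page_view"));
        }
    }
}
```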
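
On the consuming side, a sketch of a downstream subscriber; the group id reporting-service and topic name are again hypothetical. Because each consumer group tracks its own offset in the log, a slow consumer does not back-pressure the producer, which is the buffering the last bullet refers to:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EventSubscriber {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "reporting-service");       // hypothetical group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-events"));
            while (true) {
                // Poll at this consumer's own pace; the log retains the
                // data, so falling behind just means a larger offset lag.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```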