I U+2665 Logs

Just finished reading “I Heart Logs: Event Data, Stream Processing, and Data Integration”, written by Jay Kreps (who developed Apache Kafka).

Summary:

  • One of the key challenges in distributed systems is reasoning about a global order of events (these events could be anything such as clicks on a webpage, transactions, price updates to a stock etc.). Keep in mind that in a distributed system, there is no global physical clock (for a good take on this, see ACMQueue)
  • Logs can be used as a key abstraction in distributed systems, as they provide a journal of what happened in which order
  • Logs have been valuable as a mechanism to replicate data across different databases for a long time
  • Traditional ETL processing using hourly or daily data loads are less effective nowadays due to volume and velocity of data changes
  • Enterprises can leverage Log based data pipeline by publishing all of its data to a central Log system
  • Each application in that case is responsible for publishing data to the central Log pipeline in a canonical data model
  • Downstream systems can subscribe to data feeds from the Log and consume data at their own pace
  • Log based data processing helps to decouple producers and consumers, in the same way as traditional message queue does
  • A system like Kafka, apart from providing a publish/subscribe model can also serve as a very useful buffer between producers and consumers

Comments are closed.