Every good story has a beginning, and since beginnings are arbitrary (one thing leads to another), this story may as well start with a Kafka production incident. The root cause was innocuous and easily fixed (a failed network switch), but one by one the dominoes fell over: an old operating system with a bug at the network layer, a long un-upgraded broker cluster with a bug that had been fixed years earlier, an old version of a library that had since received fixes for this exact situation… You get the pattern. All things that likely wouldn’t have happened with good operational hygiene.
But then the main sending application was affected: it was accepting unconstrained traffic from the outside world, and with the Kafka cluster unreachable it couldn’t send, so it eventually ran out of memory and stalled. Once the cluster situation was stabilised, the customer weighed up their options and made a hard decision: they restarted the service, losing the unsent data held in application memory and a not-insignificant amount of money.
We spent a couple of days looking in detail at the application architecture and found two fundamental problems:
1. The designers had not factored in that the broker cluster might be unreachable for an extended period, and that the application would still have to deal with inbound traffic somehow in those situations.
2. They were using the librdkafka client library (used from C, C++, .NET, and many other language environments) in a way it wasn’t designed to be used. There were gaps in error handling because corner cases hadn’t been fully considered, and the application was inadvertently creating feedback loops that made the data backlog worse (see the sketch after this list for the kind of produce-path handling the library expects).
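To give a flavour of the second point: this is not the code from the incident, nor the pattern from the deep-dive post, just a minimal sketch of the produce-side handling librdkafka asks of you. The broker address and topic name are placeholders.

```c
#include <stdio.h>
#include <string.h>
#include <librdkafka/rdkafka.h>

/* Delivery report callback: for an asynchronous producer, this is where
 * you learn whether a message actually reached the broker. */
static void dr_msg_cb(rd_kafka_t *rk, const rd_kafka_message_t *msg, void *opaque) {
    if (msg->err)
        fprintf(stderr, "Delivery failed: %s\n", rd_kafka_err2str(msg->err));
}

int main(void) {
    char errstr[512];

    rd_kafka_conf_t *conf = rd_kafka_conf_new();
    rd_kafka_conf_set(conf, "bootstrap.servers", "broker:9092", /* placeholder */
                      errstr, sizeof(errstr));
    rd_kafka_conf_set_dr_msg_cb(conf, dr_msg_cb);

    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
    if (!rk) {
        fprintf(stderr, "Failed to create producer: %s\n", errstr);
        return 1;
    }

    const char *payload = "example message";
    rd_kafka_resp_err_t err = rd_kafka_producev(
        rk,
        RD_KAFKA_V_TOPIC("example-topic"), /* placeholder */
        RD_KAFKA_V_VALUE((void *)payload, strlen(payload)),
        RD_KAFKA_V_MSGFLAGS(RD_KAFKA_MSG_F_COPY),
        RD_KAFKA_V_END);

    if (err == RD_KAFKA_RESP_ERR__QUEUE_FULL) {
        /* The local send buffer is full – typically because the brokers are
         * unreachable. This is the moment to push back on inbound traffic
         * (or spill to disk), not to retry in a tight loop and make the
         * backlog worse. */
        rd_kafka_poll(rk, 100); /* serve delivery reports, free queue space */
    } else if (err) {
        fprintf(stderr, "Produce failed: %s\n", rd_kafka_err2str(err));
    }

    /* Serve delivery reports regularly; if you never poll, failures are
     * queued internally and never surface to the application. */
    rd_kafka_poll(rk, 0);

    /* Before shutting down, wait for outstanding messages so they are not
     * lost from memory. */
    rd_kafka_flush(rk, 10 * 1000);
    rd_kafka_destroy(rk);
    return 0;
}
```

Even in this toy form there are several moving parts – produce errors, queue limits, delivery reports, flushing on shutdown – and the deep-dive post goes into them, and the corner cases around them, in far more detail.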
Yesterday, a post of mine was published on the Confluent blog that covers the intricacies of how this library works and how it should be used – “A Deep Dive Into Sending With librdkafka”.
It is the second post in a two-part series informed by this incident, alongside “How to Survive an Apache Kafka Outage”. Taken together, the two posts are essential reading for anyone working on high-performance systems that have limited control over their inbound data rate and must cope with Kafka service interruptions.
Many thanks to Magnus Edenhill for helping me understand and articulate the behaviour of this library.
If you’re interested in the Byzantine failure modes of distributed systems, I highly recommend reading “The Network is Reliable” – it will make you question your professional life choices.