In theory, Apache Kafka is able to stream data in "real time." But "real time" is a complicated term. Like "calorie-free," which doesn't necessarily mean that there are literally no calories in food with that label, "real-time" data streams in Kafka don't always move data in actual real time. In some cases, there's a significant delay between the time when a Kafka producer pushes data and when a Kafka consumer receives it.
When that happens, you end up with Kafka consumer lag, which can seriously undercut the performance of your otherwise healthy Kafka cluster and degrade the experience of end-users who depend on the cluster. Slow Kafka consumers turn real-time Kafka data streams into a lie that negates the core purpose of using Kafka in the first place. It's sort of like how drinking diet soda actually correlates with weight gain, which is the opposite of what it's supposed to do.
Now, we're not here to give you diet advice or help you navigate the complex world of food labeling. What we can do is provide guidance on how to monitor Kafka consumer lag and fix Kafka consumer lag problems in order to keep your Kafka clusters doing what they’re supposed to do – stream data in real time, or as close to it as reasonably possible.
What is Kafka consumer lag?
In Apache Kafka, consumer lag is the delay between the moment a producer (which generates messages) writes a message and the moment a consumer (which receives messages) reads it. Put another way, Kafka consumer lag measures how far behind consumers have fallen in a Kafka data stream.
Some amount of lag is inevitable because it will always take some amount of time for data to move between producers and consumers. But in a well-designed, well-managed Kafka cluster, lag should be minimal – typically, just a handful of milliseconds.
Key Kafka consumer lag concepts
In the context of Kafka consumer lag, lag is shorthand for latency. As you may know if you have experience managing networks, latency is a generic term that refers to delays in transmitting information. On a computer network, engineers typically measure latency by tracking how long it takes network traffic to move between its point of origin and its destination. The higher the latency, the worse the performance of the network.
Consumer lag in Kafka is based on the same premise: The longer it takes for messages to move, the worse the performance of your Kafka data stream. For this reason, it is important to monitor Kafka consumer lag as one step in assessing overall Kafka performance and ensuring that you are able to support your intended Kafka use cases.
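In practice, most Kafka tooling reports consumer lag per partition as an offset count rather than a unit of time:

consumer lag = (log-end offset) − (last committed consumer offset)

A lag of 10,000 on a partition means the consumer is 10,000 messages behind the newest message written to that partition. How that number translates into seconds or minutes of delay depends on your throughput.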
Understanding the Kafka consumer architecture
To mitigate Kafka consumer lag, you must first understand the Kafka architecture – because problems with various components inside the architecture often contribute to consumer lag.
As you probably know if you have at least basic familiarity with Kafka, Kafka is a distributed event-streaming platform. To use Kafka, you set up a cluster that includes three main types of components:
- Producers, which are where streaming data originates.
- Consumers, which are the recipients of streaming data.
- Brokers, which manage transactions between producers and consumers so that data can be streamed from the former to the latter – ideally in real time.
To help optimize data streaming, you can organize multiple consumers into Kafka consumer groups. When you do this, the consumers in the group share the work of processing the topics and partitions that Kafka assigns to the group, with each partition going to exactly one consumer at a time. Essentially, Kafka consumer groups make it possible to implement a type of load balancing for Kafka consumers by allowing consumers to divide up the work of handling a given data stream.
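To make the mechanics concrete, here's a minimal sketch of a consumer joining a group using Kafka's Java client. The broker address (localhost:9092), topic name (payments), and group id (payments-consumers) are placeholder assumptions – every consumer started with the same group.id shares the group's partition assignments:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "payments-consumers");      // consumers sharing this id form one group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Kafka assigns each partition of the topic to exactly one consumer in the group.
            consumer.subscribe(List.of("payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Run two copies of this program and Kafka splits the topic's partitions between them; stop one, and its partitions are rebalanced to the survivor.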
What causes Kafka consumer lag?
When all goes well, all consumers and Kafka consumer groups accept and process data rapidly, with minimal delays. In a healthy Kafka cluster, you should typically see no more than a few hundred milliseconds of lag between when a producer pushes data and when a consumer processes it.
But the best laid plans of Kafka consumers, just like those of mice and men, often go awry, leading to slow Kafka consumer performance. There are several common reasons why this happens.
Too many consumers, too little data
In general, configuring multiple consumers within a consumer group and allowing them to share in processing a Kafka topic is a good thing because it decreases the load placed on each individual consumer.
However, within a consumer group, Kafka assigns each partition to exactly one consumer. If you have more consumers assigned to a topic than that topic has partitions, the extra consumers simply sit idle – and every rebalance forces the group to renegotiate partition assignments, adding coordination overhead without any gain in throughput.
The solution here is simple: When deciding how many consumers to include in a consumer group, consider how many partitions you have in the topic you'll assign to that group and avoid creating more consumers than partitions.
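If you're not sure how many partitions a topic has, you can check programmatically before sizing the group. Here's a small sketch using Kafka's Java AdminClient, again with a placeholder broker address and topic name (allTopicNames() requires a reasonably recent client, 3.1 or later):

```java
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class PartitionCount {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Set.of("payments"))
                    .allTopicNames().get().get("payments");
            // Size the consumer group to at most this many consumers;
            // any consumer beyond the partition count will sit idle.
            System.out.println("partitions=" + desc.partitions().size());
        }
    }
}
```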
Partition skew
Along similar lines, you may experience increases in Kafka lag if you don’t distribute partitions evenly across Kafka brokers. This leads to what’s known as partition skew or data skew. When this happens, data processing is not balanced properly across consumers – so even if your total number of producers and consumers should be adequate for the volume of data you want to stream, they may not be able to keep up, because some consumers are overloaded with hot partitions while others sit under-utilized.
To fix this issue, redistribute partitions across your brokers such that all resources are used efficiently.
Consumer code bottlenecks
Although all Kafka consumers do the same basic thing – process streaming data – you can configure exactly how individual consumers do that. If the code that tells your consumers how to behave is poorly designed, you could end up with consumer bottlenecks.
For instance, if your consumer logic requires waiting on one piece of data before another can be processed, the consumer may sit idle while that data arrives instead of using the time to process other messages – a classic processing bottleneck.
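One common way to relieve this kind of bottleneck is to hand slow, per-record work to a worker pool so the poll loop never blocks. The sketch below assumes the same placeholder broker, topic, and group names as earlier; note that asynchronous processing complicates offset commits, which this sketch deliberately glosses over:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NonBlockingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "payments-consumers");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        ExecutorService workers = Executors.newFixedThreadPool(8); // size to your workload

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // Hand slow work (lookups, HTTP calls, etc.) to the worker pool so the
                    // poll loop keeps fetching instead of sitting idle on each record.
                    workers.submit(() -> process(record));
                }
                // Caveat: with async processing you must manage offset commits carefully,
                // e.g. commit only offsets whose records have finished processing.
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // hypothetical slow, per-record work
    }
}
```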
Slow processing logic
Sometimes, the logic within a Kafka consumer leads to lag due to inefficient data processing. This is different from consumer code bottlenecks because the issue isn’t that a consumer has to wait on one event before it can proceed with another operation. It’s that the way it does things is inefficient. For instance, your consumer logic might break a certain process out into multiple steps, whereas combining those steps into a single operation would be faster and more efficient.
If you monitor consumer lag and notice high lag for certain consumers that does not correlate with other obvious causes (such as traffic spikes), slow processing logic within the consumer could be the root cause of the issue.
Since this is a problem that stems from the code you used to create the consumer, the solution is to optimize the consumer’s code.
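As an illustration, compare per-record and batched handling of a poll's worth of records. The Store interface here is hypothetical – it stands in for whatever downstream system your consumer writes to:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

public class BatchingExample {
    // Hypothetical downstream store with per-row and batched writes.
    interface Store {
        void insert(String row);
        void insertAll(List<String> rows);
    }

    // Inefficient: one round trip to the store per record.
    static void perRecord(ConsumerRecords<String, String> records, Store store) {
        for (ConsumerRecord<String, String> record : records) {
            store.insert(record.value());
        }
    }

    // Faster: collect the poll's records and write them in a single batch.
    static void batched(ConsumerRecords<String, String> records, Store store) {
        List<String> rows = new ArrayList<>();
        for (ConsumerRecord<String, String> record : records) {
            rows.add(record.value());
        }
        store.insertAll(rows);
    }
}
```

Collapsing many small round trips into one batched call is often the single biggest win available in consumer processing logic.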
Errors caused by bugs in code
Imperfections in the code within consumers can also trigger errors. For example, a consumer might fail to process a certain type of message due to buggy code, then retry the same message, fail again, and so on.
The effect of this kind of error is that it ties up resources on processing attempts that never succeed. This can contribute to lag because fewer resources are available to handle other messages.
Network and hardware issues
Problems with your infrastructure – including both your network and the servers that host your Kafka cluster – could lead to consumer lag. Flooding the network with too much data may cause some packets to be dropped and re-transmitted, which slows the speed at which they reach consumers. Lack of sufficient memory or CPU on the servers hosting consumers can prevent them from processing data as quickly as they should. Improperly configured memory and CPU limits on containers that host Kafka components may have the same effect.
Sudden spike in incoming traffic
A sudden increase in the volume of messages generated by a Kafka producer may lead to lag. The reason is fairly simple: if the number of messages a producer is spitting out spikes unexpectedly, consumers can’t receive and process all of the data without delay.
This isn’t to say that Kafka can’t handle fluctuations in traffic. It certainly can, with the right planning. But if you anticipate sudden spikes in traffic, you’ll want to ensure that you have enough consumers available to accommodate the traffic volumes, even if they are not consistent.
Impact of the Kafka consumer lag problem
Kafka consumer lag can have a variety of negative effects on your overall cluster – let’s take a closer look at a few of them:
Message backlogs and the consumer lag "vicious cycle"
When a consumer stops processing messages as quickly as producers are generating them, a message backlog begins to build up as more and more messages sit unprocessed. The backlog continues to grow either until consumers are able to catch up or until the message queue is purged (which is typically a bad thing because purges mean you're deleting unprocessed data and could be missing out on important information).
This means that consumer lag can turn into a vicious cycle, where the problem gets worse and worse over time and it grows harder and harder for consumers to catch up.
Decreased system throughput and performance
Consumer lag also erodes overall system throughput and performance over time. The more messages you have in your backlog, the harder your cluster has to work to store those messages and deliver them to consumers, and the more messages you have flooding the network.
Delayed data processing and data loss
Slow consumers have the obvious effect of delaying data processing. If data is supposed to arrive in real time but there is a multi-second delay in how long it takes the consumer to process data, important information may not be received and processed in time to serve its intended purpose. An application that was supposed to make a real-time decision might end up making the decision based on data that is no longer up to date, leading to poor performance and a bad end-user experience.
Even worse, serious consumer lag can eventually lead to data loss in the event that the message backlog grows so large that there is no more room for storing additional messages, or that you decide to purge messages in order to get your consumers caught up.
Poor performance in downstream systems and data pipelines
The impact of slow Kafka consumer performance isn't limited to Kafka itself. It has a cascading effect that can degrade the performance of any downstream applications or data pipelines that depend on Kafka.
For example, imagine that you're a bank that uses Kafka to stream information about payment transactions to an application that analyzes that data to detect fraud. If Kafka consumers don't process the data quickly, the fraud detection engine may not be able to do its job fast enough to detect a fraudulent transaction before it’s complete and a thief has made off with the goods.
How to monitor Kafka consumer lag
Ideally, you won't wait until your end-users begin experiencing problems to detect slow Kafka consumer performance issues. You'll instead monitor consumer performance so that you can identify slow consumers quickly, before the problem snowballs and you end up with huge backlogs.
There are several ways to monitor and identify slow consumers in Kafka:
Offset Explorer
You can use the free Offset Explorer tool to monitor Kafka offsets. By comparing the offsets a consumer group has committed against the latest (log-end) offsets of the topic's partitions, you can identify situations where there is a major gap between the two – the telltale sign of consumer lag.
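You can also compute the same lag figure yourself with Kafka's Java AdminClient by subtracting each partition's committed offset from its log-end offset. The broker address and group id below are placeholders; the kafka-consumer-groups.sh CLI that ships with Kafka reports the same numbers in its LAG column:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the group has committed so far.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("payments-consumers")
                         .partitionsToOffsetAndMetadata().get();

            // Latest (log-end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(request).all().get();

            // Lag per partition = log-end offset - committed offset.
            committed.forEach((tp, meta) -> {
                long lag = latest.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```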
Prometheus and other open source Kafka monitoring tools
If you want more context on when and why consumer lag occurs, open source monitoring tools like Prometheus come in handy. You can use them to track the resource utilization and overall performance of the various parts of your Kafka cluster, helping to pinpoint slow consumer behavior and determine whether it's associated with a lack of available resources or a different problem (like bottlenecks within consumer processing operations).
Proprietary Kafka monitoring services
Depending on where you host Kafka, you may also be able to take advantage of monitoring services designed to collect various metrics automatically from your Kafka cluster and help detect consumer performance issues (among other problems). For example, Amazon Managed Streaming for Apache Kafka (MSK) integrates with CloudWatch and provides built-in consumer lag metrics to detect slow consumer performance.
Client-side Kafka monitoring with eBPF
With eBPF, you can monitor Kafka from the client side. You can see what's happening from the perspective of both producer and consumer applications, and you can measure producer-consumer latency from each individual client.
At the same time, using eBPF means that you can integrate Kafka monitoring more elegantly into your broader monitoring strategy. The reason this works is that eBPF serves as the foundation for monitoring anything that runs within a Linux-based stack. So, using the same tooling, you can monitor not just Kafka data streams, but also the Kubernetes clusters that host the applications that depend on those data streams (for example).
How to mitigate Kafka consumer lag problems
The best way to fix Kafka slow consumer issues depends, of course, on what the root cause of the issue is. You should use monitoring tools to determine the reason why your consumers are slowing down.
That said, there are some basic steps you can take that will mitigate most types of consumer lag in most situations.
Optimizing consumer code and processing logic
As we noted above, inefficient logic inside consumers is a common source of lag. Review the routines you've configured in your consumers and make sure there are no bottlenecks that are slowing down message processing.
Adjusting consumer configuration parameters
As an alternative to modifying processing logic, changing configuration parameters for consumers may help improve performance. Kafka supports a variety of consumer configuration parameters that can control options like how partitions are assigned, how much data consumers fetch, and how long to wait between fetch requests. In some cases, poor configuration choices can slow down the transmission of messages to consumers that otherwise perform well.
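As a sketch of what that tuning looks like, here is a hypothetical set of consumer properties. All of these keys are real Kafka consumer configs, but the values shown are illustrative – the right numbers depend on your message sizes and processing speed:

```java
import java.util.Properties;

public class TunedConsumerConfig {
    public static Properties tuned() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "payments-consumers");      // placeholder

        // Fetch larger batches per request instead of many tiny ones.
        props.put("fetch.min.bytes", "65536");  // broker waits for ~64 KB...
        props.put("fetch.max.wait.ms", "100");  // ...but no longer than 100 ms

        // Cap how much work each poll() returns so processing stays within
        // max.poll.interval.ms and the consumer isn't kicked from the group.
        props.put("max.poll.records", "500");
        props.put("max.poll.interval.ms", "300000");

        // How partitions are distributed across the group's consumers.
        props.put("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        return props;
    }
}
```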
Improving resource allocation
Allocating more resources to your Kafka cluster is another way to mitigate many consumer performance issues. Although blindly tossing more memory and CPU at your servers or containers isn't a cost-effective or scalable way to fix problems that stem from other causes (like suboptimal consumer logic), it does have the effect of alleviating poor performance in any situation where the consumers simply lack enough resources to do their jobs well.
Network optimization and connectivity enhancements
Taking steps to improve network performance, too, can boost consumer performance. Remove any applications that are dumping unnecessary data to the network in order to reduce the risk of congestion. You can also trace Kafka traffic on your network using a tool like Wireshark to identify the location of any network bottlenecks that impede communication between producers and consumers.
Load balancing and parallel processing
Reviewing and optimizing load balancing and parallel processing configurations in Kafka is another way to reduce consumer lag. Again, although in general creating multiple consumers is a good thing because it helps to balance load, it's possible you have more consumers than you should based on your topics. You might also be able to improve performance by modifying the number of producers so that they can stream data more efficiently to consumers.
Modifying partition count
As we mentioned, partitions that are not properly distributed across Kafka brokers can lead to lag, and modifying the partition count can help fix this issue. Keep in mind that Kafka only lets you increase a topic’s partition count – you can’t remove partitions from a topic without recreating it – so the usual move is to add partitions (and, if needed, consumers to match) in order to redistribute load and relieve bottlenecks on hot partitions.
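Increasing a topic's partition count is a one-line AdminClient call. The broker address, topic name, and target count below are placeholders:

```java
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

public class AddPartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Raise the topic's partition count to 12 (counts can only grow).
            admin.createPartitions(Map.of("payments", NewPartitions.increaseTo(12)))
                 .all().get();
        }
    }
}
```

One caution: adding partitions changes which partition keyed messages map to, so applications that rely on per-key ordering should plan the change carefully.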
Using application-specific queues
In some cases, setting up queues for specific applications can help to reduce lag in Kafka. The idea is to buffer incoming messages in a bounded queue inside the application and have the consumer stop fetching new messages whenever the queue is full, resuming once the backlog has been worked through.
This approach, which applies backpressure instead of letting a bottleneck build unchecked, is essentially a workaround for slow consumers. It’s especially helpful in situations where you can’t make your consumers more efficient but don’t want every message stuck waiting on a consumer to catch up.
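Here's a minimal sketch of this pattern using Kafka's Java client: the consumer feeds a bounded in-process queue and pauses fetching when the queue fills, resuming once worker threads (omitted here) drain it. Names and sizes are placeholder assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BackpressureConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "payments-consumers");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Bounded in-process queue between the poll loop and slower workers
        // (the worker threads that drain this queue are omitted from the sketch).
        BlockingQueue<ConsumerRecord<String, String>> queue = new ArrayBlockingQueue<>(1000);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    queue.offer(record); // a real implementation would handle a full queue here
                }
                if (queue.remainingCapacity() == 0) {
                    // Queue is full: stop fetching until workers catch up.
                    consumer.pause(consumer.assignment());
                } else {
                    consumer.resume(consumer.paused());
                }
            }
        }
    }
}
```

Pausing partitions while continuing to call poll() keeps the consumer's group membership alive, so it isn't kicked out of the group while it catches up.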
Tips for preventing Apache Kafka consumer lag
Even better than mitigating Kafka slow consumers is preventing consumer lag from happening in the first place. Best practices to that end include:
- Design an efficient cluster: When you set up your cluster, think strategically about how many producers and consumers to create, as well as how to organize partitions and topics. Having the right cluster architecture from the start does much to prevent slow consumers and other issues.
- Optimize resource allocation and scaling: Along similar lines, take time when designing your cluster to estimate how much CPU and memory the components will require based on the number of messages you expect them to process. Then, assign resources accordingly. You can also plan ahead to scale your clusters by adding or removing brokers based on changes in demand.
- Ensure a robust network: To optimize Kafka network performance, consider setting up a Virtual Private Cloud (VPC) or dedicated subnet for your Kafka cluster, which will isolate its network traffic from other resources and mitigate interference.
Keeping Kafka consumers healthy, wealthy, and wise
If you use Apache Kafka, you probably want your data to stream in real time or very close to it. But slow Kafka consumers can quickly turn "real-time" streaming into another empty promise, because you end up with backlogs, processing delays and, potentially, data loss.
Avoid this risk by continuously monitoring the performance of Kafka consumers – along with the rest of your Kafka cluster components – so that you can detect and react to lag quickly, before the real-time data streams that are supposed to be a pivotal part of overall application performance turn into your weakest link.