In theory, Apache Kafka is able to stream data in "real time." But "real time" is a complicated term. Like "calorie-free," which doesn't necessarily mean that there are literally no calories in food with that label, "real-time" data streams in Kafka don't always move data in actual real time. In some cases, there's a significant delay between the time when a Kafka producer pushes data and when a Kafka consumer receives it.

When that happens, you end up with Kafka consumer lag, which can seriously undercut the performance of your otherwise healthy Kafka cluster and degrade the experience of end-users who depend on the cluster. Slow Kafka consumers turn real-time Kafka data streams into a lie that negates the core purpose of using Kafka in the first place. It's sort of like how drinking diet soda actually correlates with weight gain, which is the opposite of what it's supposed to do.

Now, we're not here to give you diet advice or help you navigate the complex world of food labeling. What we can do is provide guidance on how to detect and fix Kafka consumer lag problems in order to keep your Kafka clusters doing what they’re supposed to do – stream data in real time, or as close to it as reasonably possible.

Understanding the Kafka consumer architecture

To do that, let's start by going over how the Kafka architecture works and where consumers fit into it.

As you probably know if you have at least basic familiarity with Kafka, Kafka is a distributed event-streaming platform. A Kafka deployment includes three main types of components:

 • Producers, which are where streaming data originates.

 • Consumers, which are the recipients for streaming data.

 • Brokers, which are the servers that make up the Kafka cluster itself. They store the data that producers send and serve it to consumers, so that the data can be streamed from the former to the latter – ideally in real time.

To help optimize data streaming, you can organize multiple Kafka consumers into consumer groups. When you do this, Kafka divides the partitions of the topics the group subscribes to among the group's members, so each consumer processes only a share of the data. Essentially, consumer groups make it possible to implement a type of load balancing for Kafka consumers by allowing consumers to share in the work of handling a given data stream.

Kafka consumer lag causes

When all goes well, each consumer or consumer group accepts and processes data rapidly, with minimal delays. In a healthy Kafka cluster, you should typically be seeing no more than several hundred milliseconds of lag between when a producer pushes data and when a consumer processes it.

But the best laid plans of Kafka consumers, just like those of mice and men, often go awry, leading to slow Kafka consumer performance. There are several common reasons why this happens.

Too many consumers, too little data

In general, configuring multiple consumers within a consumer group and allowing them to share in processing a Kafka topic is a good thing because it decreases the load placed on each individual consumer.

However, within a consumer group, Kafka assigns each partition to at most one consumer. If you have more consumers in the group than there are partitions in the topic, the extra consumers sit idle and add no throughput. A larger group also creates more coordination work: every time a consumer joins or leaves, Kafka has to rebalance partition assignments across the group, and consumption can pause while that happens.

The solution here is simple: When deciding how many consumers to include in a consumer group, consider how many partitions you have in the topic you'll assign to that group and avoid creating more consumers than partitions.
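To see why extra consumers don't help, here is a minimal sketch of a round-robin assignment. It is a simplified model rather than Kafka's actual assignor code, and the consumer names and partition numbers are made up for illustration. The only rule it relies on is that each partition goes to exactly one consumer in the group.

```python
def assign_round_robin(partitions, consumers):
    """Simplified model: each partition goes to exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def idle_consumers(assignment):
    """Consumers that received no partitions and therefore do no work."""
    return [consumer for consumer, owned in assignment.items() if not owned]


# Hypothetical topic with 3 partitions and a group of 5 consumers.
assignment = assign_round_robin([0, 1, 2], ["c1", "c2", "c3", "c4", "c5"])
print(assignment)
print(idle_consumers(assignment))
```

In this toy example, consumers c4 and c5 end up with nothing to read, which is the situation the rule of thumb above is meant to avoid.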

Consumer code bottlenecks

Although all Kafka consumers do the same basic thing – process streaming data – you can configure exactly how individual consumers do that. If the code that tells your consumers how to behave is inefficient, you could end up with consumer bottlenecks.

For instance, if your consumer logic has to wait on one piece of data before it can begin processing another, the consumer may sit idle during that wait instead of using the time to process other messages.
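Here is a rough sketch of that kind of bottleneck and one possible way around it. The `enrich` function is a hypothetical handler that stands in for any slow external lookup, and the list of integers stands in for a batch of records returned by a poll. Nothing here talks to a real Kafka cluster.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def enrich(record):
    """Hypothetical handler that waits on a slow external lookup."""
    time.sleep(0.05)  # stand-in for a network or database call
    return record * 2


def process_sequentially(records):
    # Each record waits for the previous lookup to finish.
    return [enrich(record) for record in records]


def process_concurrently(records, workers=4):
    # Lookups overlap; map() still returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(enrich, records))


batch = list(range(8))  # stand-in for one batch of polled records

start = time.perf_counter()
sequential_results = process_sequentially(batch)
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
concurrent_results = process_concurrently(batch)
concurrent_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```

Both versions produce the same results, but the concurrent one spends far less time waiting. If you try something like this in a real consumer, it is generally safest to commit offsets only after the whole batch has finished, so that a crash doesn't skip records that were still in flight.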

Network and hardware issues

Problems with your infrastructure – including both your network and the servers that host your Kafka cluster – could lead to consumer lag. Flooding the network with too much data may cause some packets to be dropped and re-transmitted, which slows the speed at which they reach consumers. Lack of sufficient memory or CPU on the servers hosting consumers could cause them not to be able to process data as quickly as they should. Improperly configured memory and CPU limits on containers that host Kafka components may have the same effect.

Impact of the Kafka consumer lag problem

Kafka consumer lag can have a variety of negative effects on your overall cluster. Let's take a closer look at a few of them:

Message backlog and consumer lag "vicious cycles"

When a consumer stops processing messages as quickly as producers are generating them, a message backlog begins to build up as more and more messages sit unprocessed. The backlog continues to grow either until consumers are able to catch up or until the backlogged messages are deleted, whether by Kafka's retention policy or by a manual purge (which is typically a bad thing because you're deleting unprocessed data and could be missing out on important information).

This means that consumer lag can turn into a vicious cycle, where the problem gets worse and worse over time and it grows harder and harder for consumers to catch up.
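The arithmetic behind that cycle is simple enough to sketch. The functions below assume steady produce and consume rates, which real workloads rarely have, and the numbers are purely illustrative.

```python
def backlog_after(initial_backlog, produce_rate, consume_rate, seconds):
    """Messages waiting after `seconds`, given steady rates in messages per second."""
    backlog = initial_backlog + (produce_rate - consume_rate) * seconds
    return max(backlog, 0)


def seconds_to_catch_up(backlog, produce_rate, consume_rate):
    """Time to drain the backlog, or None if consumers can never catch up."""
    surplus = consume_rate - produce_rate
    if surplus <= 0:
        return None
    return backlog / surplus


# Illustrative numbers only: producers at 1,000 msg/s, consumers at 900 msg/s.
print(backlog_after(0, 1000, 900, 3600))        # 360000 after one hour
print(seconds_to_catch_up(360000, 1000, 900))   # None: consumers are still too slow
print(seconds_to_catch_up(360000, 1000, 1200))  # 1800.0 seconds at 1,200 msg/s
```

Even a 10% shortfall in consumer throughput adds up to hundreds of thousands of unprocessed messages within an hour, and the backlog only starts to shrink once consumers run faster than producers.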

Decreased system throughput and performance

Overall system throughput and performance also become increasingly worse over time due to consumer lag. The more messages you have in your backlog, the harder your cluster has to work to store those messages and deliver them to consumers, and the more messages you have flooding the network.

Delayed data processing and data loss

Slow consumers have the obvious effect of delaying data processing. If data is supposed to arrive in real time but there is a multi-second delay in how long it takes the consumer to process data, important information may not be received and processed in time to serve its intended purpose. An application that was supposed to make a real-time decision might end up making the decision based on data that is no longer up to date, leading to poor performance and a bad end-user experience.

Even worse, serious consumer lag can eventually lead to data loss. That happens when the backlog grows so large or so old that it exceeds the storage or retention limits on the topic and Kafka deletes messages before consumers have read them, or when you decide to purge messages in order to get your consumers caught up.

Poor performance in downstream systems and data pipelines

The impact of slow Kafka consumer performance isn't limited to Kafka itself. It has a cascading effect that can degrade the performance of any downstream applications or data pipelines that depend on Kafka.

For example, imagine that you're a bank that uses Kafka to stream information about payment transactions to an application that analyzes that data to detect fraud. If Kafka consumers don't process the data quickly, the fraud detection engine may not be able to do its job fast enough to detect a fraudulent transaction before it’s complete and a thief has made off with the goods.

Monitoring and identifying slow consumers

Ideally, you won't wait until your end-users begin experiencing problems to detect slow Kafka consumer performance issues. You'll instead monitor consumer performance so that you can identify slow consumers quickly, before the problem snowballs and you end up with huge backlogs.

There are several ways to monitor and identify slow consumers in Kafka:

Offset Explorer

You can use the free Offset Explorer tool to monitor Kafka offsets. By comparing the offset that a consumer group has committed for each partition with the latest offset written to that partition, you can effectively identify situations where there is a major gap between the two – a condition that typically results from consumer lag.
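The comparison itself is just a subtraction per partition. Here's a minimal sketch, assuming you already have the two sets of offsets in hand. The offset values are invented for illustration, and treating a partition with no committed offset as starting from 0 is a simplification.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Lag per partition: latest offset in the log minus the group's committed offset."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Simplifying assumption: a partition with no commit counts from offset 0.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end_offset - committed, 0)
    return lag


# Hypothetical snapshot for a topic with three partitions.
log_end_offsets = {0: 1500, 1: 1480, 2: 1510}
committed_offsets = {0: 1500, 1: 1200, 2: 1505}

lag = partition_lag(log_end_offsets, committed_offsets)
total_lag = sum(lag.values())
print(lag, total_lag)
```

In this snapshot, partition 1 accounts for almost all of the lag, which would point you toward whichever consumer owns that partition rather than the group as a whole.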

Prometheus and other open source Kafka monitoring tools

If you want more context on when and why consumer lag occurs, open source monitoring tools like Prometheus come in handy. You can use them to track the resource utilization and overall performance of the various parts of your Kafka cluster, helping to pinpoint slow consumer behavior and determine whether it's associated with lack of available resources or a different problem (like bottlenecks within consumer processing operations).
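One thing worth deciding when you track lag over time is how to tell a brief spike from a sustained problem. The sketch below shows one possible rule, similar in spirit to the `for` duration on a Prometheus alerting rule: only flag lag that stays above a threshold for several consecutive samples. The threshold, sample values, and function name are all assumptions for illustration.

```python
def lag_is_sustained(samples, threshold, min_consecutive):
    """True if the latest `min_consecutive` lag samples all exceed `threshold`."""
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    recent = samples[-min_consecutive:]
    return all(value > threshold for value in recent)


# Hypothetical lag samples (messages behind), collected once per minute.
brief_spike = [10, 20, 5000, 30, 40]
steady_climb = [100, 1200, 1500, 2100]

print(lag_is_sustained(brief_spike, threshold=1000, min_consecutive=3))   # False
print(lag_is_sustained(steady_climb, threshold=1000, min_consecutive=3))  # True
```

A rule like this keeps a single slow poll from paging anyone while still catching a consumer that has genuinely fallen behind.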

Proprietary Kafka monitoring services

Depending on where you host Kafka, you may also be able to take advantage of monitoring services designed to collect various metrics automatically from your Kafka cluster and help detect consumer performance issues (among other problems). For example, Amazon Managed Streaming for Apache Kafka (Amazon MSK) integrates with CloudWatch and provides a built-in consumer lag monitoring feature to detect slow consumer performance.

Client-side Kafka monitoring with eBPF

With eBPF, you can monitor Kafka from the client side. You can see what's happening from the perspective of both Producer and Consumer applications, and you can measure Producer-Consumer latency from each individual client.

At the same time, using eBPF means that you can integrate Kafka monitoring more elegantly into your broader monitoring strategy. The reason this works is that eBPF serves as the foundation for monitoring anything that runs within a Linux-based stack. So, using the same tooling, you can monitor not just Kafka data streams but also, for example, the Kubernetes clusters that host the applications that depend on those data streams.
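However the timestamps are captured, Producer-Consumer latency boils down to the gap between when a message was produced and when it was consumed. The sketch below is not eBPF code. It only shows the calculation, using made-up millisecond timestamps and a simple nearest-rank percentile.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


# Hypothetical (produced_at_ms, consumed_at_ms) pairs observed by one client.
observations = [(1000, 1040), (1010, 1065), (1020, 1900), (1030, 1075), (1040, 1100)]
latencies = [consumed - produced for produced, consumed in observations]

print(latencies)                  # [40, 55, 880, 45, 60]
print(percentile(latencies, 50))  # 55
print(percentile(latencies, 99))  # 880
```

Looking at a high percentile alongside the median helps here, because a handful of very slow messages can hide behind a healthy-looking typical latency. Keep in mind that if the two timestamps come from different machines, clock skew between them will distort the numbers.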

How to mitigate Kafka consumer lag problems

The best way to fix Kafka slow consumer issues depends, of course, on what the root cause of the issue is. You should use monitoring tools to determine the reason why your consumers are slowing down.

That said, there are some basic steps you can take that will mitigate most types of consumer lag in most situations.

Optimize consumer code and logic

As we noted above, inefficient logic inside consumers is a common source of lag. Review the routines you've configured in your consumers and make sure there are no bottlenecks that are slowing down message processing.

Improve resource allocation

Allocating more resources to your Kafka cluster is another way to mitigate many consumer performance issues. Although blindly tossing more memory and CPU at your servers or containers isn't a cost-effective or scalable way to fix problems that stem from other causes (like suboptimal consumer logic), it does have the effect of alleviating poor performance in any situation where the consumers simply lack enough resources to do their jobs well.

Network optimization and connectivity enhancements

Taking steps to improve network performance, too, can boost consumer performance. Remove any applications that are dumping unnecessary data to the network in order to reduce the risk of congestion. You can also trace Kafka traffic on your network using a tool like Wireshark to identify the location of any network bottlenecks that impede communication between producers and consumers.

Load balancing and parallel processing

Reviewing and optimizing load balancing and parallel processing configurations in Kafka is another way to reduce consumer lag. Again, although in general creating multiple consumers is a good thing because it helps to balance load, it's possible you have more consumers than the partitions in your topics can keep busy. You might also be able to improve performance by modifying the number of producers so that they can stream data more efficiently to consumers.
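As a back-of-the-envelope check, you can compare how many consumers your message rate calls for with how many your partition count allows. This sketch assumes steady rates and evenly loaded partitions. The function name and the numbers are illustrative.

```python
import math


def plan_group_size(produce_rate, per_consumer_rate, partitions):
    """Consumers needed to keep up, and how many of them can actually be used."""
    needed = math.ceil(produce_rate / per_consumer_rate)
    usable = min(needed, partitions)  # consumers beyond the partition count sit idle
    return {"needed": needed, "usable": usable, "add_partitions": needed > partitions}


# Illustrative numbers: 5,000 msg/s produced, 800 msg/s per consumer, 6 partitions.
plan = plan_group_size(5000, 800, 6)
print(plan)
```

Here the workload calls for seven consumers but the topic can only keep six busy, which suggests that adding partitions (or speeding up each consumer) would help more than adding consumers.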

Tips for preventing Kafka slow consumer issues

Even better than mitigating Kafka slow consumers is preventing consumer lag from happening in the first place. Best practices to that end include:

 • Design an efficient cluster: When you set up your cluster, think strategically about how many producers and consumers to create, as well as how to organize partitions and topics. Having the right cluster architecture from the start does much to prevent slow consumers and other issues.

 • Optimize resource allocation and scaling: Along similar lines, take time when designing your cluster to estimate how much CPU and memory the components will require based on the number of messages you expect them to process. Then, assign resources accordingly. You can also plan ahead to scale your clusters by adding or removing brokers based on changes in demand.

 • Ensure a robust network: To optimize Kafka network performance, consider setting up a Virtual Private Cloud (VPC) or dedicated subnet for your Kafka cluster, which will isolate its network traffic from other resources and mitigate interference.

Keeping Kafka consumers healthy, wealthy and wise

If you use Kafka, you probably want your data to stream in real time or very close to it. But slow Kafka consumers can quickly turn real "real-time" streaming into another empty promise because you end up with backlogs, processing delays and, potentially, data loss.

Avoid this risk by continuously monitoring the performance of Kafka consumers – along with the rest of your Kafka cluster components – so that you can detect and react to lag quickly, before the real-time data streams that are supposed to be a pivotal part of overall application performance turn into your weakest link.
