Kubernetes Monitoring 101: Challenges & Best Practices
Dive into Kubernetes metrics, and learn why Kubernetes monitoring matters to properly assess your workload's resilience and robustness.
Those just getting started on their Kubernetes journey may encounter an inconvenient truth: they need to monitor Kubernetes. But whether it’s due to limitations in the available tools or the difficulty of correlating Kubernetes events with application events, monitoring Kubernetes doesn’t come easy.
If you find yourself in that boat, read on and join us for an introduction to Kubernetes monitoring – not just how it works and why it's important, but also how to take the pain out of it.
What is Kubernetes monitoring?
Simply put, Kubernetes monitoring is the practice of tracking the status of all components of a Kubernetes environment. Because there are many pieces inside Kubernetes, Kubernetes monitoring actually entails monitoring many distinct things, such as:
• The kube-system workloads.
• Cluster information using the Kubernetes API.
• How applications interact with Kubernetes, by monitoring apps from the bottom up.
By collecting Kubernetes data, you’ll get valuable information about your Kubernetes cluster’s health that can help you perform Kubernetes troubleshooting and manage issues like unexpected container termination. You can also leverage the data for proactive decisions, such as adjusting rate limits.
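For example, here's a minimal sketch of pulling basic cluster-health signals straight from the Kubernetes API, assuming the official `kubernetes` Python client and a kubeconfig with read access to the cluster:

```python
# A minimal sketch of basic cluster-health signals from the Kubernetes API.
# Assumes the official `kubernetes` Python client and read access to the cluster.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a Pod
v1 = client.CoreV1Api()

# Node conditions tell you whether each node is Ready, under memory/disk pressure, etc.
for node in v1.list_node().items:
    for cond in node.status.conditions:
        if cond.type == "Ready":
            print(f"node={node.metadata.name} ready={cond.status}")

# Pod phases surface unexpected terminations and stuck workloads.
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    if pod.status.phase not in ("Running", "Succeeded"):
        print(f"pod={pod.metadata.namespace}/{pod.metadata.name} phase={pod.status.phase}")
```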
Read more about Kubernetes Logging
With that said, a Kubernetes application performance monitoring strategy that hinges on monitoring the Kubernetes infrastructure in isolation is not enough. You'll also want to be able to correlate the Kubernetes data with your application metrics to get a clear picture of your Kubernetes cluster, and to pinpoint the root cause of issues that may involve multiple components (such as a Pod that has failed to start because there isn't an available node to host it).
Read more about Kubernetes Tracing
With groundcover, it’s easy to correlate Kubernetes issues with the deployed applications
Why is it important to monitor Kubernetes?
As you read about monitoring Kubernetes in the section above, you may have thought to yourself: “Does Kubernetes monitoring really make a difference when assessing application health?” or “Is it worth going beyond monitoring my workloads when tracking down an issue?”
I’d say the short answer to those questions is: absolutely. Although Kubernetes can be summarized as a container orchestration platform, it's a layer that constantly interacts with the deployed applications throughout their entire lifetime, and, as with anything, ‘the devil is in the details’. In our case, those details are often:
• How Kubernetes interacts with our applications (e.g. kube-proxy problems affecting service connectivity).
• How Kubernetes behaves alongside the workloads (e.g. unresponsive node agents affecting workload scheduling).
That's why it's critical to monitor Kubernetes using an approach tailored to Kubernetes. If you don't – if you monitor K8s using the same tools and methods that you'd use to monitor applications and infrastructure outside of a Kubernetes cluster and don’t take Kubernetes into account – you almost certainly won't be able to properly assess your workload's resilience and robustness.
How to monitor Kubernetes effectively
So, what does it take to monitor a Kubernetes environment effectively?
Perhaps the best way to answer that question is first to talk about how not to monitor Kubernetes. You should not do things like:
• Settle for Kubernetes monitoring based on limited metrics or only metrics of a certain type – such as basic CPU and memory resource usage data.
• Monitor Kubernetes as a separate entity from your workloads, without correlating their metrics.
Instead, you should implement a Kubernetes monitoring strategy that allows you to collect any and all relevant data from across all parts of your Kubernetes cluster in a centralized way. You can do that seamlessly with the help of eBPF, but I’ll delve into that later, after we lay the groundwork…
groundcover automatically deploys eBPF agents in your Kubernetes clusters and centralizes your data so you can easily integrate it with your existing pipelines.
Key metrics to measure in Kubernetes monitoring
Let's begin that groundwork by talking about which Kubernetes metrics you should actually monitor. We'll break these metrics down into two sections – cluster monitoring and Pod monitoring – because each corresponds to a different "layer" of a Kubernetes environment.
Cluster monitoring
Cluster monitoring means monitoring resources that apply to your cluster as a whole. Key Kubernetes metrics to track across your Kubernetes clusters include:
• Total CPU usage: Knowing how much CPU your Kubernetes cluster is using relative to the total available is critical because if you run out of available CPU, your workloads may begin to fail.
• Total memory usage: Likewise, you want to know how much spare memory your cluster has so that you can avoid failures related to running out of memory.
• Total disk usage: If you're deploying stateful applications on Kubernetes – or even if you're not and just need some persistent storage resources for data produced by your cluster, such as your etcd keys and values – you should monitor total available disk space so you don't run short.
In addition to knowing resource utilization rates for your cluster as a whole, you'll ideally be able to break down the Kubernetes cluster metrics on a node-by-node basis so you can identify individual nodes that are running short of CPU, memory and so on. This granularity is important because although Kubernetes will automatically attempt to move workloads between nodes if some nodes run short of spare resources, it can't always do that successfully, and you may run into problems such as crashed nodes or apps if some nodes are maxed out on resources even if others are not.
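To make that concrete, here's a hedged sketch of breaking cluster resource usage down node by node by querying the metrics.k8s.io API – it assumes the Kubernetes Metrics Server (covered below) is installed and that you're using the official Kubernetes Python client:

```python
# A sketch of per-node resource usage via the metrics.k8s.io API.
# Assumes the Metrics Server is installed and exposing NodeMetrics.
from kubernetes import client, config

config.load_kube_config()
metrics_api = client.CustomObjectsApi()

# Each item carries per-node CPU usage (typically in nanocores, "n")
# and memory usage (typically in "Ki").
node_metrics = metrics_api.list_cluster_custom_object(
    group="metrics.k8s.io", version="v1beta1", plural="nodes"
)
for item in node_metrics["items"]:
    usage = item["usage"]
    print(f"{item['metadata']['name']}: cpu={usage['cpu']} memory={usage['memory']}")
```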
Read more about Kubernetes Cost Optimization
Pod monitoring
For monitoring Pods – the type of Kubernetes object in which applications usually live – you'll also want to track CPU, memory and (for stateful apps) disk usage, although you'll do so for somewhat different purposes:
• Pod CPU utilization: Tracking Pod CPU metrics over time allows you to identify unusual events, such as a sudden spike in resource consumption, that may be the sign of a problem. It also allows you to get an early warning if CPU usage is approaching the maximum available, in which case you should allocate more CPU to your Pod or move your Pod to a node with more CPU.
• Pod memory utilization: What is true of Pod CPU metrics is also true of Pod memory metrics: Measuring memory utilization over time provides a window into Pod health and helps avoid out-of-memory issues for Pods and containers.
• Pod disk utilization: If containers in your Pod require stateful storage, you should track disk utilization metrics for each Pod so you know they have sufficient storage available.
In addition to tracking these Pod metrics, you should also track Pod state. Pod state is not a type of metric, but it can help you understand Pod behavior, and you can correlate changes in Pod state with Pod metrics to gain additional context on problems. For example, if CPU usage spikes when a container within a Pod is stuck in the pending state, it's more likely than not that a bug in the container is triggering high CPU usage and causing the Pod to fail to enter the running state successfully.
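As a rough illustration of that correlation, the sketch below reads each Pod's phase from the core API and prints it next to the Pod's container-level CPU and memory usage from the metrics.k8s.io API. It assumes the Metrics Server is available, and the namespace is just a placeholder:

```python
# A rough sketch of reading Pod state alongside Pod resource metrics.
# Assumes the Metrics Server is installed; the namespace is illustrative only.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
metrics_api = client.CustomObjectsApi()

namespace = "default"  # placeholder namespace

# Map each Pod to its current phase (Pending, Running, Failed, ...).
phases = {p.metadata.name: p.status.phase
          for p in core.list_namespaced_pod(namespace).items}

# Print per-container CPU/memory usage next to the Pod's phase, so a CPU spike
# can be read in the context of the Pod's lifecycle state.
pod_metrics = metrics_api.list_namespaced_custom_object(
    group="metrics.k8s.io", version="v1beta1", namespace=namespace, plural="pods"
)
for item in pod_metrics["items"]:
    name = item["metadata"]["name"]
    for c in item["containers"]:
        print(f"{name} ({phases.get(name, 'unknown')}): "
              f"container={c['name']} cpu={c['usage']['cpu']} mem={c['usage']['memory']}")
```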
Types of metrics collection methods
As for actually collecting cluster and Pod metrics, there are several possible approaches:
- Using the Kubernetes Metrics Server, which generates metrics about various objects in Kubernetes. (The Metrics Server is a replacement for Heapster, a tool that was used for the same purpose in the past.) The challenge here is that the Metrics Server is not installed by default on many Kubernetes distributions. It also only supports a limited range of metric types.
- Running a monitoring agent in a sidecar container alongside all containers that you want to monitor. This approach lets you collect metrics from containers and Pods. The downside is that you have to configure a sidecar for each workload you want to monitor, and each sidecar increases the resource consumption of your workloads. Also, sidecars only work for monitoring containers and Pods, not the Kubernetes cluster as a whole (you can use a DaemonSet for cluster metrics).
- Using a DaemonSet to collect Kubernetes cluster metrics. With a DaemonSet, you can deploy a monitoring agent on each node in your cluster, making it possible to collect metrics from all nodes – and therefore for the Kubernetes cluster as a whole (see the sketch after this list). The challenge here, though, is the reverse of the limitation of sidecar containers that we just mentioned: DaemonSet-based monitoring only allows you to collect cluster metrics, not Pod metrics. DaemonSets also increase the CPU and memory utilization of your nodes.
- Using eBPF, which lets you collect metrics for your cluster and Pods alike in a hyper-efficient way. We'll talk about eBPF more below, but suffice it to say that we think it's a fundamentally better way to collect metrics in Kubernetes because one tool lets you collect metrics of all types, with minimal infrastructure overhead.
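To illustrate the DaemonSet approach mentioned above, here's a minimal, hypothetical sketch that deploys one monitoring-agent Pod per node using the Kubernetes Python client. The image name, namespace and labels are placeholders, not a real agent:

```python
# Hypothetical sketch: deploy a node-level monitoring agent as a DaemonSet so
# one agent Pod runs on every node. Image, namespace and labels are placeholders.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

daemonset = client.V1DaemonSet(
    metadata=client.V1ObjectMeta(name="monitoring-agent", namespace="monitoring"),
    spec=client.V1DaemonSetSpec(
        selector=client.V1LabelSelector(match_labels={"app": "monitoring-agent"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "monitoring-agent"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="agent",
                        image="example.com/monitoring-agent:latest",  # placeholder image
                        resources=client.V1ResourceRequirements(
                            # Cap the agent so its footprint on each node stays predictable.
                            limits={"cpu": "100m", "memory": "128Mi"},
                        ),
                    )
                ]
            ),
        ),
    ),
)

apps.create_namespaced_daemon_set(namespace="monitoring", body=daemonset)
```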
Kubernetes monitoring challenges
The main challenge you're likely to face with Kubernetes monitoring is that Kubernetes includes many different components, and collecting metrics for all of them can be tough. As the preceding section outlined, most collection methods (eBPF not included!) only let you collect certain types of metrics, and/or only support certain types of sources.
On top of this, monitoring in Kubernetes can be tough because collecting the data is only half of the battle. You must also correlate your various metrics to gain the context necessary to find and fix problems.
For example, knowing that a node's CPU utilization has spiked to 100 percent doesn't tell you much other than that the node is using a lot of CPU. To monitor the node effectively in this case, you'd probably want to look at metrics from the Pods running on the node to determine which one(s) are experiencing high CPU utilization, when the high rate of CPU consumption started and which events (such as a new Pod moving to the node) correlate with the change. You would also want to check whether any of the Pods are being forced to run on that specific node, or if Kubernetes placed them there automatically.
Knowing all of the above provides the context to determine whether you have a buggy container in one of your Pods that is wasting CPU on the node, or whether the node is simply maxed out because it is hosting too many CPU-hungry Pods – in which case you may want to move some Pods to different nodes, or increase the CPU allocation for the node (assuming it's a virtual machine and you can do that).
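As a sketch of that kind of investigation (the node name is a placeholder for the node that triggered the alert), you can list the Pods scheduled on the hot node and flag the ones that are pinned there by a node selector or node affinity:

```python
# Minimal sketch: gather context for a node-level CPU spike by listing the Pods
# scheduled on that node and flagging any that were pinned there explicitly.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

node_name = "worker-1"  # hypothetical node under investigation

pods = v1.list_pod_for_all_namespaces(
    field_selector=f"spec.nodeName={node_name}"
).items

for pod in pods:
    # A non-empty nodeSelector or node affinity means the Pod was constrained to
    # specific nodes rather than placed freely by the scheduler.
    forced = bool(pod.spec.node_selector) or bool(
        pod.spec.affinity and pod.spec.affinity.node_affinity
    )
    print(f"{pod.metadata.namespace}/{pod.metadata.name} pinned={forced}")
```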
Kubernetes monitoring best practices
In order to leverage Kubernetes data effectively as an integral part of your workload monitoring, here are some Kubernetes monitoring best practices we encourage:
- Correlate data: Again, because Kubernetes has so many layers and components, the ability to correlate monitoring data between different types of resources is critical. Monitoring and analyzing data from individual resources in isolation is often not enough to determine the root cause of a failure or assess how many resources it impacts.
- Configure contextual alerts: Generic threshold-based alerting – which means generating alerts whenever resource utilization crosses a predefined level – doesn’t typically work well in Kubernetes, because Kubernetes workloads often scale up and down on a continuous basis. Instead, you should configure alerts that take context into account. For example, workload instability caused by a temporary cluster resize might be treated with a lower severity (see the sketch after this list).
- Analyze data in real time: Because Kubernetes clusters are constantly changing, analyzing data even just minutes after it was collected may not be enough to deliver actionable insights. You want to be able to ingest and analyze data in real time whenever possible.
- Keep monitoring predictable: As we mentioned, monitoring tools can consume a lot of resources, and due to their sometimes “injected” nature (sidecars), they can deprive your production workloads of the resources they need to run well and create ambiguity about resource consumption. Avoid this problem by choosing a monitoring architecture that can be tailored to your needs and focuses on safety and consistency; eBPF-powered Kubernetes observability is a great way to achieve this.
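To illustrate the contextual-alerting idea from the list above, here's a simplified, hypothetical sketch in plain Python: the CPU alert is downgraded when the workload's replica count recently changed, on the assumption that the spike is part of a scale event rather than a fault. The function and its thresholds are illustrative, not a prescribed rule:

```python
# Hypothetical illustration of context-aware alerting: a plain CPU threshold is
# softened when the workload's replica count changed recently, since a spike that
# coincides with a scale event is usually expected rather than a fault.
def alert_severity(cpu_usage: float, cpu_limit: float,
                   replicas_now: int, replicas_five_min_ago: int) -> str:
    utilization = cpu_usage / cpu_limit
    scaling_in_progress = replicas_now != replicas_five_min_ago

    if utilization < 0.8:
        return "none"
    if scaling_in_progress:
        return "low"   # likely transient: the cluster is re-sizing the workload
    return "high"      # sustained pressure with no scale event to explain it


# Example: 95% utilization during a scale-up is reported as low severity.
print(alert_severity(cpu_usage=1.9, cpu_limit=2.0,
                     replicas_now=6, replicas_five_min_ago=3))  # -> "low"
```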
Top Kubernetes monitoring solutions: The old way and the good way
Unfortunately, conventional Kubernetes monitoring solutions – those that use data collection methods other than eBPF – don't always lend themselves well to monitoring best practices.
These solutions are essentially efforts to port traditional monitoring practices to Kubernetes without taking cloud-native considerations (such as cluster elasticity, Kubernetes RBAC and workload replication) into account. Sometimes they rely on Kubernetes practices, such as sidecar injection, that should be used with caution and are not always suited for long-running observability due to the extra resource utilization they cause. These approaches also treat Kubernetes infrastructure and workloads as if they were separate entities that don’t interact.
So, with non-native or conventional Kubernetes monitoring, you end up having to deploy multiple tools to collect monitoring data. None of them are particularly efficient.
Kubernetes monitoring made easy: The eBPF approach
Fortunately, there's a better solution to Kubernetes monitoring: eBPF.
eBPF is a framework that allows you to run programs in the kernel space of Linux-based servers. What that means, in essence, is that eBPF makes it possible to deploy monitoring software (among other types of tools) that is very efficient and secure, but that also provides very granular visibility into workloads.
The grand idea behind eBPF-based Kubernetes monitoring is that if you can run monitoring agents in kernel space on each node, you can use them as a vantage point for collecting any data you want from your Kubernetes cluster – not just from your own workloads, but from the Kubernetes system workloads as well. All of the data generated by these resources passes through the kernel of the operating system that hosts them, so there is virtually no limit to what you can monitor using eBPF.
eBPF is still fairly new – it debuted only in 2014, and it has taken some time to gain widespread adoption – so it wasn't always at the center of Kubernetes monitoring. But now that it has matured, eBPF has redefined the possibilities when it comes to monitoring Kubernetes clusters. It has opened up a radically simpler, more efficient and more effective approach, and it enables a Kubernetes monitoring strategy that is truly tailored to the distributed nature of Kubernetes.
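As a small taste of what that looks like in practice, here's a tiny, hedged example using the BCC toolkit's Python bindings: it loads a program into the kernel that fires on every execve() syscall on the node – the kind of node-wide vantage point eBPF-based monitoring agents build on. It requires root privileges and the bcc package installed on the host, and is a toy, not a production agent:

```python
# Toy illustration of the eBPF idea using BCC: a program loaded into kernel space
# that fires on every execve() on the node. Requires root and the bcc bindings.
from bcc import BPF

program = """
TRACEPOINT_PROBE(syscalls, sys_enter_execve) {
    bpf_trace_printk("execve observed\\n");
    return 0;
}
"""

b = BPF(text=program)
print("Tracing execve() calls... Ctrl-C to stop.")
b.trace_print()  # stream events emitted from kernel space
```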
Gain with no pain: Monitoring Kubernetes without losing it
To sum up, Kubernetes monitoring has traditionally been tricky. There were so many types of resources to monitor, and so many different types of data to collect and correlate from each one, that there wasn't a great way of getting all of the data you needed to manage Kubernetes as part of comprehensive fleet management.
Luckily, with a little help from eBPF, these problems disappear. Kubernetes monitoring based on eBPF makes it possible to get all of the information you need, across all Kubernetes resource types, in an efficient, consistent, secure and contextual way.