Kubernetes Disk Pressure: Common Causes & How to Fix Them
Kubernetes disk pressure can pose a challenge for K8s admins. Find out how to prevent disk pressure issues from undercutting the performance of workloads.
In a perfect world, Kubernetes nodes would have endless supplies of storage, and Pods and containers would never run out of sufficient disk space.
But in the real world, node storage resources are always finite – which is why Kubernetes disk pressure can become a challenge. Nodes can't simply generate more disk space out of thin air, and when they run short of available storage capacity, disk pressure results.
The good news is that, with the right Kubernetes monitoring and management practices, Kubernetes admins can prevent disk pressure issues from undercutting the performance of their workloads. Keep reading as we explain everything you need to know about node disk pressure on Kubernetes.
What is Kubernetes node disk pressure?
In Kubernetes, node disk pressure is a condition where a node begins to run out of available disk space. When this happens, Kubernetes can evict Pods from the node (through a process known as node-pressure eviction) to free up available space on the node.
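Node-pressure eviction is driven by thresholds the kubelet watches. As a sketch, these can be tuned in the kubelet configuration file; the values below are illustrative and close to the upstream defaults:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  nodefs.available: "10%"     # node filesystem (logs, writable container layers)
  imagefs.available: "15%"    # image filesystem (container images), if separate
  nodefs.inodesFree: "5%"     # running out of inodes also counts as disk pressure
```

When any of these thresholds is crossed, the kubelet reports the DiskPressure condition and begins evicting Pods.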
Importantly, Kubernetes node disk pressure issues typically occur on a node-by-node basis. In other words, one node might be running out of storage space while other nodes continue to have plenty of storage availability. This is because workloads can only consume storage provided by the specific nodes that host them, so if a Pod starts eating up too much disk space, it will only be using disk space on its node – without affecting other nodes in the cluster.
That said, if you deploy many storage-hungry Pods across your cluster, you could face disk pressure across most or all of your nodes.
We should also make clear that node disk pressure occurs when a node runs short of local disk space – the filesystems the kubelet uses for container images, writable layers and logs – not when it runs short of memory (RAM), which triggers the separate memory pressure condition.
Common node conditions, explained
Disk pressure is one of several types of "pressure" conditions that can affect nodes in Kubernetes. To place node disk pressure in context, let's discuss all of the main "pressure" conditions.
Memory pressure
Memory pressure occurs when a node runs short of available memory. This can be caused by issues like a memory leak in an application that causes it to consume more and more memory over time. It can also happen simply because too many Pods have been scheduled on a particular node.
Disk pressure
As we mentioned, disk pressure happens when a node runs short of storage space. There are several potential causes of node disk pressure; we'll dive into them in detail below.
PID pressure
PID pressure occurs when a node runs short of available process IDs and can no longer create new processes. In Linux, each process is assigned a unique process ID (PID), and the kernel only allows a finite number of them.
The exact maximum number of allowable PIDs varies between Linux distributions and configurations, but it's typically at least 32,768. That makes PID pressure a rare issue, although it can happen if a very large number of processes is running on a single node. It may also occur if you have security rules in place within Linux that restrict the total number of processes that can run simultaneously; some Linux distributions enforce policies like these to prevent attackers from spawning large numbers of illegitimate processes.
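To see how close a node is to its PID limit, you can compare the kernel's maximum against a rough count of running processes. A minimal sketch for a Linux node:

```shell
# Maximum PID value the kernel will assign on this node
cat /proc/sys/kernel/pid_max

# Rough count of currently running processes (one /proc entry per PID)
ls -d /proc/[0-9]* | wc -l
```

If the second number is anywhere near the first, the node is at risk of PID pressure.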
Lack of CPU
Lack of spare CPU on a node doesn't cause a "pressure" condition per se, but it can lead to Kubernetes CPU throttling, which slows down workloads until more CPU becomes available.
Why should you care about node disk pressure in Kubernetes?
Running out of available disk space can cause several types of problems for Kubernetes.
Pod eviction and rescheduling
As noted above, Kubernetes can evict Pods from a node that is facing disk pressure, then reschedule those Pods on a different node. This may not be a huge issue if the Pods are able to be rescheduled quickly. But even then, there may be some application downtime while the Pods are restarting on a new node.
There is also a risk that Kubernetes won't be able to find a new node to host the evicted Pods because no node is available with enough CPU, memory and storage to support them. In that case, any applications backed by those Pods will remain down until a suitable node becomes available.
Node performance problems or crashes
If node pressure events can't be resolved quickly enough by evicting Pods, there is a risk that the affected nodes may crash entirely due to a lack of available disk space.
This can happen because the Linux operating system running on nodes typically requires disk space to do things like appending data to log files and starting new processes. In most cases, Linux systems keep a certain amount of disk space in reserve (on ext filesystems, for example, a percentage of blocks is reserved for the root user rather than ordinary applications), which provides a buffer to help prevent crashes due to exhaustion of disk space. But if that buffer is used up, the operating system may begin to fail or operate very slowly.
Cluster performance and stability
If several nodes across your cluster experience disk pressure, the overall performance and stability of the cluster may begin to degrade.
That's because disk pressure issues that affect multiple nodes at the same time may result in the inability to schedule new Pods (or Pods that were evicted because of disk pressure). As a result, workloads will begin to go down or become less responsive.
There is also a risk that the control plane (meaning the core Kubernetes components that manage the rest of the cluster) could begin to crash, especially if you have configured control plane nodes to operate as worker nodes as well (which means the same node can host both the control plane and Pods).
Common causes of node disk pressure
Now that we know what node disk pressure means and why it's bad, let's talk about what causes it.
The underlying root cause of disk pressure, of course, is a lack of available disk space. But there are several specific reasons why a node might begin to run short of unused disk capacity.
Application logs exhaust local storage
A top cause of node disk pressure is the consumption of disk space by application logs.
Containers often write log files to record events that happen during the course of the container's operation. In most cases, there are no controls in place to restrict the size of the log files. As a result, a log file could become larger and larger over time, especially if it's not rotated (meaning older log data is deleted or moved outside the node to free up space).
Buggy logging logic could also result in excessively large log files in the event that a container writes more data to a log than it should.
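One built-in guardrail is the kubelet's container log rotation. As a sketch, the relevant KubeletConfiguration fields look like this (the values shown are the upstream defaults at the time of writing):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 10Mi   # rotate a container's log file once it reaches this size
containerLogMaxFiles: 5     # keep at most this many rotated log files per container
```

Note that this only covers logs the container runtime captures from stdout/stderr; files an application writes directly to disk still need their own rotation.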
Node is running too many Pods
Simply running too many Pods on a node can create disk pressure issues.
By default, Kubernetes schedules Pods onto nodes by comparing each node's available resources with each Pod's resource requests and limits. But if requests understate a Pod's real usage – or if admins set requests and limits that are too high or too low – the scheduler may place more Pods on a node than the node can actually handle.
In addition, admins may steer Pods onto specific nodes using mechanisms like nodeSelector, node affinity or DaemonSets. If they haven't assessed whether a chosen node has enough storage to support all the Pods it is hosting, this could result in a Pod being placed on a node that will run out of disk space.
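Declaring ephemeral-storage requests and limits gives the scheduler accurate information about a Pod's disk needs and lets the kubelet evict Pods that overrun their share. A minimal sketch (the Pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app   # illustrative name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # illustrative image
      resources:
        requests:
          ephemeral-storage: "1Gi"   # counted by the scheduler when placing the Pod
        limits:
          ephemeral-storage: "2Gi"   # exceeding this gets the Pod evicted
```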
Misconfigured storage requests
Pods can request persistent storage through a persistent volume claim (PVC). PVCs can be bound to storage resources that are shared across nodes (such as network-attached storage), but they can also be mapped to a single node's local disk. In the latter case, they're referred to as local PVCs.
If too many Pods try to share the same PVC, and the PVC doesn't provide enough storage capacity to support all of the Pods, the storage resources on the node can become exhausted, leading to disk pressure issues.
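For reference, a local PVC is one bound to a PersistentVolume of type local, which pins the storage to one node's disk. A minimal sketch, with illustrative names and paths:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-example          # illustrative name
spec:
  capacity:
    storage: 50Gi
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1         # illustrative path on the node's local disk
  nodeAffinity:                   # required for local volumes: pins the PV to one node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["node-1"]  # illustrative node name
```

Any Pod bound to this PVC both runs on that node and consumes that node's disk, which is why overcommitted local PVCs translate directly into disk pressure.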
Changes to node storage configurations
If you change the storage configuration of a running node, you may reduce the storage available to workloads. For example, if you unmount a disk while a node is running, the total storage that the node can provide will be reduced.
It's not very common to change storage configurations while a node is operating; typically, you'd want to drain the node first. But modifications of storage resources may occur due to operations like needing to replace a disk drive.
How to detect node disk pressure
The easiest way to detect node disk pressure is to list your nodes with kubectl and inspect their conditions.
The output will display each node's status and conditions. Any node whose conditions include DiskPressure with a status of True is under disk pressure.
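Assuming kubectl is configured against your cluster, the check looks like this (`<node-name>` is a placeholder):

```shell
# List nodes and their overall status
kubectl get nodes

# Inspect a specific node's conditions; DiskPressure=True means trouble
kubectl describe node <node-name> | grep DiskPressure
```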
You can also check disk conditions on a node by logging into the node directly and running a standard disk usage tool such as df.
This will display information about the disk usage of all storage volumes mounted on the node, including which percentage of available storage resources are currently in use.
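For example, on a typical Linux node:

```shell
# Show usage of all mounted filesystems in human-readable units
df -h
```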
How to troubleshoot node disk pressure in Kubernetes
To troubleshoot node disk pressure issues, admins should typically work through the following steps.
1. Confirm node disk pressure
First, log into the node that is experiencing disk pressure and confirm that it is indeed running out of storage. Again, you can do this with a disk usage tool such as df.
If the output shows that the node still has plenty of free disk space, Kubernetes may believe disk pressure is an issue even though in reality it's not. In that scenario, check the permission settings of storage resources on the node to ensure the kubelet can access all of them.
2. Analyze Pod disk usage
Once you're sure a node disk pressure issue truly exists, dig deeper to figure out how your Pods are using disk resources.
To do this, describe each of the Pods running on the node to get more information about its storage use.
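Assuming kubectl access to the cluster, a sketch of the workflow (node, Pod and namespace names are placeholders):

```shell
# List the Pods scheduled on the affected node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>

# Describe a Pod to see its volumes and any PVCs it mounts
kubectl describe pod <pod-name> -n <namespace>
```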
In the output, look for the Volumes section, which tells you which PVCs (if any) the Pod is using.
With that information, you can look at the storage resources mapped to the PVC to figure out which data actually exists in them. If there are large log files, for example, that's a clue that disk pressure happened because your containers are writing excessively large logs.
3. Analyze other disk usage
In addition to the disk space consumed by individual Pods, additional storage space on a node may be consumed by other Kubernetes components. The exact storage paths vary between distributions and container runtimes, but in general, look in the directory /var and in subdirectories like /var/lib/kubelet, /var/lib/docker, /var/lib/containerd or /var/lib/containers, depending on your runtime.
Kubernetes typically uses these locations to store data like container images, which could also be sucking up disk space and causing node disk pressure issues.
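On the node itself, a quick disk usage summary can show which of these directories is largest. A sketch (run as root for complete results; errors from unreadable paths are suppressed):

```shell
# Summarize disk usage two levels deep under /var/lib, largest entries first
du -xh --max-depth=2 /var/lib 2>/dev/null | sort -rh | head -n 15
```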
How to fix node disk pressure
The best way to fix node disk pressure depends on what, exactly, is causing it – and you should work through the Kubernetes troubleshooting steps we just described to get that insight.
That said, general practices for resolving node disk pressure include:
- Increasing storage: Adding storage capacity to a node is one way to resolve node disk pressure. However, you can only do this if you actually have extra storage that you can map onto the node, either by attaching disks directly to it or mounting network-connected storage using a protocol like NFS. You may also need to reconfigure PVCs so that the new disk space is usable by your Pods.
- Delete log files: Deleting log files is a fast way to free up disk space. Just be sure to copy the files to an external location first (such as a different node or network-attached storage) if you need to retain the log data.
- Delete container images: Deleting container images that are stored locally on a node can free up disk space.
- Remove non-essential Pods: Stopping non-essential Pods (or moving them to a different node) can add available disk space.
- Create a RAM disk: RAM disks are storage resources that can be used as if they're persistent local storage, but in actuality the storage is supplied by the server's memory (RAM). Essentially, RAM disks allow you to borrow storage capacity from RAM and use it to increase the disk space of a node. The downside, of course, is that you reduce the amount of available RAM, which can lead to issues of its own (like memory pressure), and there is also a risk that data stored on a RAM disk will be permanently lost if the system suddenly shuts down. So, while RAM disks can be a useful short-term fix for node disk pressure, they're not a good long-term solution.
- Modify resource limits: If the resource limits of a Pod are causing it to be scheduled on a node that is not a good fit, or if limits are causing too many Pods to be scheduled on the same node, update the limits.
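A couple of these remediations can be sketched as node-level commands. The example below assumes a containerd-based node with crictl installed and an illustrative log path; adapt it to your runtime and layout:

```shell
# Remove container images not used by any running container
crictl rmi --prune

# Copy a runaway log off the node, then truncate it in place
# (paths are illustrative)
cp /var/log/app/huge.log /mnt/backup/ && truncate -s 0 /var/log/app/huge.log
```

Truncating in place (rather than deleting the file) avoids the classic pitfall where a process keeps an open handle to a deleted log and the space is never actually reclaimed.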
Solving Kubernetes node errors with groundcover
At groundcover, we can't make additional node storage appear out of nowhere. But we can tell you when your nodes are experiencing disk pressure or virtually any other type of problem. We also help you correlate a wide variety of observability data points from across your cluster so you can get to the root of performance issues quickly.
And we do it all using eBPF, a hyper-efficient technology that allows groundcover to collect observability data from Kubernetes with minimal overhead and resource consumption.
Stop feeling the pressure
Disk space is an essential resource for nodes, and bad things can start to happen when it runs short. But that doesn't mean you have to sit idly by while your Pods or nodes crash due to lack of disk space. By monitoring for nodes that are running out of storage capacity before the situation becomes critical, and by knowing how to troubleshoot and resolve disk pressure events effectively, you can ensure that disk resources don't become the weakest link in Kubernetes performance.