How to Troubleshoot and Fix Kubernetes Node Not Ready Issues
Learn how to diagnose and fix Kubernetes “node not ready” errors. Discover causes, troubleshooting steps, and best practices for smooth Kubernetes operations.
Nodes are one of the fundamental building blocks of a Kubernetes cluster – which is why having nodes stuck in the "not ready" state is a big problem. When nodes aren't ready, they can't host workloads. They are, in other words, dead weight until you figure out what made them not ready and fix the issue.
Keep reading for guidance as we explain everything Kubernetes admins need to know about “node not ready” issues – including what they mean, what causes them, how to troubleshoot nodes that are not ready, and how to fix the problem.
What is the Kubernetes node not ready error?
Kubernetes "node not ready" is an error indicating that Kubernetes nodes can't host workload (to put that in slightly more technical terms, it means the nodes can't schedule pods). It's a node status assigned by the Kubernetes node controller, which is responsible for monitoring the state of nodes.
"Node not ready" could indicate that the Kubernetes API server and other control plane components can't communicate reliably with the node at all because of problems like the node being stuck in a crash-restart loop or a flaky network connection. The error could also indicate that the node is reachable but is unable to support pods due to issues with the kubelet or kube-proxy processes running on the node.
You can determine whether a “node not ready” issue exists for any of your nodes by running:
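kubectl get nodes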
The output will include a list of nodes and their status (among other information). Any nodes whose status matches NotReady are in the “not ready” state.
Understanding Kubernetes node states
Before diving deeper into what causes "node not ready" errors, let's step back a bit and explain how Kubernetes tracks node status in general.
In Kubernetes, a node is a server that forms part of a Kubernetes cluster. Most nodes function as worker nodes, which means their job is to host applications (which are deployed in Kubernetes using pods). Some nodes are control-plane nodes, meaning they host the software that manages the rest of the Kubernetes cluster.
Once you join a node to a cluster, it can exist in one of the following four node states:
- Ready: The node is functioning normally and can host applications.
- NotReady: The node has a problem and can't host applications.
- SchedulingDisabled: The node is functioning normally but can't host applications because admins have used the Kubernetes cordon feature to disable scheduling on that node (see the example below).
- Unknown: The node is completely unreachable, typically due either to a failed network connection or to the node having permanently shut down.
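For example, cordoning a node (and later re-enabling scheduling on it) looks like this:

kubectl cordon node-name
kubectl uncordon node-name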
What causes “node not ready” errors?
There are many potential causes of the “node not ready” error. The following are the most common.
1. Insufficient system resources
Nodes that lack sufficient CPU or memory to host workloads may experience “node not ready” errors.
Typically, this issue occurs when you join a server to your cluster that simply doesn't have enough spare resources to host workloads, because other, non-Kubernetes applications or processes running on the node are consuming all of its CPU and memory. Memory leaks or other bugs that cause the node to waste CPU or memory could also be the underlying problem.
2. kubelet issues
The kubelet is an agent that runs on each node and manages the node's connection with the cluster. If the kubelet experiences a problem, it could lead to "node not ready" errors because Kubernetes can no longer reliably communicate with the node via the kubelet.
In general, kubelet issues are rare because kubelet is stable software. But you may experience situations where the node's operating system kills the kubelet process to free up CPU or memory. Or, you may be running a buggy version of kubelet, especially if you're using an experimental Kubernetes release.
3. kube-proxy issues
Each node in a Kubernetes cluster also runs kube-proxy, a networking agent whose main job is to enforce a networking configuration on each node that matches the network Services configured through Kubernetes. Problems with kube-proxy could cause node not ready errors by preventing the node from communicating normally with the control plane.
As with kubelet, kube-proxy is typically stable and issues with it are rare. But the operating system could kill kube-proxy for some reason, or buggy code could trigger unusual kube-proxy behavior.
4. Networking issues
Even if kube-proxy is functioning normally, problems with other networking software or infrastructure could lead to “node not ready” problems. The network that connects a node to the cluster might simply be flaky, causing intermittent disconnects. Or, problems like IP address conflicts (which happen when the node is assigned the same IP address as other endpoints on the network) could make it difficult for the control plane to reach the node reliably.
How to troubleshoot “node not ready” issues
Use the following steps to troubleshoot problems with nodes stuck in the NotReady state.
1. Confirm node status
First, double-check that your node is indeed in the NotReady node status. As noted above, you can do this by running:
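kubectl get nodes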
If you've just noticed this issue for the first time, it may be worth waiting a few minutes and checking again. Occasionally, the “node not ready” issue will resolve itself (especially in cases where the problem is due to a fluke, like a short-lived networking problem that doesn't frequently occur).
2. Connect to node
As a next step, connect to the node to make sure it's definitely up and functioning. This allows you to rule out issues like the node crashing or being completely unreachable via the network.
The best way to connect to the node will depend on how you set up your nodes. But in most cases, you can use an SSH command like the following:
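ssh username@node-ip-address

(Replace username and node-ip-address with an account on the node and the node's actual address.)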
3. Describe node
Assuming the node is indeed up and running, the next step in the troubleshooting process is to use kubectl to get more information about the node. You can do this by running:
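kubectl describe node node-name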
(Replace node-name with the actual name of the node.)
Review the output, looking in particular at the following sections:
- Conditions: This tells you whether the node is experiencing any adverse conditions, such as MemoryPressure (meaning it's running low on memory) or DiskPressure (meaning it's low on disk space due to Kubernetes disk pressure problems). If one of these conditions is true, it's likely the cause of the issue, and you can resolve it by mitigating the problem – such as by killing processes to free up memory, in the case of MemoryPressure. (This section will also tell you that the node is in the NotReady state, but you already know that.)
- Events: This will typically tell you when the node first became NotReady. It may also include information about other relevant events, like failure to start containers.
4. View node and kubelet logs
If no node conditions or events help to explain what caused the Kubernetes node not ready error, the next step is to examine the node and kubelet logs. The exact location of logs varies between operating systems, but on most Linux distributions, you can find most logs in /var/log. The most important log file is typically syslog.
So, SSH into the node and open up syslog by running:
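sudo less /var/log/syslog

(Substitute your preferred pager, or use tail -f to follow the log live. On distributions that don't write a syslog file, use journalctl instead.)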
As you review the log, look for events related to kubelet or kube-proxy. If these processes have shut down or been killed, you'll typically find information about those events in this log.
Depending on how you installed Kubernetes, you can also typically view kubelet logs using:
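sudo journalctl -u kubelet

(This works on systemd-based systems, which covers most modern Linux distributions.)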
As with syslog, reviewing the kubelet logs can help identify events related to kubelet crashing or otherwise behaving erratically.
5. Review other node details
If you're still at a loss as to why the Kubernetes node is NotReady, there are a few other things you can check while logged into the node:
- The top command on most Linux distributions will display information about running processes and how much CPU and memory they are using. If kubelet or kube-proxy are misbehaving because of issues like memory leaks, this data may clue you in.
- The df command displays data about disk space usage. If the node is running very low on disk space, this will tell you. It will also tell you exactly which partition is running out of space, in the event that there are multiple partitions.
- The netstat command displays information about network connections, which may be useful for identifying unusual network behavior.
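For example, the following invocations are a useful starting point (df -h prints human-readable sizes, and netstat -tn lists active TCP connections):

top
df -h
netstat -tn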
Generally, most of the relevant data you can get from these commands would also be recorded in syslog. But in certain cases – such as if the disk is so full that the system can no longer write to syslog – it may not be, so it's worth performing these additional checks.
6. Verify network connectivity
In some cases, the node's networking configuration may appear valid based on information provided by the Kubernetes node itself, but this doesn't necessarily mean the Kubernetes control plane can reach the node.
To check for issues in the connection between the node and the control plane, first determine the node's IP address, which you can find by running the following kubectl get nodes command:
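kubectl get nodes -o wide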
Then, SSH into a control plane node and run the following command:
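traceroute node-ip-address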
Replace node-ip-address with the IP address that kubectl reports for the node.
The output will display data about the flow of network packets between the control plane and the node. If packets are being held up at some point on the network – such as when they exit a subnet – this information will help you identify the problem.
7. Check kube-system components
Kube-system is a namespace that hosts objects created by the control plane, including kube-proxy. Verifying the status of resources running in this namespace can be helpful for troubleshooting in cases where an issue on the control plane side, like a failed kube-proxy pod, has caused nodes to become NotReady (that said, if the issue lies with the control plane, it's likely that most or all of your nodes will become NotReady, so this is rarely the culprit).
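For example, you can verify that the pods in the namespace – including kube-proxy – are all Running with:

kubectl get pods -n kube-system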
8. Restart kubelet and kube-proxy
Restarting the kubelet service and kube-proxy on the node may help to resolve Kubernetes node “not ready” issues. In addition, watching log events and resource utilization by kubelet and kube-proxy as they restart could provide insight into why they are not functioning normally. For example, you may notice that one of these processes steadily increases its memory usage over time, which is an indication of a memory leak.
On most Linux distributions, you can restart the kubelet service and kube-proxy with:
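sudo systemctl restart kubelet

In many clusters, kube-proxy runs as a DaemonSet pod rather than a systemd service, so you can restart it by deleting its pod and letting Kubernetes recreate it (the k8s-app=kube-proxy label below is the kubeadm default):

kubectl delete pod -n kube-system -l k8s-app=kube-proxy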
9. Restart the node
As a final troubleshooting step, you can try restarting the entire Kubernetes node. While this won't necessarily tell you why the issue occurred, it may resolve it in cases where the problem stemmed from a temporary failure or misconfiguration.
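If the node is still reachable, one safe sequence is to drain it from a machine with kubectl access, reboot the node itself over SSH, and then uncordon it once it comes back (if the kubelet is unresponsive, drain may hang, in which case reboot directly):

kubectl drain node-name --ignore-daemonsets
sudo reboot
kubectl uncordon node-name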
That said, if this does fix the issue, you'll want to keep watching the node closely to ensure that it operates normally. It's possible that problems like memory leaks will cause the node to run low on resources again over time, causing the NotReady error to recur eventually.
Best practices to prevent node NotReady errors
Successfully troubleshooting node NotReady errors is good. What's even better is preventing them from occurring in the first place. The following best practices can help in this regard by minimizing the risk of node NotReady problems.
1. Regular monitoring and alerting
The single most important step you can take to prevent node not ready issues is to use Kubernetes monitoring tools to observe your nodes continuously and generate alerts when something looks awry.
For example, alerting tools can tell you that your node is running short on CPU, memory, or disk space well before the issue becomes critical and causes the node to stop functioning normally. Likewise, network monitoring tools can alert you to network disconnects, high network latency, or packet loss issues, which provides early warning about problems that may cause the node to become unavailable due to networking problems.
2. Resource capacity planning
Carefully planning resource capacity for nodes is another best practice for preventing NotReady errors. Capacity planning means ensuring that the servers you join to your cluster as worker nodes have enough CPU, memory, and disk space to support the workloads you intend to run on them.
In addition, you should avoid forcing pods to run on nodes that lack enough resources to handle them. For example, before creating a DaemonSet to schedule pods on a specific node, check the node's resource utilization status to ensure it's a good fit.
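For example, if the metrics-server add-on is installed in your cluster, you can check a node's current CPU and memory usage with:

kubectl top node node-name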
3. Node autoscaling
Node autoscaling allows you to increase the total number of nodes in your cluster and/or modify the resource allocations of individual nodes. Autoscaling can help to prevent "node not ready" issues by ensuring that if a node starts running short on resources, the node either receives more resources, or the cluster adds nodes and shifts some workloads to the new nodes.
4. Network topology planning
Network configurations that are highly complex, or ones where control plane nodes are distant from worker nodes, could contribute to node NotReady errors due to network connectivity issues. For that reason, consider trying to keep your network topology and configuration simple. For example, assign control plane nodes and worker nodes to the same subnet if possible.
To be clear, having a complex network topology doesn't necessarily mean your nodes will end up being NotReady, and there are situations where you have little control over the network anyway. But as a general best practice, if you can keep your network design simpler, do it.
Solving Kubernetes node errors with groundcover
As a comprehensive Kubernetes monitoring and observability platform, groundcover provides the visibility you need to detect, troubleshoot, and resolve node NotReady errors.
With groundcover, you can continuously track Kubernetes metrics and node resource utilization. You can also drill down to get details about individual nodes. The result is the ability not just to detect issues fast, but also to investigate their context and get to the root of the problem as rapidly as possible.
Keeping nodes at the ready
Without properly functioning nodes, your Kubernetes cluster may as well not exist at all. That's why it's critical to know how to diagnose and troubleshoot node NotReady errors – and, even better, to adopt best practices that help prevent these issues from occurring in the first place.
FAQs
Here are answers to common questions about CrashLoopBackOff.
How do I delete a CrashLoopBackOff Pod?
To delete a Pod that is stuck in a CrashLoopBackOff, run:
kubectl delete pods pod-name
If the Pod won't delete – which can happen for various reasons, such as the Pod being bound to a persistent storage volume – you can run this command with the --force flag to force deletion. This tells Kubernetes to remove the Pod immediately instead of waiting for graceful termination.
How do I fix CrashLoopBackOff without logs?
If you don't have Pod or container logs, you can troubleshoot CrashLoopBackOff using the command:
kubectl describe pod pod-name
The output will include information that allows you to confirm that a CrashLoopBackOff error has occurred. In addition, the output may provide clues about why the error occurred – such as a failure to pull the container image or connect to a certain resource.
If you're still not sure what's causing the error, you can use other troubleshooting methods – such as checking DNS settings and environment variables – to troubleshoot CrashLoopBackOff without having logs.
Once you determine the cause of the error, fixing it is a matter of resolving the underlying issue. For example, if you have a misconfigured file, simply update the file.
How do I fix CrashLoopBackOff containers with unready status?
If a container experiences a CrashLoopBackOff and is in the unready state, it means that it failed a readiness probe – a type of health check Kubernetes uses to determine whether a container is ready to receive traffic.
In some cases, the cause of this issue is simply that the health check is misconfigured, and Kubernetes therefore deems the container unready even if there is not actually a problem. To determine whether this might be the root cause of your issue, check which command (or commands) are run as part of the readiness check. This is defined in the container spec of the YAML file for the Pod. Make sure the readiness checks are not attempting to connect to resources that don't actually exist.
If your readiness probe is properly configured, you can investigate further by running:
kubectl get events
This will show events related to the Pod, including information about changes to its status. You can use this data to figure out how far the Pod progressed before getting stuck in the unready status. For example, if its container images were pulled successfully, you'll see that.
You can also run the following command to get further information about the Pod's configuration:
kubectl describe pod pod-name
Checking Pod logs, too, may provide insights related to why it's unready.
For further guidance, check out our guide to Kubernetes readiness probes.