Nodes are one of the fundamental building blocks of a Kubernetes cluster – which is why having nodes stuck in the "not ready" state is a big problem. When nodes aren't ready, they can't host workloads. They are, in other words, dead weight until you figure out what caused them to end up being not ready and fix the issue.

Keep reading for guidance as we explain everything Kubernetes admins need to know about “node not ready” issues – including what they mean, what causes them, how to troubleshoot nodes that are not ready, and how to fix the problem.

What is the Kubernetes node not ready error?

Kubernetes "node not ready" is an error indicating that a Kubernetes node can't host workloads (to put that in slightly more technical terms, it means pods can't be scheduled on the node). It's a node status assigned by the Kubernetes node controller, which is responsible for monitoring the state of nodes.

"Node not ready" could indicate that the Kubernetes API server and other control plane components can't communicate reliably with the node at all because of problems like the node being stuck in a crash-restart loop or a flaky network connection. The error could also indicate that the node is reachable but is unable to support pods due to issues with the kubelet or kube-proxy processes running on the node.

You can determine whether a “node not ready” issue exists for any of your nodes by running:

kubectl get nodes

The output will include a list of nodes and their status (among other information). Any nodes whose status matches NotReady are in the “not ready” state.
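If you want to script this check, you can filter the output down to just the affected nodes. Here's a minimal sketch that parses sample `kubectl get nodes` output (the heredoc and node names are illustrative; on a real cluster, pipe the actual command output into awk instead):

```shell
# Sample `kubectl get nodes` output; on a real cluster, replace the
# heredoc with: kubectl get nodes | awk 'NR > 1 && $2 == "NotReady" ...'
kubectl_output=$(cat <<'EOF'
NAME      STATUS     ROLES           AGE   VERSION
node-a    Ready      control-plane   90d   v1.29.2
node-b    NotReady   <none>          90d   v1.29.2
node-c    Ready      <none>          45d   v1.29.2
EOF
)

# Print the name of every node whose STATUS column is NotReady.
not_ready=$(echo "$kubectl_output" | awk 'NR > 1 && $2 == "NotReady" { print $1 }')
echo "$not_ready"
```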

Understanding Kubernetes node states

| State | Meaning | Main causes |
|---|---|---|
| Ready | Node is operating normally. | Not applicable. |
| NotReady | Node can't schedule pods. | Resource exhaustion, problems with kubelet or kube-proxy, networking issues. |
| SchedulingDisabled | Node is "cordoned" and can't schedule pods. | Admins deliberately configured the node not to host pods. |
| Unknown | Node is entirely unreachable, so Kubernetes can't determine its state. | Node has permanently crashed; network connection to node has permanently failed. |

Before diving deeper into what causes "node not ready" errors, let's step back a bit and explain how Kubernetes tracks node status in general.

In Kubernetes, a node is a server that forms part of a Kubernetes cluster. Most nodes function as worker nodes, which means their job is to host applications (which are deployed in Kubernetes using pods). Some nodes are control-plane nodes, meaning they host the software that manages the rest of the Kubernetes cluster.

Once you join a node to a cluster, it can exist in one of the following four node states:

  • Ready: The node is functioning normally and can host applications.
  • NotReady: The node has a problem and can't host applications.
  • SchedulingDisabled: The node is functioning normally but can't host applications because admins have used Kubernetes' cordon feature to disable scheduling on that node.
  • Unknown: The node is completely unreachable, typically due either to a failed network connection or because the node has permanently shut down.
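Under the hood, the Ready/NotReady/Unknown distinction comes from the node's Ready condition as reported by the node controller. The mapping can be sketched as a small shell function (the function name is mine, for illustration; SchedulingDisabled is set separately when a node is cordoned):

```shell
# Map the node's "Ready" condition status to the state shown by
# `kubectl get nodes`: True -> Ready, False -> NotReady,
# Unknown -> Unknown (controller has lost contact with the node).
node_state_from_condition() {
  case "$1" in
    True)    echo "Ready" ;;
    False)   echo "NotReady" ;;
    Unknown) echo "Unknown" ;;
    *)       echo "unrecognized condition status: $1" >&2; return 1 ;;
  esac
}

node_state_from_condition False   # prints NotReady
```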

What causes “node not ready” errors?

There are many potential causes of the “node not ready” error. The following are the most common.

1. Insufficient system resources

Nodes that lack sufficient CPU or memory to host workloads may experience “node not ready” errors.

Typically, this issue occurs when you join a server to your cluster that simply doesn't have enough spare resources to host any workloads because all of its CPU and memory is being consumed by other, non-Kubernetes related applications or processes that are running on the node. Memory leaks or other bugs that cause the node to waste CPU or memory resources could also be the underlying problem.
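A quick way to spot memory starvation is to check available memory on the node itself. This sketch parses sample `free -m` output (the numbers and the 500 MiB threshold are illustrative assumptions, not Kubernetes defaults; on a real node, run `free -m` directly):

```shell
# Sample `free -m` output; on a real node, pipe `free -m` into awk.
free_output=$(cat <<'EOF'
               total        used        free      shared  buff/cache   available
Mem:            7822        7500          90          12         232         110
Swap:              0           0           0
EOF
)

# Flag the node if available memory drops below an illustrative
# 500 MiB threshold.
avail_mb=$(echo "$free_output" | awk '/^Mem:/ { print $7 }')
if [ "$avail_mb" -lt 500 ]; then
  echo "low memory: ${avail_mb} MiB available"
else
  echo "memory ok: ${avail_mb} MiB available"
fi
```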

2. kubelet issues

Kubelet is an agent that runs on each node and manages the node's connection with the cluster. If kubelet experiences a problem, it could lead to Kubernetes node not ready problems because Kubernetes can no longer reliably communicate with the node via kubelet.

In general, kubelet issues are rare because kubelet is stable software. But you may experience situations where the node's operating system kills the kubelet process to free up CPU or memory. Or, you may be running a buggy version of kubelet, especially if you're using an experimental Kubernetes release.
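To check whether the operating system's OOM killer terminated kubelet, you can grep the kernel log for kill events. The log lines below are a fabricated sample for illustration; on a real node, run `dmesg | grep -i 'killed process'` or `journalctl -k`:

```shell
# Sample kernel log lines (illustrative); on a real node, use
# dmesg or journalctl -k instead of this heredoc.
kernel_log=$(cat <<'EOF'
[12345.6] Out of memory: Killed process 1423 (kubelet) total-vm:2097152kB
[12346.1] oom_reaper: reaped process 1423 (kubelet)
EOF
)

# Extract the PID and process name of any OOM-killed process.
oom_victims=$(echo "$kernel_log" | grep -o 'Killed process [0-9]* ([^)]*)')
echo "$oom_victims"
```

If kubelet shows up here, the underlying fix is freeing memory on the node, not just restarting kubelet.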


3. kube-proxy issues

Each node in a Kubernetes cluster also runs kube-proxy, a networking agent whose main job is to enforce a networking configuration on each node that matches the network Services configured through Kubernetes. Problems with kube-proxy could cause "node not ready" errors by preventing the node from communicating normally with the control plane.

As with kubelet, kube-proxy is typically stable and issues with it are rare. But the operating system could kill kube-proxy for some reason, or buggy code could trigger unusual kube-proxy behavior.

4. Networking issues

Even if kube-proxy is functioning normally, problems with other networking software or infrastructure could lead to “node not ready” problems. The network that connects a node to the cluster might simply be flaky, causing intermittent disconnects. Or, problems like IP address conflicts (which happen when the node is assigned the same IP address as other endpoints on the network) could make it difficult for the control plane to reach the node reliably.
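One way to quantify a flaky link is to measure packet loss with ping and alert on anything above zero. This sketch parses a sample ping summary line (the sample and its numbers are illustrative; on a real network, run something like `ping -c 20 <node-ip>` and parse its final summary line):

```shell
# Sample `ping -c 20 <node-ip>` summary line (illustrative).
ping_summary="20 packets transmitted, 14 received, 30% packet loss, time 19028ms"

# Extract the packet-loss percentage from the summary.
loss=$(echo "$ping_summary" | sed -n 's/.* \([0-9][0-9]*\)% packet loss.*/\1/p')

if [ "$loss" -gt 0 ]; then
  echo "packet loss detected: ${loss}%"
else
  echo "link looks clean"
fi
```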

How to troubleshoot “node not ready” issues

Use the following steps to troubleshoot problems with nodes stuck in the NotReady state.

1. Confirm node status

First, double-check that your node is indeed in the NotReady node status. As noted above, you can do this by running:

kubectl get nodes

If you've just noticed this issue for the first time, it may be worth waiting a few minutes and checking again. Occasionally, the “node not ready” issue will resolve itself (especially in cases where the problem is due to a fluke, like a short-lived networking problem that doesn't frequently occur).

2. Connect to node

As a next step, connect to the node to make sure it's definitely up and functioning. This allows you to rule out issues like the node crashing or being completely unreachable via the network.

The best way to connect to the node will depend on how you set up your nodes. But in most cases, you can use an SSH command like the following:

ssh user@node-name

3. Describe node

Assuming the node is indeed up and running, the next step in the troubleshooting process is to use kubectl to get more information about the node. You can do this by running:

kubectl describe node node-name

(Replace node-name with the actual name of the node.)

Review the output, looking in particular at the following sections:

  • Conditions: This tells you whether the node is experiencing any adverse conditions, such as MemoryPressure (meaning it's running low on memory) or DiskPressure (meaning it's low on disk space due to Kubernetes disk pressure problems). If one of these conditions is true, it's likely the cause of the issue, and you can resolve it by mitigating the problem – such as by killing processes to free up memory, in the case of MemoryPressure. (This section will also tell you that the node is in the NotReady state, but you already know that.)
  • Events: This will typically tell you when the node first became NotReady. It may also include information about other relevant events, like failure to start containers.
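Scanning the Conditions section for anything that is True can also be scripted. This sketch runs against a sample Conditions table (the values are illustrative; on a real cluster, pipe the output of `kubectl describe node node-name` into awk instead):

```shell
# Sample Conditions section from `kubectl describe node` (abbreviated);
# on a real cluster, parse the actual describe output instead.
conditions=$(cat <<'EOF'
Type             Status
MemoryPressure   True
DiskPressure     False
PIDPressure      False
Ready            False
EOF
)

# Print any pressure condition that is currently True.
pressures=$(echo "$conditions" | awk '$1 ~ /Pressure/ && $2 == "True" { print $1 }')
echo "$pressures"
```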

4. View node and kubelet logs

If no node conditions or events help to explain what caused the Kubernetes node not ready error, the next step is to examine the node and kubelet logs. The exact location of logs varies between operating systems, but on most Linux distributions, you can find most logs in /var/log. The most important log file is typically syslog.

So, SSH into the node and open up syslog by running:

less /var/log/syslog

As you review the log, look for events related to kubelet or kube-proxy. If these processes have shut down or been killed, you'll typically find information about those events in this log.

Depending on how you installed Kubernetes, you can also typically view kubelet logs using:

journalctl -u kubelet

As with syslog, reviewing the kubelet logs can help identify events related to kubelet crashing or otherwise behaving erratically.

5. Review other node details

If you're still at a loss as to why the Kubernetes node is NotReady, there are a few other things you can check while logged into the node:

  • The top command on most Linux distributions will display information about running processes and how many resources they are using. If kubelet or kube-proxy are misbehaving because of issues like memory leaks, this data may clue you in.
  • The df command displays data about disk space usage. If the node is running very low on disk space, this will tell you. It will also tell you exactly which partition is running out of space, in the event that there are multiple partitions.
  • The netstat command displays information about network connections, which may be useful for identifying unusual network behavior.

Generally, most of the relevant data you can get from these commands is also recorded in syslog. But in certain cases – such as when the system has run out of disk space and can no longer append events to syslog – it may not be, so it's worth performing these additional checks.
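The `df` check in particular is easy to automate. This sketch flags any partition above an illustrative 90% usage threshold, run here against sample `df -h` output (the mount points and sizes are fabricated for illustration; on a real node, pipe `df -h` into awk):

```shell
# Sample `df -h` output; on a real node, use `df -h` directly.
df_output=$(cat <<'EOF'
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   49G  1.0G  98% /
/dev/sdb1       200G   80G  120G  40% /var/lib/containerd
EOF
)

# Print the mount point of any partition above 90% usage
# (the threshold is an assumption, not a Kubernetes default).
full=$(echo "$df_output" | awk 'NR > 1 { gsub("%", "", $5); if ($5 + 0 > 90) print $6 }')
echo "$full"
```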

6. Verify network connectivity

In some cases, the node's networking configuration may appear valid based on information provided by the Kubernetes node itself, but this doesn't necessarily mean the Kubernetes control plane can reach the node.

To check for issues in the connection between the node and the control plane, first determine the node's IP address, which you can find by running the following kubectl get nodes command:

kubectl get nodes -o wide

Then, SSH into a control plane node and run the following command:

traceroute node-ip-address

Replace node-ip-address with the IP address that kubectl reports for the node.

The output will display data about the flow of network packets between the control plane and the node. If packets are being held up at some point on the network – such as when they exit a subnet – this information will help you identify the problem.
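Looking up the node's internal IP from the wide output can also be scripted. This sketch parses sample `kubectl get nodes -o wide` output with the columns abbreviated for readability (real output includes more columns, such as EXTERNAL-IP; adjust the column index accordingly):

```shell
# Sample `kubectl get nodes -o wide` output (columns abbreviated);
# on a real cluster, pipe the actual command output into awk.
wide_output=$(cat <<'EOF'
NAME     STATUS     ROLES    AGE   VERSION   INTERNAL-IP
node-a   Ready      <none>   90d   v1.29.2   10.0.1.10
node-b   NotReady   <none>   90d   v1.29.2   10.0.1.11
EOF
)

# Look up the INTERNAL-IP for a given node name.
node_ip=$(echo "$wide_output" | awk -v node="node-b" '$1 == node { print $6 }')
echo "$node_ip"   # then, from a control plane node: traceroute "$node_ip"
```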

7. Check kube-system components

Kube-system is a namespace that hosts objects created by the control plane, including kube-proxy. Verifying the status of resources running in this namespace can be helpful for troubleshooting in cases where an issue on the control plane side, like a failed kube-proxy pod, has caused nodes to become NotReady (that said, if the issue lies with the control plane, it's likely that most or all of your nodes will become NotReady, so this is rarely the culprit).
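A quick health check here is to list kube-system pods and flag anything that isn't Running. This sketch does that against sample output (the pod names and statuses are illustrative; on a real cluster, pipe `kubectl get pods -n kube-system` into awk):

```shell
# Sample `kubectl get pods -n kube-system` output; on a real cluster,
# pipe the actual command output into awk instead.
pods_output=$(cat <<'EOF'
NAME                     READY   STATUS             RESTARTS   AGE
coredns-5d78c9869d-abc   1/1     Running            0          90d
kube-proxy-xyz12         0/1     CrashLoopBackOff   12         90d
EOF
)

# Print the name of any kube-system pod that is not Running.
unhealthy=$(echo "$pods_output" | awk 'NR > 1 && $3 != "Running" { print $1 }')
echo "$unhealthy"
```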

8. Restart kubelet and kube-proxy

Restarting the kubelet service and kube-proxy on the node may help to resolve Kubernetes node “not ready” issues. In addition, watching log events and resource utilization by kubelet and kube-proxy as they restart could provide insight into why they are not functioning normally. For example, you may notice that one of these processes steadily increases its memory usage over time, which is an indication of a memory leak.

On most Linux distributions, you can restart the kubelet service with:

sudo systemctl restart kubelet

How you restart kube-proxy depends on how it was deployed. If it runs as a systemd service, restart it the same way (sudo systemctl restart kube-proxy). In clusters set up with kubeadm, kube-proxy typically runs as a DaemonSet pod in the kube-system namespace instead, so you restart it by deleting its pod and letting Kubernetes recreate it:

kubectl delete pod -n kube-system -l k8s-app=kube-proxy

9. Restart the node

As a final troubleshooting step, you can try restarting the entire Kubernetes node. While this won't necessarily tell you why the issue occurred, it may resolve it in cases where the problem stemmed from a temporary failure or misconfiguration.

That said, if this does fix the issue, you'll want to keep watching the node closely to ensure that it operates normally. It's possible that problems like memory leaks will cause the node to run low on resources again over time, causing the NotReady error to recur eventually.

Best practices to prevent node NotReady errors

Successfully troubleshooting node NotReady errors is good. What's even better is preventing them from occurring in the first place. The following best practices can help in this regard by minimizing the risk of node NotReady problems.

1. Regular monitoring and alerting

The single most important step you can take to prevent node not ready issues is to use Kubernetes monitoring tools to observe your nodes continuously and generate alerts when something looks awry.

For example, alerting tools can tell you that your node is running short on CPU, memory, or disk space well before the issue becomes critical and causes the node to stop functioning normally. Likewise, network monitoring tools can alert you to network disconnects, high network latency, or packet loss issues, which provides early warning about problems that may cause the node to become unavailable due to networking problems.
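The core of such an alert rule is just a threshold comparison. This sketch shows the idea as a shell function (the function name and the thresholds are illustrative assumptions, not defaults from any particular monitoring tool):

```shell
# Illustrative alert rule: warn when usage crosses a threshold well
# before the node becomes unusable.
check_usage() {  # args: resource-name usage-percent warn-threshold
  if [ "$2" -ge "$3" ]; then
    echo "WARN: $1 at $2% (threshold $3%)"
  else
    echo "OK: $1 at $2%"
  fi
}

check_usage memory 87 80   # prints WARN: memory at 87% (threshold 80%)
check_usage disk 41 85     # prints OK: disk at 41%
```

In practice you'd evaluate a rule like this continuously against live metrics and route the WARN output to a paging or notification system.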

2. Resource capacity planning

Carefully planning resource capacity for nodes is another best practice for preventing NotReady errors. Capacity planning means ensuring that the servers you join to your cluster as worker nodes have enough CPU, memory, and disk space to support the workloads you intend to run on them.

In addition, you should avoid forcing pods to run on nodes that lack enough resources to handle them. For example, before creating a DaemonSet to schedule pods on a specific node, check the node's resource utilization status to ensure it's a good fit.

3. Node autoscaling

Node autoscaling allows you to increase the total number of nodes in your cluster and/or modify the resource allocations of individual nodes. Autoscaling can help to prevent “node not ready” issues by ensuring that if a node starts running short on resources, the node either receives more resources, or the cluster adds nodes and shifts some workloads to the new nodes.

4. Network topology planning

Network configurations that are highly complex, or ones where control plane nodes are distant from worker nodes, could contribute to node NotReady errors due to network connectivity issues. For that reason, consider trying to keep your network topology and configuration simple. For example, assign control plane nodes and worker nodes to the same subnet if possible.

To be clear, having a complex network topology doesn't necessarily mean your nodes will end up being NotReady, and there are situations where you have little control over the network anyway. But as a general best practice, if you can keep your network design simpler, do it.

Solving Kubernetes node errors with groundcover

As a comprehensive Kubernetes monitoring and observability platform, groundcover provides the visibility you need to detect, troubleshoot, and resolve node NotReady errors.


With groundcover, you can continuously track Kubernetes metrics and node resource utilization. You can also drill down to get details about individual nodes. The result is the ability not just to detect issues fast, but also to investigate their context and get to the root of the problem as rapidly as possible.

Keeping nodes at the ready

Without properly functioning nodes, your Kubernetes cluster may as well not exist at all. That's why it's critical to know how to diagnose and troubleshoot node NotReady errors – and, even better, to adopt best practices that help prevent these issues from occurring in the first place.

FAQs

Here are answers to common questions about CrashLoopBackOff.

How do I delete a CrashLoopBackOff Pod?

To delete a Pod that is stuck in a CrashLoopBackOff, run:

kubectl delete pods pod-name

If the Pod won't delete – which can happen for various reasons, such as the Pod being bound to a persistent storage volume – you can run this command with the --force flag to force deletion. This tells Kubernetes to ignore errors and warnings when deleting the Pod.

How do I fix CrashLoopBackOff without logs?

If you don't have Pod or container logs, you can troubleshoot CrashLoopBackOff using the command:

kubectl describe pod pod-name

The output will include information that allows you to confirm that a CrashLoopBackOff error has occurred. In addition, the output may provide clues about why the error occurred – such as a failure to pull the container image or connect to a certain resource.

If you're still not sure what's causing the error, you can use the other troubleshooting methods described above – such as checking DNS settings and environment variables – to troubleshoot CrashLoopBackOff without having logs.

Once you determine the cause of the error, fixing it is as easy as resolving the issue. For example, if you have a misconfigured file, simply update the file.

How do I fix CrashLoopBackOff containers with unready status?

If a container experiences a CrashLoopBackOff and is in the unready state, it means that it failed a readiness probe – a type of health check Kubernetes uses to determine whether a container is ready to receive traffic.

In some cases, the cause of this issue is simply that the health check is misconfigured, and Kubernetes therefore deems the container unready even if there is not actually a problem. To determine whether this might be the root cause of your issue, check which command (or commands) are run as part of the readiness check. This is defined in the container spec of the YAML file for the Pod. Make sure the readiness checks are not attempting to connect to resources that don't actually exist.

If your readiness probe is properly configured, you can investigate further by running:

kubectl get events

This will show events related to the Pod, including information about changes to its status. You can use this data to figure out how far the Pod progressed before getting stuck in the unready status. For example, if its container images were pulled successfully, you'll see that.

You can also run the following command to get further information about the Pod's configuration:

kubectl describe pod pod-name

Checking Pod logs, too, may provide insights related to why it's unready.

For further guidance, check out our guide to Kubernetes readiness probes.
