Shahar Azulay, Founder and CEO
10 minute read, May 30th, 2023

Kubernetes has done to software monitoring and management what modern automotive technologies did to car repair: It has made it much harder to troubleshoot problems. Just as DIY mechanics often complain that the complexity of modern cars makes them way too complicated to maintain and fix, the complexity of Kubernetes makes it much more difficult to troubleshoot workload performance problems in many cases.

Nonetheless, the harsh truth is that if you want to use Kubernetes, you need to be able to troubleshoot it effectively. Keep reading for guidance on that topic as we explain the essentials of Kubernetes troubleshooting, as well as how to respond to common Kubernetes error codes.

What Is Kubernetes Troubleshooting?

Kubernetes troubleshooting is the process of detecting and remediating any type of performance issue that arises within a Kubernetes environment. Common performance problems that you might encounter on Kubernetes include:

  • Containers or Pods that fail to start.
  • Containers or Pods that take a long time to start.
  • Applications that are slow to respond to requests.
  • Applications that can't interface with the network properly.
  • The unexpected crash of a container or Pod.
  • Pods being placed on the wrong nodes, leading to insufficient availability of resources.
  • Slow application performance due to poor choice of resource limits.

This is only a partial list. In a production Kubernetes environment, you could run into any number of potential performance issues that you'll need to identify and troubleshoot to avoid disruptions to your end-users.

Three Pillars of Kubernetes Troubleshooting

Like observability, which is often said to have three "pillars" (metrics, logs, and traces), Kubernetes troubleshooting depends on three key components: Understanding, Management, and Prevention.

Understanding

Understanding means determining the state of your workloads, identifying when you have an issue, and assessing what it will take to fix the problem.

For example, imagine that one of your applications has become slow to respond to requests. To gain an understanding of the issue, you might use kubectl to list the application's Pods, then inspect Pod logs and error codes to investigate the status of each Pod. Using this approach, you might identify a specific Pod that has crashed. Going deeper, you explore resource utilization metrics for the node that hosted the Pod, then discover that the node's CPU was maxed out. You then determine that the Pod probably crashed because its host node lacked enough CPU. Finally, you conclude that Kubernetes couldn't reschedule the Pod on a different node because you deployed the Pod using a DaemonSet, which required it to run on that specific node.
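As an illustrative sketch, that investigation might use kubectl commands along these lines (the Pod, label, and node names are hypothetical, and kubectl top requires the metrics-server add-on):

kubectl get pods -l app=my-app
kubectl logs my-app-7d9c5b-x2k4f
kubectl describe pod my-app-7d9c5b-x2k4f
kubectl describe node worker-node-1
kubectl top node worker-node-1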

Management

Once you understand an issue, you can manage it. Management means taking steps to fix whatever caused the problem.

To go back to the example we just mentioned about a Pod that has crashed because its host node has run out of CPU, the easiest way to manage the problem would be to modify the DaemonSet so that the Pod can run on a different node. Or, you might decide to replace the DaemonSet with a standard Deployment, which would allow Kubernetes to reschedule the Pod on any node.

A third option would be to allocate more CPU resources to the node (assuming it's a virtual machine that can be reconfigured this way). But that would probably only make sense if you truly need the Pod to operate on that specific node for some reason.

Prevention

The final key element in troubleshooting is prevention, which means taking steps to prevent the issue (or similar problems) from recurring.

In the example we're discussing, prevention might involve setting up alerts in your Kubernetes observability software so that you'll be warned when node CPU usage exceeds a certain threshold – such as 80 percent – and can take action before Pods crash. You could also consider strategies like node autoscaling (if it's supported by your Kubernetes environment), which would automatically add nodes to your cluster to avoid exhaustion of resources. (Of course, node autoscaling would only help prevent Pod crashes if Kubernetes can reschedule Pods to a new node in the event that the original node is running out of CPU, and that only works if DaemonSet configurations don't require Pods to run on specific nodes.)
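As a rough sketch, if you happen to collect node metrics with Prometheus and node_exporter, an alerting rule like the following would fire when a node's CPU usage stays above 80 percent (the rule name, threshold, and duration are illustrative):

groups:
  - name: node-resources
    rules:
      - alert: NodeHighCpuUsage
        # Fraction of CPU time not spent idle, averaged per node over 5 minutes
        expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} CPU usage has been above 80% for 10 minutes"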

Kubernetes Troubleshooting Challenges

Kubernetes troubleshooting would be easy if Kubernetes were a straightforward, uncomplicated system.

Alas, it's not. Kubernetes is a very complex platform that includes a variety of distinct components – an API server, an etcd key-value store, control plane nodes, worker nodes, Pods, various network resources and more. These components interact with each other in complicated ways, so it's often not obvious what the root cause of a performance issue is based simply on its surface-level manifestation.

For example, imagine you're troubleshooting an application hosted on Kubernetes that is experiencing high latency. You know the latency rate for the app, but that information alone tells you little about the root cause of the problem, which could be any of the following:

  • Congestion on the network.
  • A problematic configuration with the networking plugin you're using.
  • Buggy code in the application that’s causing it not to respond to requests quickly.
  • Insufficient resources for the Pod that hosts the app, causing slow performance.

We could go on, but the point is this: Kubernetes troubleshooting is tricky because there are so many potential root causes to sort through, and so many individual resources that you need to monitor and observe to maintain the visibility necessary to trace problems to their root cause.

Common Kubernetes Errors and Their Fixes

Fortunately, Kubernetes doesn't leave you totally in the dark when you encounter a performance issue. It provides various error codes, which in many cases are the best piece of information available for figuring out what triggered a problem.

How to Identify Exit Codes and Errors

The technique for finding exit codes and error events is the same in most cases, regardless of which type of issue you're looking for. Here's an overview of the process.

First, check for exit codes using the kubectl describe command to request information about Pods. The output includes any exit codes associated with containers in the Pod.

For example, this command provides exit code status and other information for the Pod named my-pod:

kubectl describe pod my-pod

If any containers exited with a code, the output of this command will specify an exit code in a format like the following:
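(The snippet below is an abbreviated, illustrative example of the relevant part of the output.)

Last State:     Terminated
  Reason:       Error
  Exit Code:    1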

If you want to grab exit codes using a one-line command, you can pipe the kubectl output into grep using a command such as:
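For example, using the same hypothetical Pod name as above:

kubectl describe pod my-pod | grep -i "exit code"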

To detect error events, use the kubectl get pods command. If any Pods crashed due to errors like being OOMKilled or getting stuck in a CrashLoopBackOff state, you'll see the issue noted under the STATUS column. For example:
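Illustrative output might look like this (the Pod names are hypothetical):

NAME        READY   STATUS             RESTARTS   AGE
my-pod      0/1     CrashLoopBackOff   5          10m
other-pod   0/1     OOMKilled          2          3m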

Alternatively, check the Pod logs using the kubectl logs command. The logs also record error events.

Now that we know how to get exit codes and error statuses, let's look at common exit codes and error statuses in Kubernetes and how to fix them.

Exit Code 1

Exit code 1 in Kubernetes means that a container terminated due to an application error. This typically indicates a problem with the container image, or with code inside the image.

How to Fix Exit Code 1

If you encounter exit code 1, try running the container directly from the command line, instead of deploying it in Kubernetes, to verify that it starts properly. If that works, ensure that the image you're pointing to in your Kubernetes deployment is not corrupted.
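As a minimal sketch, assuming Docker is available locally and using a hypothetical image name:

# Run the image outside Kubernetes to see whether it starts and what it prints
docker run --rm registry.example.com/my-app:latest

# Check the exit code of the last command
echo $?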

Exit Code 125

Exit code 125 means that a container failed to run because the command that Kubernetes tried to use to run it didn't execute successfully.

How to Fix Exit Code 125

If you see this error code, check the commands inside your container image for typos or undefined arguments or flags. You should also ensure that permissions settings are configured properly.

Exit Code 143

Exit code 143 happens when a container receives the SIGTERM signal. This is a signal sent by the operating system that tells a container to shut down.

How to Fix Exit code 143

Exit code 143 often does not indicate a problem at all; in many cases, it simply means that the orchestration engine asked the container to shut down for a legitimate reason. But if your containers keep shutting down with code 143 when they shouldn't, look at the kubelet logs to see what the source of the SIGTERM request was.
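For example, on nodes where the kubelet runs as a systemd service (an assumption that holds for most standard installations), you can review its recent logs like this:

journalctl -u kubelet --since "1 hour ago"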

Read more about troubleshooting Exit Code 143.

Exit Code 137

Exit code 137 in Kubernetes indicates that a Pod was terminated by the Linux SIGKILL signal (also known as signal 9). This typically occurs when a container exceeds its allocated memory limit, but it can also happen due to a failed health check.

How to Fix Exit Code 137

To address exit code 137, first, check if the container is exceeding its memory limits by reviewing resource allocations. You may need to increase the memory limit or optimize the container's memory usage. Additionally, monitor kubelet logs to confirm the cause of the termination. If memory issues persist, inspect your application code for potential memory leaks, using load testing tools and debuggers to identify and fix the problem.
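As a sketch, memory requests and limits are set per container in the Pod spec; the names and values below are purely illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: my-app
      image: registry.example.com/my-app:latest
      resources:
        requests:
          memory: "256Mi"
        limits:
          memory: "512Mi"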

Read more about troubleshooting Exit Code 137.

Exit Code 139

Exit code 139 means that a container received the SIGSEGV signal from the operating system running on its host node.

On Linux and Unix-like operating systems, SIGSEGV is a termination signal that forces a process to shut down. The operating system sends it when it detects a process trying to access memory that either does not exist or that the process lacks permission to access – a segmentation fault, or "segfault" as avid Linux enthusiasts call it.

When a container receives SIGSEGV, it is typically terminated. That's not ideal, since the norm is to keep containers operational unless you deliberately shut them down. But the alternative could be worse: if multiple processes were left fighting over the same memory address, the entire server might crash. Picture all the dogs in a neighborhood rushing into a single yard to brawl – utter chaos that disrupts everything, because no container can securely access memory.

In other words, the SIGSEGV error issued by the operating system is a preventive measure, intended to avert a much larger-scale crisis.

How to Fix Exit Code 139

To fix exit code 139, you first need to figure out why the operating system wants to shut down your container. The most common causes include:

  • Library compatibility problems.
  • Buggy code.
  • Hardware compatibility issues (especially when running containers on servers other than x86 systems).

Once you determine the root cause of the error, you can address it – by, for example, updating code.

For a deeper dive on this topic, check out our complete guide to troubleshooting Exit Code 139.

CrashLoopBackOff

CrashLoopBackOff in Kubernetes, a common but solvable problem, occurs when a container repeatedly fails to start. Kubernetes will automatically keep trying to restart the container, waiting – or "backing off" – for increasingly long intervals between attempts, up to a cap of five minutes between restarts. Read more about troubleshooting CrashLoopBackOff.

How to Fix CrashLoopBackOff

CrashLoopBackOff events can occur for a variety of reasons, and they are therefore one of the more difficult problems to troubleshoot in Kubernetes. But to resolve them, start by checking for the most obvious causes of CrashLoopBackOff, which include lack of sufficient resources, broken deployment configurations, and problems with the application or image you’re trying to run.

ImagePullBackOff

An ImagePullBackOff error means that Kubernetes couldn't pull the image for a container. As with CrashLoopBackOff, Kubernetes will repeatedly retry pulling the image if it fails on the first attempt, backing off for longer and longer intervals between retries.

How to Fix ImagePullBackOff

ImagePullBackOff usually happens either because your deployment configuration doesn't point to the right image registry or path, or because there is an issue (like lack of network connectivity) with your registry.

So, to fix the problem, first make sure the container image path is properly configured for your Deployment. If it is, try pulling the image manually to check for network connectivity issues.
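For example, assuming Docker is available and using a hypothetical image path:

# Verify the registry is reachable and the image path and tag are correct
docker pull registry.example.com/my-app:1.2.3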

Node Not Ready

The Node Not Ready error appears when a node in your Kubernetes cluster fails to reach the "ready" state, which is the state it needs to be in to host workloads. This typically happens because of insufficient resources on the node or an issue starting the kubelet agent on the node.

How to Fix Node Not Ready

The best way to troubleshoot and fix this type of node status issue is to check the kubelet logs of affected worker nodes, as well as any operating system logs on the node, for information about why the node is failing to achieve a ready state.
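A minimal starting point, with a hypothetical node name:

# Identify nodes that are not in the Ready state
kubectl get nodes

# Inspect conditions and recent events for an affected node
kubectl describe node worker-node-1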

CreateContainerConfigError

CreateContainerConfigError means that a container that was in a pending state failed to transition successfully to a running state. This is usually due to missing information in the deployment configuration for the container, such as a problem with a Secret or ConfigMap that the container depends on.

How to Fix CreateContainerConfigError

To fix the issue, run kubectl describe pod pod-name to view details about any ConfigMaps associated with the Pod. Make sure any referenced ConfigMaps exist and that the Pod has permissions to access them. You can also use kubectl describe secret secret-name to view information about Secrets, then verify that they are properly configured.
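For instance, you can confirm that the referenced objects actually exist in the Pod's namespace (the names here are hypothetical):

kubectl get configmap my-config -n my-namespace
kubectl get secret my-secret -n my-namespace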

Read more in our guide to fixing CreateContainerConfigError.

Kubernetes OOMKilled

A Kubernetes OOMKilled error indicates that a container was shut down because it was using more memory than allowed.

How to Fix Kubernetes OOMKilled

To troubleshoot this error, check whether any memory limits are in place for the container and whether they are appropriate for the container's requirements.

You should also make sure you've defined the right Quality of Service (QoS) class for the Pod in question. There are three QoS classes – Guaranteed, Burstable, and BestEffort – and if a cluster runs short on memory, Kubernetes may terminate containers that don't belong to the Guaranteed class in order to free up memory for those that do.
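As an illustrative sketch, a Pod gets the Guaranteed class when every container's requests equal its limits for both CPU and memory; the snippet below shows the resources section of a container spec with hypothetical values:

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"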

Read more about troubleshooting Kubernetes OOMKilled errors.

Troubleshooting Kubernetes Clusters

Now that we've discussed how to troubleshoot specific Kubernetes errors, let's talk about how to troubleshoot issues that affect different components of Kubernetes, starting with the cluster as a whole.

If you're experiencing a performance issue across your entire cluster, as opposed to an individual node or Pod, the likeliest cause is a problem with your control plane. Check the logs on the control plane node or nodes for any unexpected events, such as a network connectivity issue.

You should also ensure that the size of your cluster and the overall resource availability is sufficient for your workloads. If your workloads have experienced a surge in requests, or if the number of existing Pods exceeds what your nodes can support, you might need to allocate more nodes, or change the resource allocations of existing nodes (if they are VMs with variable allocations or configurations), to deliver the resources your cluster needs to run reliably.

Troubleshooting Kubernetes Pods

To troubleshoot issues that affect specific Pods, such as frequent crashes, start by running the kubectl describe pod command, which uses this syntax:

kubectl describe pods pod-name

The output will include information about the status of the Pod. Ideally, your Pod will be in the Ready state, which means it's operating normally. But if it's stuck with a condition like PodHasNetwork, it means the Pod is connected to the network but has not yet started all of its containers – an indication that there is probably an issue getting one or more containers in the Pod up and running.

Any logs or metrics that you can collect from the Pod are also valuable for troubleshooting. Although Kubernetes doesn't create Pod logs directly, you can use an observability tool to monitor the resource consumption of Pods. In addition, containers in the Pod may be configured to write log files or export them to a logging tool.

When troubleshooting Pods, it may also help to try starting the containers inside the Pod directly from the command line. If they all start successfully, this rules out problems with the container images themselves.

In some cases, simply deleting and redeploying a Pod may fix unusual issues. You can delete a Pod with the command kubectl delete pod pod-name.

If you've configured a replication controller (or a ReplicaSet or Deployment) to run many copies of a Pod, that could create issues in situations where there aren't enough nodes to support the defined number of replicas. In that case, simply change the replica settings.
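For example, if the workload is managed by a Deployment, you could lower the replica count with a command like this (the Deployment name and count are hypothetical):

kubectl scale deployment my-deployment --replicas=3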

Finally, if you discover that Pods running on a certain node keep crashing, you should remove that node from your cluster with the command kubectl delete node node-name.

Strategies for Troubleshooting Kubernetes Issues Effectively

We wish we could reveal the "one simple trick" that will make it super-easy to solve all of your Kubernetes troubleshooting woes. But the complexity of Kubernetes means that fixing issues is never that simple. Every problem is different, and you often need to be creative about your approach to Kubernetes troubleshooting to solve strange issues.

That said, you can streamline your approach to Kubernetes troubleshooting by following these guidelines:

  • Define the scope of the problem: Before you do anything else, figure out how many resources are affected by the issue. Is it just a single Pod or node, or are you seeing unusual activity across large parts of your cluster? The scope of the problem helps you determine whether the root cause is linked to a component that affects the entire cluster, or just a specific workload or node.
  • Look for error codes: Again, any error codes you can pull from Kubernetes itself are often your best starting point for identifying likely root causes of problems.
  • Identify data sources: The logs and metrics available for troubleshooting Kubernetes can vary widely depending on which observability software you've deployed and which logging options you've configured. Determine which data is available to you, since that information will play a central role in shaping your troubleshooting options.

  • Take advantage of generic mitigations: A generic mitigation is an action like redeploying a Pod or allocating more CPU or memory to a container. It doesn't resolve any underlying problems with your workloads or Kubernetes clusters, but it sometimes gets things working again. Although you should still endeavor to figure out what the root cause of a Kubernetes performance problem is so that it doesn't keep coming back, generic mitigations can at least get applications back up and running for your users.

Tools and Techniques for Efficient Kubernetes Troubleshooting

The massive popularity of Kubernetes means that there is no shortage of troubleshooting tools and resources available. In general, the tools fall into two main categories:

  • Troubleshooting tools that help monitor and fix performance issues with specific components of Kubernetes. For example, netshoot helps fix network-related problems. The tooling built into kubectl for describing Pods, nodes and so on also fits in this category, since it provides basic data about specific types of Kubernetes components that is useful when troubleshooting.
  • End-to-end observability and troubleshooting platforms, like groundcover, that continuously monitor all components of your Kubernetes cluster and provide the data necessary to contextualize complex problems.

The first set of tools is useful if you experience an issue with a narrow scope and need to trace its root source. But for complex issues whose scope and root cause are not at all obvious from the surface, a holistic observability solution is usually your best bet for getting to the root of the problem.

By the same token, complex issues typically require a broad troubleshooting technique that draws on as much of the data available to you as possible. The more data you have about each element in Kubernetes, the better positioned you are to associate unusual performance in one component with anomalies from other components, and to rule out different potential root causes of an issue.

Additional Kubernetes Errors and Issues

The advice we've offered above focuses on troubleshooting specific types of errors or exit events in Kubernetes. For the most part, these are the issues you'll run into when dealing with Pods, containers, nodes, and the control plane.

However, there are some additional errors and issues you may encounter that can't be represented by a simple code or error event. Here's a look at those problems and how to troubleshoot them.

Sudden Jumps in Load and Scale

You might notice that the resource utilization of some Pods or containers jumps up and down severely on a frequent basis. No error event will occur as long as Kubernetes is able to handle the shifts in load, but this behavior is still something you'll want to troubleshoot because it could lead to inefficient resource utilization within your cluster (since having to scale Pods up and down constantly may deprive other Pods of stable access to resources). It could also be an early warning of workload instability that, if left unaddressed, will eventually lead to a crash.

To troubleshoot the issue, first see if you can correlate the jumps with other events that might explain them. For example, if jumps occur whenever a Pod is rescheduled, you can conclude that the Pod probably requires a lot of resources at startup time. In that case, reconfiguring the Pod to start up more gracefully (by, for instance, reducing the amount of code that needs to run when its containers start) could help achieve more stable performance.

Running traces may also help you to pinpoint the reasons why a Pod or container experiences sudden jumps in resource utilization. For instance, certain types of requests may trigger high resource utilization when they reach a particular microservice within your app due to inefficient processing by the microservice. In that case, you'd want to update your app's code to fix the issue.

Poor Network Performance

Sometimes, you run into networking issues in Kubernetes that don't trigger specific errors, but that degrade the overall performance of your cluster.

Kubernetes network troubleshooting is a complex topic, and a complete guide is beyond the scope of this article. But to get started, you'll want to perform the following steps:

  • Determine whether networking performance problems are consistent across your entire cluster or limited to certain workloads. In the latter case, your issue is most likely related to how you've configured Services, load balancers or other networking resources that you've assigned to specific workloads. In the former case, it might be a problem with your Container Network Interface (CNI) or possibly your Internet connection as a whole.
  • Use the traceroute command to examine traffic flows within your cluster. Traceroute can help pinpoint where packets are being held up, which is another useful way of determining the scope and root cause of a networking issue; see the example after this list.
  • Examine logs and metrics for Pods, containers, and nodes to make sure none of them are out of resources. If they are, you could experience behavior that looks on the surface like a networking issue, but is actually caused by your workloads being unable to move packets quickly enough because they are starved of CPU or memory resources.
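As a sketch, you can run traceroute from inside a Pod with kubectl exec, assuming the traceroute binary is available in the container image (the Pod and Service names are hypothetical):

kubectl exec -it my-pod -- traceroute my-service.my-namespace.svc.cluster.local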

Secrets Access Problems

Secrets, which store sensitive data, are often used by Pods or other resources in Kubernetes for authentication or authorization. If Secrets are misconfigured, Pods may not be able to start properly.

To troubleshoot Secrets issues, start by pulling details about Secrets using a command like the following:
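For example, with a hypothetical Secret name and namespace:

kubectl describe secret my-secret -n my-namespace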

The output will include information such as which namespace the Secret is associated with and any annotations you've assigned. Verify that this information is correct. Sometimes, Secrets problems boil down to small but easily overlooked issues, such as associating a Secret with the wrong namespace.

Infrastructure Issues

Sometimes, problems in Kubernetes stem not from Kubernetes itself, but from issues with the underlying infrastructure platform you're using to host your cluster.

The best way to troubleshoot this type of problem depends on which infrastructure you are using. If you're operating a cluster using cloud-based servers, such as VMs running in Amazon EC2, check performance metrics from the cloud provider (or a third-party monitoring tool that supports the cloud environment) to determine whether you have issues like misconfigured servers or network connectivity problems. Similarly, if you're running Kubernetes on your own on-prem servers, check logs and metrics from the individual servers to assess infrastructure health.

Getting to the Root Cause Isn’t Always Easy

Kubernetes is a very complex system, which often makes it difficult to get to the root cause of performance problems. However, Kubernetes error codes offer a good starting point for investigating many types of problems. You should also draw on logs, metrics, and any other observability data sources available to you to help pinpoint the main cause of an issue.
