Every week, our engineers at Tasrie IT Services work through the same pattern: a client’s on-call engineer pages us, something is broken in production, and the clock is ticking. After hundreds of these engagements across EKS, AKS, and GKE clusters, we have catalogued the issues that appear with almost mechanical regularity.
This guide distils that experience into a systematic framework. We cover the 20 production issues we encounter most often, the five-layer debugging methodology we teach every new engineer, and the triage framework that keeps our incident response fast and focused. According to the official Kubernetes blog, misconfigurations are the root cause of up to 80 per cent of Kubernetes stability and security incidents — so most of what follows is eminently preventable.
The Triage Priority Framework
Before diving into individual issues, you need a shared language for severity. We use a four-tier priority system on every engagement. It keeps conversations short during incidents and ensures the right people are pulled in at the right time.
| Priority | Scope | Examples | Target Response |
|---|---|---|---|
| P0 — Cluster Down | Control plane or entire cluster unreachable | API server unavailable, etcd quorum lost, all nodes NotReady | Immediate (all hands) |
| P1 — Service Outage | User-facing service fully impacted | All pods in CrashLoopBackOff, cluster-wide DNS failure, ingress returning 503 | Under 15 minutes |
| P2 — Degraded Performance | Latency spikes or partial failures | CPU throttling, HPA not scaling, intermittent OOMKills | Under 1 hour |
| P3 — Warning / Non-Critical | Potential future issue detected | Node pressure building, PVC nearing capacity, certificate expiring soon | Under 24 hours |
The value of this framework is that it forces you to classify before you act. We have seen too many teams spend an hour debugging a P3 warning while a P1 service outage goes unnoticed in another namespace.
The Five-Layer Debugging Methodology
When something breaks, resist the urge to start reading random pod logs. We teach a top-down, layer-by-layer approach that ensures you never miss the obvious.
Layer 1: Cluster Health
kubectl cluster-info
kubectl get nodes
kubectl top nodes
If nodes are NotReady or the API server is unresponsive, nothing else matters. Start here. For a deeper look at node resource consumption, see our guide on checking node CPU and memory utilisation in Kubernetes.
Layer 2: Workload Status
kubectl get pods -A | grep -v Running
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
kubectl get deployments -A
This surfaces every pod that is not in a healthy Running state and shows recent events across all namespaces.
Layer 3: Deep Dive (Get, Describe, Logs)
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
The --previous flag retrieves logs from the last terminated container instance, which is essential when debugging CrashLoopBackOff.
Layer 4: Network and Service Verification
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>
kubectl exec -it <debug-pod> -- nslookup <service-name>
kubectl exec -it <debug-pod> -- curl <service-name>:<port>
This layer validates that DNS resolution works, services have endpoints, and traffic flows correctly.
Layer 5: Node and Infrastructure
kubectl debug node/<node-name> -it --image=busybox
# Inside the debug pod:
chroot /host
journalctl -u kubelet -n 100
df -h
free -m
This is where you go when the problem sits beneath Kubernetes itself — kubelet crashes, disk pressure, or kernel issues.
Pod Failures
1. CrashLoopBackOff
The single most common issue we encounter. A container starts, crashes, and Kubernetes restarts it with exponential backoff (10s, 20s, 40s, capped at five minutes).
Common causes: application bugs, wrong entrypoint command, missing environment variables, port conflicts, insufficient memory causing the process to die before the OOM killer even fires, and read-only filesystem errors.
Debug steps:
kubectl describe pod <pod-name> # Check Events section
kubectl logs <pod-name> --previous # Logs from the crashed instance
The Events section in describe often reveals the root cause faster than logs do. Look for exit codes — they tell you exactly what happened (see the exit code reference table at the end of this post).
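If you want the exit code without scanning the describe output, you can pull it straight from the pod status. A minimal sketch, assuming a single-container pod (adjust the array index if you run sidecars):

```bash
# Exit code and reason of the last terminated container instance
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
```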
2. OOMKilled (Exit Code 137)
Exit code 137 means the container received SIGKILL (128 + 9), almost always triggered by the Linux OOM killer because the container exceeded its memory limit.
Common causes: memory leaks, JVM heap sized larger than the container limit, unbounded caching, or simply underestimated memory requirements.
Debug steps:
kubectl describe pod <pod-name> # Look for "OOMKilled" in Last State
kubectl top pod <pod-name> # Current memory usage
If you are seeing OOMKills regularly, consider implementing a Vertical Pod Autoscaler to right-size your resource allocations automatically.
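As a starting point, give the container an explicit memory request and limit based on observed usage plus headroom. The numbers below are illustrative, not a recommendation:

```yaml
# Container spec fragment - size these from kubectl top / Prometheus data, not guesswork
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"   # the OOM killer fires when the container exceeds this
```

For JVM workloads, also cap the heap (for example with -XX:MaxRAMPercentage) so it stays comfortably below the container limit.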
3. ImagePullBackOff
Kubernetes cannot pull the container image from the registry.
Common causes: typo in the image name or tag, private registry without an imagePullSecret, Docker Hub rate limiting, deleted image tag, or network policy blocking egress to the registry.
Debug steps:
kubectl describe pod <pod-name> # Check the Events for the exact error
kubectl get secret -n <namespace> # Verify imagePullSecrets exist
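For private registries, the fix is usually to create a docker-registry secret and reference it from the pod spec or the service account. A sketch with placeholder values:

```bash
# Create the pull secret (placeholder registry and credentials)
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>

# Attach it to the default service account so all pods in the namespace can use it
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'
```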
4. CreateContainerConfigError
This error appears when Kubernetes cannot generate the container configuration, so it never even attempts to start the container.
Common causes: a referenced ConfigMap or Secret does not exist, the name is misspelt, or the resource is in a different namespace.
Debug steps:
kubectl describe pod <pod-name>
kubectl get configmap -n <namespace>
kubectl get secret -n <namespace>
We have written extensively about Kubernetes ConfigMap best practices if you want a deeper dive into avoiding these misconfigurations.
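If a ConfigMap value is genuinely optional, you can mark the reference as such so a missing key degrades gracefully instead of blocking container creation. A sketch, assuming a ConfigMap named app-config:

```yaml
# Container spec fragment
env:
  - name: FEATURE_FLAG
    valueFrom:
      configMapKeyRef:
        name: app-config     # must live in the same namespace as the pod
        key: feature-flag
        optional: true       # a missing ConfigMap or key no longer fails the pod
```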
5. Liveness and Readiness Probe Failures
Misconfigured probes are insidious because they cause cascading failures. A failing liveness probe restarts the container; a failing readiness probe removes the pod from Service endpoints.
Common causes: wrong probe path or port, initialDelaySeconds too short for the application’s startup time, heavyweight health endpoints that check all downstream dependencies, and CPU throttling causing the probe to time out.
Key principle: liveness probes should answer “is this process fundamentally broken?” — not “are all my dependencies working?” Use readiness probes (not liveness) for dependency checks. For slow-starting applications, add a startup probe to prevent premature liveness failures.
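A probe configuration that follows this principle might look like the sketch below; the paths, port, and timings are assumptions to adapt to your application:

```yaml
# Illustrative probe setup for a slow-starting HTTP service
startupProbe:
  httpGet: { path: /healthz, port: 8080 }
  failureThreshold: 30       # allows up to 30 x 10s = 5 minutes to start
  periodSeconds: 10
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }   # process health only, no dependency checks
  periodSeconds: 10
  timeoutSeconds: 2
readinessProbe:
  httpGet: { path: /ready, port: 8080 }     # may check downstream dependencies
  periodSeconds: 5
  timeoutSeconds: 2
```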
Scheduling and Node Issues
6. Pod Stuck in Pending
The scheduler cannot find a node that satisfies the pod’s requirements.
Common causes: insufficient CPU or memory on available nodes, taint and toleration mismatches, nodeSelector or nodeAffinity conflicts, PVC bound to a node in a different availability zone, or the cluster has hit its maximum pods-per-node limit.
Debug steps:
kubectl describe pod <pod-name> # Check the Events for scheduling failures
kubectl describe nodes # Review allocatable vs allocated resources
kubectl get events --field-selector reason=FailedScheduling
If scheduling failures involve taints, our guide on Kubernetes taints and tolerations walks through the mechanics in detail.
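When describe shows a message along the lines of "node(s) had untolerated taint", the pod needs a matching toleration (or the taint needs removing). A sketch, using a hypothetical dedicated=batch taint:

```yaml
# Pod spec fragment - tolerates a hypothetical dedicated=batch:NoSchedule taint
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"
```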
7. Node NotReady
A node stops communicating with the control plane and transitions to NotReady status.
Common causes: kubelet crash, node resource exhaustion, network partition between the node and the API server, cloud provider instance failure, or certificate expiration.
Debug steps:
kubectl describe node <node-name> # Check Conditions and Events
kubectl debug node/<node-name> -it --image=busybox
We have published a dedicated deep dive on Kubernetes Node NotReady troubleshooting that covers every scenario we have encountered.
8. Node Pressure Eviction
The kubelet monitors memory, disk, and PID resources on each node. When consumption crosses eviction thresholds, it proactively terminates pods to reclaim resources.
Common causes: application logs writing to the container’s writable layer instead of stdout, container image layer accumulation, memory leaks, and PID exhaustion from connection leaks or fork bombs.
Debug steps:
kubectl describe nodes | grep -A5 "Conditions:"
kubectl top pods --all-namespaces --sort-by=memory | head -20
The kubelet evicts pods with the lowest Quality of Service (QoS) class first. If your critical services lack resource requests, they fall into the BestEffort QoS class and are evicted before less important workloads that happen to have requests defined.
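To keep critical pods out of the first eviction wave, give every container explicit requests, and set limits equal to requests if you want the Guaranteed QoS class. The values are illustrative:

```yaml
# Container spec fragment - requests == limits on every container => Guaranteed QoS
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"
```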
9. Cluster Autoscaler Not Scaling
Pending pods should trigger node provisioning, but new nodes never appear.
Common causes (scale-up): the pod's resource request exceeds the capacity of any available machine type, missing Auto Scaling Group tags (AWS), or the node group has reached its maximum size.
Common causes (scale-down): PodDisruptionBudgets, pods with local storage, system pods without a PDB, or pod annotations preventing eviction.
Debug steps:
kubectl logs -f deployment/cluster-autoscaler -n kube-system
The Cluster Autoscaler logs are remarkably informative. They tell you exactly why a scale-up or scale-down decision was or was not made.
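If a specific pod must never block scale-down (or, conversely, must pin its node), the Cluster Autoscaler honours the safe-to-evict annotation. A sketch:

```yaml
# Pod template metadata fragment
metadata:
  annotations:
    # "false" stops the Cluster Autoscaler evicting this pod during scale-down;
    # "true" marks it safe to evict even if it uses local storage
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```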
Networking Issues
10. DNS Resolution Failures
DNS is the backbone of service discovery in Kubernetes. When CoreDNS fails, every service-to-service call breaks.
Common causes: CoreDNS pods not running or crash-looping, network policies blocking UDP port 53, incorrect /etc/resolv.conf in pods, the ndots setting causing excessive upstream lookups, or a corrupted CoreDNS ConfigMap.
Debug steps:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl run dnstest --image=busybox:1.28 --rm -it -- nslookup kubernetes.default
kubectl logs -n kube-system -l k8s-app=kube-dns
DNS issues are responsible for some of the most severe cascading failures we have seen. We discuss this further in the war stories section below.
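If the ndots setting is generating excessive upstream lookups for external hostnames, you can tune it per pod via dnsConfig. A sketch; note that lowering ndots changes how short internal names resolve, so test it before rolling it out widely:

```yaml
# Pod spec fragment - reduces search-domain expansion for external lookups
dnsConfig:
  options:
    - name: ndots
      value: "2"    # default is 5; lower values mean fewer search-domain retries
```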
11. Service Not Routing Traffic (Empty Endpoints)
The Service resource exists, but no traffic reaches the pods behind it.
Common causes: label selector mismatch between the Service and the pods, readiness probe failures removing all pods from the endpoint list, incorrect port or targetPort configuration, or a namespace mismatch.
Debug steps:
kubectl get endpoints <service-name> -n <namespace>
kubectl describe svc <service-name> -n <namespace>
kubectl get pods -n <namespace> --show-labels
If the endpoints list is empty, compare the Service’s selector labels against the pod labels character by character. A single typo is enough. For a broader understanding of Kubernetes networking, see our CNI and service mesh guide.
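The comparison that matters is between the Service's spec.selector and the pod template labels. A minimal sketch of the two places that must agree, using a hypothetical app: checkout label:

```yaml
# Service - spec fragment
spec:
  selector:
    app: checkout          # must match the pod labels exactly
  ports:
    - port: 80
      targetPort: 8080     # must match the port the container actually listens on
---
# Deployment - pod template fragment
spec:
  template:
    metadata:
      labels:
        app: checkout      # the Service selector matches on these labels
```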
12. Ingress 502, 503, and 504 Errors
These three HTTP error codes each point to a different failure mode:
- 502 Bad Gateway: the ingress controller could not reach the backend at all (no pod running or the wrong serviceName/servicePort)
- 503 Service Unavailable: no endpoints behind the service, typically because all readiness probes are failing
- 504 Gateway Timeout: the backend pod did not respond within the configured timeout (slow database query, hanging process)
Debug steps:
kubectl logs -n ingress-nginx <ingress-controller-pod>
kubectl get endpoints <backend-service>
kubectl describe ingress <ingress-name>
During rolling deployments, brief 502 errors can occur if pods are terminated before connections drain. Configure preStop hooks and tune terminationGracePeriodSeconds to mitigate this.
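A common mitigation is a short preStop sleep so the pod keeps serving while endpoint removal propagates, paired with a grace period long enough to drain. The values below are illustrative:

```yaml
# Container spec fragment
lifecycle:
  preStop:
    exec:
      command: ["sleep", "10"]    # requires the sleep binary in the image
# Pod spec
terminationGracePeriodSeconds: 45  # must exceed the preStop sleep plus drain time
```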
Storage Issues
13. PVC Stuck in Pending
A PersistentVolumeClaim cannot bind to a PersistentVolume.
Common causes: no matching StorageClass, capacity mismatch (the PVC requests more storage than any available PV offers), access mode mismatch (e.g., requesting ReadWriteMany on a storage backend that only supports ReadWriteOnce), zone or topology constraints, the storage provisioner not running, or resource quotas exceeded.
Debug steps:
kubectl describe pvc <pvc-name> -n <namespace>
kubectl get pv
kubectl get storageclass
The Events section of describe pvc almost always reveals the exact reason for the pending state.
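When the cause is a missing or mismatched StorageClass, the fix is usually to name one explicitly that your cluster actually provides. A sketch with an assumed class name:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteOnce            # must be supported by the backing storage
  storageClassName: gp3        # assumed class name - check kubectl get storageclass
  resources:
    requests:
      storage: 20Gi
```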
14. Volume Mount Errors (FailedMount)
The pod starts, but the volume cannot be attached or mounted.
Common causes: the volume is already attached to a different node (common with AWS EBS, which does not support multi-attach by default), filesystem corruption, NFS server unreachable, wrong fsGroup or securityContext, or CSI driver issues.
Debug steps:
kubectl describe pod <pod-name> # Look for FailedMount in Events
kubectl get volumeattachments # Check attachment state
Security and RBAC Issues
15. RBAC Forbidden Errors
The classic “User cannot list resource in API group at the cluster scope” message.
Common causes: missing Role or ClusterRole, missing RoleBinding or ClusterRoleBinding, wrong service account reference, or attempting to use a namespace-scoped Role for a cluster-scoped resource.
Debug steps:
kubectl auth can-i list pods --as=system:serviceaccount:<ns>:<sa-name>
kubectl get clusterrolebindings | grep <role-name>
kubectl describe rolebinding -n <namespace>
The kubectl auth can-i command is the fastest way to verify permissions. We have published a comprehensive Kubernetes RBAC audit guide for teams that need to tighten their access controls systematically.
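When can-i returns "no", the fix is a Role (or ClusterRole) plus a binding to the service account. A minimal namespace-scoped sketch; the namespace and service account names are assumptions:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: my-namespace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: my-namespace
subjects:
  - kind: ServiceAccount
    name: my-service-account
    namespace: my-namespace
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```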
16. Security Context and Admission Failures
Pods rejected by admission controllers because they violate security policies.
Common causes: running as root when the Pod Security Standard restricts it, requesting a privileged container, using hostNetwork or hostPID when not permitted, or missing required security contexts.
Debug steps:
kubectl describe pod <pod-name> # Admission rejection in Events
kubectl get podsecuritypolicies # If still using PSPs
kubectl label ns <namespace> --list | grep pod-security
For a broader treatment of Kubernetes security, see our Kubernetes security best practices guide.
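With Pod Security Admission, the policy is driven entirely by namespace labels. A sketch that enforces the baseline profile while surfacing warnings for anything that would violate restricted:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline    # rejects violating pods
    pod-security.kubernetes.io/warn: restricted     # warns, but still admits
```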
Control Plane Issues
17. API Server Unavailability or Slowness
When kubectl commands time out or fail, the API server is usually the culprit.
Common causes: etcd performance degradation, excessive LIST calls on large objects, certificate expiration, resource exhaustion on control plane nodes, or too many active watchers overwhelming the API server.
Debug steps:
kubectl cluster-info
kubectl get --raw /healthz
On managed Kubernetes services (EKS, AKS, GKE), API server issues are typically handled by the cloud provider, but you should still monitor API server latency metrics. See our EKS vs AKS vs GKE comparison for how each provider handles control plane availability differently.
18. etcd Issues (Quorum Loss, Storage Full)
etcd is the single source of truth for all cluster state. When it is unhealthy, writes fail silently — pods cannot be scheduled, ConfigMaps cannot be updated, and scaling operations hang.
Common causes: loss of quorum (two out of three control plane nodes down), database hitting its size limit (default 2 GB, maximum 8 GB), slow disk I/O on etcd nodes, or compaction not running.
Debug steps:
etcdctl endpoint health
etcdctl endpoint status --write-out=table
etcdctl alarm list
If etcd has hit its storage limit, you must run compaction followed by defragmentation before the cluster can accept writes again. This is a P0 scenario.
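A hedged sketch of the recovery sequence on a self-managed cluster; the endpoint, TLS, and revision arguments are environment-specific, and on managed control planes this is handled by the provider:

```bash
# 1. Compact the keyspace history (take the current revision from endpoint status)
etcdctl compact <current-revision>
# 2. Defragment each member to return freed space to the filesystem
etcdctl defrag
# 3. Clear the NOSPACE alarm so writes are accepted again
etcdctl alarm disarm
```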
HPA and Autoscaling Failures
19. HPA Not Scaling
The Horizontal Pod Autoscaler sees metrics but refuses to scale, or it cannot fetch metrics at all.
Common causes: the metrics-server is not running or misconfigured, pod containers lack CPU or memory requests (HPA calculates utilisation as a percentage of requests), the target utilisation is set too high or too low, or node capacity is exhausted so new replicas cannot be scheduled.
Debug steps:
kubectl get hpa -n <namespace>
kubectl describe hpa <hpa-name> -n <namespace>
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
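For reference, a minimal autoscaling/v2 manifest looks like the sketch below; the Deployment name and numbers are assumptions, and remember that the target's containers must declare CPU requests for the utilisation calculation to work:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                      # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percentage of the pod's CPU requests
```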
20. CPU Throttling Causing Latency
This one is subtle and commonly overlooked. Even when average CPU utilisation is low, the Linux Completely Fair Scheduler (CFS) can throttle burst workloads that exceed the CPU limit within a single quota period.
Common causes: CPU limits set too aggressively relative to burst patterns, the CFS quota period (typically 100ms) being too short for bursty workloads, and teams confusing average utilisation with peak utilisation.
Key metric to watch:
container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total
If this ratio exceeds 25 per cent, your workloads are experiencing significant throttling. Consider raising or removing CPU limits for latency-sensitive services while keeping CPU requests in place for scheduling purposes.
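Expressed as a Prometheus alert rule, the check might look like this sketch (it assumes cAdvisor metrics are being scraped via the kubelet):

```yaml
# Alert rule sketch - fires when a container is throttled in more than 25% of CFS periods
- alert: HighCPUThrottling
  expr: |
    sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (namespace, pod, container)
      /
    sum(rate(container_cpu_cfs_periods_total[5m])) by (namespace, pod, container)
      > 0.25
  for: 15m
  labels:
    severity: warning
```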
Essential Troubleshooting Tools
CLI Tools
| Tool | Purpose |
|---|---|
| kubectl debug | Attach ephemeral containers to running pods or debug nodes directly — essential for distroless images |
| k9s | Terminal-based interactive dashboard with real-time resource viewing and keyboard-driven navigation |
| stern | Multi-pod, multi-container log tailing with colour-coded output per pod |
| crictl | Container runtime CLI for node-level debugging (crictl ps, crictl logs, crictl inspect) |
Network Debugging
| Tool | Purpose |
|---|---|
| netshoot | Swiss Army knife container image with tcpdump, netstat, iptables, curl, dig, and more |
| dnsutils | Lightweight pod image for DNS debugging with nslookup and dig |
Observability Stack
For production clusters, we always recommend a comprehensive monitoring stack. Our Prometheus monitoring guide for Kubernetes covers the full setup, but at a minimum you need:
- Prometheus for metrics collection and alerting (CPU throttling, OOM events, pod restarts)
- Grafana for visualisation dashboards
- A log aggregation solution (Loki, Fluentd, or Fluent Bit) for centralised log search
Real-World War Stories
These are drawn from our consulting engagements and from the excellent Kubernetes Failure Stories compilation maintained by the community.
War Story 1: The DNS Cascade
A client’s CoreDNS pods were evicted during a memory pressure event. With DNS down, every service-to-service call failed. Health checks that depended on downstream services then failed, triggering a wave of liveness probe restarts. The restarts increased DNS load on the one remaining CoreDNS pod, which also got evicted. Within four minutes, the entire namespace was down.
Root cause: CoreDNS had no PodDisruptionBudget, only two replicas, and no resource requests set (placing it in the BestEffort QoS class).
Fix: We set resource requests and limits on CoreDNS, added a PDB with minAvailable: 2, increased the replica count to three, and moved CoreDNS to a dedicated node pool with guaranteed resources.
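For reference, the PodDisruptionBudget follows the standard shape; the selector below assumes the k8s-app: kube-dns label that CoreDNS pods carry in most distributions:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: coredns-pdb
  namespace: kube-system
spec:
  minAvailable: 2
  selector:
    matchLabels:
      k8s-app: kube-dns   # verify the label your CoreDNS deployment actually uses
```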
War Story 2: The Invisible CPU Throttle
A team reported P99 latency of 800ms on an API that should respond in under 50ms. CPU utilisation graphs showed only 5 per cent average usage. Everything looked healthy.
The issue was CPU limits set to 200m. The CFS quota of 20ms per 100ms period meant that any request-processing burst exceeding 20ms of CPU time was throttled until the next period. The throttled-period ratio (container_cpu_cfs_throttled_periods_total divided by container_cpu_cfs_periods_total) was sitting at 60 per cent.
Fix: We removed CPU limits entirely (keeping CPU requests at 200m for scheduling) and P99 latency dropped to 35ms. The CNCF troubleshooting guide confirms this is one of the most under-diagnosed production issues.
War Story 3: Pod Priority Preemption Gone Wrong
Inspired by the documented Grafana Labs incident, a client deployed a high-priority batch processing job without understanding the implications. The batch job preempted production API pods. The evicted API pods attempted to reschedule, but the batch job had consumed all available node resources. The production API was down for 25 minutes.
Fix: We introduced PodDisruptionBudgets for all production workloads, created a dedicated node pool for batch processing with appropriate taints, and implemented a PriorityClass hierarchy that explicitly prevented batch jobs from preempting production services.
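A sketch of the kind of hierarchy we mean: batch jobs get a lower value and a non-preempting policy, so they queue for capacity instead of evicting production pods. Names and values are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 1000
preemptionPolicy: Never        # batch pods wait for capacity rather than preempt
globalDefault: false
description: "Low-priority batch workloads; never preempts production pods."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-high
value: 100000
globalDefault: false
description: "Production services; scheduled ahead of batch workloads."
```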
Proactive Monitoring Checklist
The best troubleshooting is the kind you never have to do. Here are the alerts we deploy on every client cluster from day one:
Pod Health:
- Pod restart count > 3 in 15 minutes
- Pod in CrashLoopBackOff for > 5 minutes
- OOMKilled events
- CPU throttling ratio > 25 per cent
Node Health:
- Node NotReady for > 2 minutes
- Node memory utilisation > 85 per cent
- Node disk utilisation > 80 per cent
- Node PID count approaching limit
Cluster Components:
- CoreDNS pod count < desired replicas
- CoreDNS latency > 500ms
- etcd database size > 50 per cent of quota
- API server request latency P99 > 1 second
- Certificate expiry < 30 days
Autoscaling:
- HPA at maximum replicas for > 10 minutes
- Cluster Autoscaler pending pods for > 5 minutes
- PVC utilisation > 80 per cent
Setting these up requires a working Prometheus and Grafana stack. We consider this non-negotiable for any production cluster.
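To make that concrete, here is a hedged sketch of two of the pod-health alerts as Prometheus rules; the metric names assume kube-state-metrics is installed, and the thresholds mirror the checklist above:

```yaml
groups:
  - name: pod-health
    rules:
      - alert: PodCrashLooping
        expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is in CrashLoopBackOff"
      - alert: PodRestartingFrequently
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in 15 minutes"
```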
Container Exit Code Reference
When a container terminates, the exit code tells you exactly what happened. This table saves hours of guesswork.
| Exit Code | Signal | Meaning |
|---|---|---|
| 0 | — | Clean exit (success) |
| 1 | — | Application error (generic) |
| 2 | — | Shell misuse or invalid argument |
| 125 | — | Container runtime error (failed to start) |
| 126 | — | Command invoked cannot execute (permission issue) |
| 127 | — | Command not found (wrong entrypoint or missing binary) |
| 128 | — | Invalid exit argument |
| 137 | SIGKILL (9) | OOMKilled or kubectl delete --force |
| 139 | SIGSEGV (11) | Segmentation fault (application crash) |
| 143 | SIGTERM (15) | Graceful termination requested (normal pod shutdown) |
Exit code 137 is by far the most common non-zero code we encounter. If you see it, check memory limits first.
Reduce Production Incidents With Expert Kubernetes Support
Troubleshooting production Kubernetes issues under pressure is stressful, and every minute of downtime costs money. The patterns in this guide are preventable with the right architecture, monitoring, and operational practices in place from the start.
Our team provides comprehensive Kubernetes consulting services to help you:
- Architect resilient clusters with proper resource management, PodDisruptionBudgets, and priority class hierarchies
- Implement proactive monitoring with Prometheus, Grafana, and custom alerting tailored to your workloads
- Train your on-call engineers with runbooks, triage frameworks, and hands-on incident simulation exercises
We have managed hundreds of production clusters across EKS, AKS, and GKE. Whether you need a one-time architecture review or ongoing managed support, we can help you move from reactive firefighting to proactive reliability engineering.