Every week, our engineers at Tasrie IT Services work through the same pattern: a client’s on-call engineer pages us, something is broken in production, and the clock is ticking. After hundreds of these engagements across EKS, AKS, and GKE clusters, we have catalogued the issues that appear with almost mechanical regularity.
This guide distils that experience into a systematic framework. We cover the 20 production issues we encounter most often, the five-layer debugging methodology we teach every new engineer, and the triage framework that keeps our incident response fast and focused. According to the official Kubernetes blog, misconfigurations are the root cause of up to 80 per cent of Kubernetes stability and security incidents — so most of what follows is eminently preventable.
The Triage Priority Framework
Before diving into individual issues, you need a shared language for severity. We use a four-tier priority system on every engagement. It keeps conversations short during incidents and ensures the right people are pulled in at the right time.
| Priority | Scope | Examples | Target Response |
|---|---|---|---|
| P0 — Cluster Down | Control plane or entire cluster unreachable | API server unavailable, etcd quorum lost, all nodes NotReady | Immediate (all hands) |
| P1 — Service Outage | User-facing service fully impacted | All pods in CrashLoopBackOff, cluster-wide DNS failure, ingress returning 503 | Under 15 minutes |
| P2 — Degraded Performance | Latency spikes or partial failures | CPU throttling, HPA not scaling, intermittent OOMKills | Under 1 hour |
| P3 — Warning / Non-Critical | Potential future issue detected | Node pressure building, PVC nearing capacity, certificate expiring soon | Under 24 hours |
The value of this framework is that it forces you to classify before you act. We have seen too many teams spend an hour debugging a P3 warning while a P1 service outage goes unnoticed in another namespace.
The Five-Layer Debugging Methodology
When something breaks, resist the urge to start reading random pod logs. We teach a top-down, layer-by-layer approach that ensures you never miss the obvious.
Layer 1: Cluster Health
kubectl cluster-info
kubectl get nodes
kubectl top nodes
If nodes are NotReady or the API server is unresponsive, nothing else matters. Start here. For a deeper look at node resource consumption, see our guide on checking node CPU and memory utilisation in Kubernetes.
Layer 2: Workload Status
kubectl get pods -A | grep -v Running
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
kubectl get deployments -A
This surfaces every pod that is not in a healthy Running state and shows recent events across all namespaces.
Layer 3: Deep Dive (Get, Describe, Logs)
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
The --previous flag retrieves logs from the last terminated container instance, which is essential when debugging CrashLoopBackOff.
Layer 4: Network and Service Verification
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>
kubectl exec -it <debug-pod> -- nslookup <service-name>
kubectl exec -it <debug-pod> -- curl <service-name>:<port>
This layer validates that DNS resolution works, services have endpoints, and traffic flows correctly.
Layer 5: Node and Infrastructure
kubectl debug node/<node-name> -it --image=busybox
# Inside the debug pod:
chroot /host
journalctl -u kubelet -n 100
df -h
free -m
This is where you go when the problem sits beneath Kubernetes itself — kubelet crashes, disk pressure, or kernel issues.
Pod Failures
1. CrashLoopBackOff
The single most common issue we encounter. A container starts, crashes, and Kubernetes restarts it with exponential backoff (10s, 20s, 40s, capped at five minutes).
Common causes: application bugs, wrong entrypoint command, missing environment variables, port conflicts, insufficient memory causing the process to die before the OOM killer even fires, and read-only filesystem errors.
Debug steps:
kubectl describe pod <pod-name> # Check Events section
kubectl logs <pod-name> --previous # Logs from the crashed instance
The Events section in describe often reveals the root cause faster than logs do. Look for exit codes — they tell you exactly what happened (see the exit code reference table at the end of this post).
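If you want the exit code without scanning the describe output, you can pull it straight from the pod status. A minimal sketch, assuming a single-container pod (adjust the array index if you run sidecars):

```bash
# Exit code and reason of the last terminated container instance
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
```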
2. OOMKilled (Exit Code 137)
Exit code 137 means the container received SIGKILL (128 + 9), almost always triggered by the Linux OOM killer because the container exceeded its memory limit.
Common causes: memory leaks, JVM heap sized larger than the container limit, unbounded caching, or simply underestimated memory requirements.
Debug steps:
kubectl describe pod <pod-name> # Look for "OOMKilled" in Last State
kubectl top pod <pod-name> # Current memory usage
If you are seeing OOMKills regularly, consider implementing a Vertical Pod Autoscaler to right-size your resource allocations automatically.
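As a starting point, give the container an explicit memory request and limit based on observed usage plus headroom. The numbers below are illustrative, not a recommendation:

```yaml
# Container spec fragment - size these from kubectl top / Prometheus data, not guesswork
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"   # the OOM killer fires when the container exceeds this
```

For JVM workloads, also cap the heap (for example with -XX:MaxRAMPercentage) so it stays comfortably below the container limit.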
3. ImagePullBackOff
Kubernetes cannot pull the container image from the registry.
Common causes: typo in the image name or tag, private registry without an imagePullSecret, Docker Hub rate limiting, deleted image tag, or network policy blocking egress to the registry.
Debug steps:
kubectl describe pod <pod-name> # Check the Events for the exact error
kubectl get secret -n <namespace> # Verify imagePullSecrets exist
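For private registries, the fix is usually to create a docker-registry secret and reference it from the pod spec or the service account. A sketch with placeholder values:

```bash
# Create the pull secret (placeholder registry and credentials)
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>

# Attach it to the default service account so all pods in the namespace can use it
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'
```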
4. CreateContainerConfigError
This error appears when Kubernetes cannot generate the container configuration, so it never even attempts to start the container.
Common causes: a referenced ConfigMap or Secret does not exist, the name is misspelt, or the resource is in a different namespace.
Debug steps:
kubectl describe pod <pod-name>
kubectl get configmap -n <namespace>
kubectl get secret -n <namespace>
We have written extensively about Kubernetes ConfigMap best practices if you want a deeper dive into avoiding these misconfigurations.
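If a ConfigMap value is genuinely optional, you can mark the reference as such so a missing key degrades gracefully instead of blocking container creation. A sketch, assuming a ConfigMap named app-config:

```yaml
# Container spec fragment
env:
  - name: FEATURE_FLAG
    valueFrom:
      configMapKeyRef:
        name: app-config     # must live in the same namespace as the pod
        key: feature-flag
        optional: true       # a missing ConfigMap or key no longer fails the pod
```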
5. Liveness and Readiness Probe Failures
Misconfigured probes are insidious because they cause cascading failures. A failing liveness probe restarts the container; a failing readiness probe removes the pod from Service endpoints.
Common causes: wrong probe path or port, initialDelaySeconds too short for the application’s startup time, heavyweight health endpoints that check all downstream dependencies, and CPU throttling causing the probe to time out.
Key principle: liveness probes should answer “is this process fundamentally broken?” — not “are all my dependencies working?” Use readiness probes (not liveness) for dependency checks. For slow-starting applications, add a startup probe to prevent premature liveness failures.
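A probe configuration that follows this principle might look like the sketch below; the paths, port, and timings are assumptions to adapt to your application:

```yaml
# Illustrative probe setup for a slow-starting HTTP service
startupProbe:
  httpGet: { path: /healthz, port: 8080 }
  failureThreshold: 30       # allows up to 30 x 10s = 5 minutes to start
  periodSeconds: 10
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }   # process health only, no dependency checks
  periodSeconds: 10
  timeoutSeconds: 2
readinessProbe:
  httpGet: { path: /ready, port: 8080 }     # may check downstream dependencies
  periodSeconds: 5
  timeoutSeconds: 2
```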
Scheduling and Node Issues
6. Pod Stuck in Pending
The scheduler cannot find a node that satisfies the pod’s requirements.
Common causes: insufficient CPU or memory on available nodes, taint and toleration mismatches, nodeSelector or nodeAffinity conflicts, PVC bound to a node in a different availability zone, or the cluster has hit its maximum pods-per-node limit.
Debug steps:
kubectl describe pod <pod-name> # Check the Events for scheduling failures
kubectl describe nodes # Review allocatable vs allocated resources
kubectl get events --field-selector reason=FailedScheduling
If scheduling failures involve taints, our guide on Kubernetes taints and tolerations walks through the mechanics in detail.
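When describe shows a message along the lines of "node(s) had untolerated taint", the pod needs a matching toleration (or the taint needs removing). A sketch, using a hypothetical dedicated=batch taint:

```yaml
# Pod spec fragment - tolerates a hypothetical dedicated=batch:NoSchedule taint
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"
```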
7. Node NotReady
A node stops communicating with the control plane and transitions to NotReady status.
Common causes: kubelet crash, node resource exhaustion, network partition between the node and the API server, cloud provider instance failure, or certificate expiration.
Debug steps:
kubectl describe node <node-name> # Check Conditions and Events
kubectl debug node/<node-name> -it --image=busybox
We have published a dedicated deep dive on Kubernetes Node NotReady troubleshooting that covers every scenario we have encountered.
8. Node Pressure Eviction
The kubelet monitors memory, disk, and PID resources on each node. When consumption crosses eviction thresholds, it proactively terminates pods to reclaim resources.
Common causes: application logs writing to the container’s writable layer instead of stdout, container image layer accumulation, memory leaks, and PID exhaustion from connection leaks or fork bombs.
Debug steps:
kubectl describe nodes | grep -A5 "Conditions:"
kubectl top pods --all-namespaces --sort-by=memory | head -20
The kubelet evicts pods with the lowest Quality of Service (QoS) class first. If your critical services lack resource requests, they fall into the BestEffort QoS class and are evicted before less important workloads that happen to have requests defined.
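To keep critical pods out of the first eviction wave, give every container explicit requests, and set limits equal to requests if you want the Guaranteed QoS class. The values are illustrative:

```yaml
# Container spec fragment - requests == limits on every container => Guaranteed QoS
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"
```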
9. Cluster Autoscaler Not Scaling
Pending pods should trigger node provisioning, but new nodes never appear.
Common causes (scale-up): the pod's resource request exceeds the capacity of any available machine type, missing Auto Scaling Group tags (AWS), or the node group has reached its maximum size.
Common causes (scale-down): PodDisruptionBudgets, pods with local storage, system pods without a PDB, or pod annotations preventing eviction.
Debug steps:
kubectl logs -f deployment/cluster-autoscaler -n kube-system
The Cluster Autoscaler logs are remarkably informative. They tell you exactly why a scale-up or scale-down decision was or was not made.
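If a specific pod must never block scale-down (or, conversely, must pin its node), the Cluster Autoscaler honours the safe-to-evict annotation. A sketch:

```yaml
# Pod template metadata fragment
metadata:
  annotations:
    # "false" stops the Cluster Autoscaler evicting this pod during scale-down;
    # "true" marks it safe to evict even if it uses local storage
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```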
Networking Issues
10. DNS Resolution Failures
DNS is the backbone of service discovery in Kubernetes. When CoreDNS fails, every service-to-service call breaks.
Common causes: CoreDNS pods not running or crash-looping, network policies blocking UDP port 53, incorrect /etc/resolv.conf in pods, the ndots setting causing excessive upstream lookups, or a corrupted CoreDNS ConfigMap.
Debug steps:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl run dnstest --image=busybox:1.28 --rm -it -- nslookup kubernetes.default
kubectl logs -n kube-system -l k8s-app=kube-dns
DNS issues are responsible for some of the most severe cascading failures we have seen. We discuss this further in the war stories section below.
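If the ndots setting is generating excessive upstream lookups for external hostnames, you can tune it per pod via dnsConfig. A sketch; note that lowering ndots changes how short internal names resolve, so test it before rolling it out widely:

```yaml
# Pod spec fragment - reduces search-domain expansion for external lookups
dnsConfig:
  options:
    - name: ndots
      value: "2"    # default is 5; lower values mean fewer search-domain retries
```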
11. Service Not Routing Traffic (Empty Endpoints)
The Service resource exists, but no traffic reaches the pods behind it.
Common causes: label selector mismatch between the Service and the pods, readiness probe failures removing all pods from the endpoint list, incorrect port or targetPort configuration, or a namespace mismatch.
Debug steps:
kubectl get endpoints <service-name> -n <namespace>
kubectl describe svc <service-name> -n <namespace>
kubectl get pods -n <namespace> --show-labels
If the endpoints list is empty, compare the Service’s selector labels against the pod labels character by character. A single typo is enough. For a broader understanding of Kubernetes networking, see our CNI and service mesh guide.
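The comparison that matters is between the Service's spec.selector and the pod template labels. A minimal sketch of the two places that must agree, using a hypothetical app: checkout label:

```yaml
# Service - spec fragment
spec:
  selector:
    app: checkout          # must match the pod labels exactly
  ports:
    - port: 80
      targetPort: 8080     # must match the port the container actually listens on
---
# Deployment - pod template fragment
spec:
  template:
    metadata:
      labels:
        app: checkout      # the Service selector matches on these labels
```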
12. Ingress 502, 503, and 504 Errors
These three HTTP error codes each point to a different failure mode:
- 502 Bad Gateway: the ingress controller could not reach the backend at all (no pod running or the wrong serviceName/servicePort)
- 503 Service Unavailable: no endpoints behind the service, typically because all readiness probes are failing
- 504 Gateway Timeout: the backend pod did not respond within the configured timeout (slow database query, hanging process)
Debug steps:
kubectl logs -n ingress-nginx <ingress-controller-pod>
kubectl get endpoints <backend-service>
kubectl describe ingress <ingress-name>
During rolling deployments, brief 502 errors can occur if pods are terminated before connections drain. Configure preStop hooks and tune terminationGracePeriodSeconds to mitigate this.
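A common mitigation is a short preStop sleep so the pod keeps serving while endpoint removal propagates, paired with a grace period long enough to drain. The values below are illustrative:

```yaml
# Container spec fragment
lifecycle:
  preStop:
    exec:
      command: ["sleep", "10"]    # requires the sleep binary in the image
# Pod spec
terminationGracePeriodSeconds: 45  # must exceed the preStop sleep plus drain time
```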
Storage Issues
13. PVC Stuck in Pending
A PersistentVolumeClaim cannot bind to a PersistentVolume.
Common causes: no matching StorageClass, capacity mismatch (the PVC requests more storage than any available PV offers), access mode mismatch (e.g., requesting ReadWriteMany on a storage backend that only supports ReadWriteOnce), zone or topology constraints, the storage provisioner not running, or resource quotas exceeded.
Debug steps:
kubectl describe pvc <pvc-name> -n <namespace>
kubectl get pv
kubectl get storageclass
The Events section of describe pvc almost always reveals the exact reason for the pending state.
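When the cause is a missing or mismatched StorageClass, the fix is usually to name one explicitly that your cluster actually provides. A sketch with an assumed class name:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteOnce            # must be supported by the backing storage
  storageClassName: gp3        # assumed class name - check kubectl get storageclass
  resources:
    requests:
      storage: 20Gi
```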
14. Volume Mount Errors (FailedMount)
The pod starts, but the volume cannot be attached or mounted.
Common causes: the volume is already attached to a different node (common with AWS EBS, which does not support multi-attach by default), filesystem corruption, NFS server unreachable, wrong fsGroup or securityContext, or CSI driver issues.
Debug steps:
kubectl describe pod <pod-name> # Look for FailedMount in Events
kubectl get volumeattachments # Check attachment state
Security and RBAC Issues
15. RBAC Forbidden Errors
The classic “User cannot list resource in API group at the cluster scope” message.
Common causes: missing Role or ClusterRole, missing RoleBinding or ClusterRoleBinding, wrong service account reference, or attempting to use a namespace-scoped Role for a cluster-scoped resource.
Debug steps:
kubectl auth can-i list pods --as=system:serviceaccount:<ns>:<sa-name>
kubectl get clusterrolebindings | grep <role-name>
kubectl describe rolebinding -n <namespace>
The kubectl auth can-i command is the fastest way to verify permissions. We have published a comprehensive Kubernetes RBAC audit guide for teams that need to tighten their access controls systematically.
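When can-i returns "no", the fix is a Role (or ClusterRole) plus a binding to the service account. A minimal namespace-scoped sketch; the namespace and service account names are assumptions:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: my-namespace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: my-namespace
subjects:
  - kind: ServiceAccount
    name: my-service-account
    namespace: my-namespace
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```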
16. Security Context and Admission Failures
Pods rejected by admission controllers because they violate security policies.
Common causes: running as root when the Pod Security Standard restricts it, requesting a privileged container, using hostNetwork or hostPID when not permitted, or missing required security contexts.
Debug steps:
kubectl describe pod <pod-name> # Admission rejection in Events
kubectl get podsecuritypolicies # If still using PSPs
kubectl label ns <namespace> --list | grep pod-security
For a broader treatment of Kubernetes security, see our Kubernetes security best practices guide.
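With Pod Security Admission, the policy is driven entirely by namespace labels. A sketch that enforces the baseline profile while surfacing warnings for anything that would violate restricted:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline    # rejects violating pods
    pod-security.kubernetes.io/warn: restricted     # warns, but still admits
```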
Control Plane Issues
17. API Server Unavailability or Slowness
When kubectl commands time out or fail, the API server is usually the culprit.
Common causes: etcd performance degradation, excessive LIST calls on large objects, certificate expiration, resource exhaustion on control plane nodes, or too many active watchers overwhelming the API server.
Debug steps:
kubectl cluster-info
kubectl get --raw /healthz
On managed Kubernetes services (EKS, AKS, GKE), API server issues are typically handled by the cloud provider, but you should still monitor API server latency metrics. See our EKS vs AKS vs GKE comparison for how each provider handles control plane availability differently.
18. etcd Issues (Quorum Loss, Storage Full)
etcd is the single source of truth for all cluster state. When it is unhealthy, writes fail silently — pods cannot be scheduled, ConfigMaps cannot be updated, and scaling operations hang.
Common causes: loss of quorum (two out of three control plane nodes down), database hitting its size limit (default 2 GB, maximum 8 GB), slow disk I/O on etcd nodes, or compaction not running.
Debug steps:
etcdctl endpoint health
etcdctl endpoint status --write-out=table
etcdctl alarm list
If etcd has hit its storage limit, you must run compaction followed by defragmentation before the cluster can accept writes again. This is a P0 scenario.
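A hedged sketch of the recovery sequence on a self-managed cluster; the endpoint, TLS, and revision arguments are environment-specific, and on managed control planes this is handled by the provider:

```bash
# 1. Compact the keyspace history (take the current revision from endpoint status)
etcdctl compact <current-revision>
# 2. Defragment each member to return freed space to the filesystem
etcdctl defrag
# 3. Clear the NOSPACE alarm so writes are accepted again
etcdctl alarm disarm
```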
HPA and Autoscaling Failures
19. HPA Not Scaling
The Horizontal Pod Autoscaler sees metrics but refuses to scale, or it cannot fetch metrics at all.
Common causes: the metrics-server is not running or misconfigured, pod containers lack CPU or memory requests (HPA calculates utilisation as a percentage of requests), the target utilisation is set too high or too low, or node capacity is exhausted so new replicas cannot be scheduled.
Debug steps:
kubectl get hpa -n <namespace>
kubectl describe hpa <hpa-name> -n <namespace>
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
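For reference, a minimal autoscaling/v2 manifest looks like the sketch below; the Deployment name and numbers are assumptions, and remember that the target's containers must declare CPU requests for the utilisation calculation to work:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                      # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percentage of the pod's CPU requests
```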
20. CPU Throttling Causing Latency
This one is subtle and commonly overlooked. Even when average CPU utilisation is low, the Linux Completely Fair Scheduler (CFS) can throttle burst workloads that exceed the CPU limit within a single quota period.
Common causes: CPU limits set too aggressively relative to burst patterns, the CFS quota period (typically 100ms) being too short for bursty workloads, and teams confusing average utilisation with peak utilisation.
Key metric to watch:
container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total
If this ratio exceeds 25 per cent, your workloads are experiencing significant throttling. Consider raising or removing CPU limits for latency-sensitive services while keeping CPU requests in place for scheduling purposes.
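Expressed as a Prometheus alert rule, the check might look like this sketch (it assumes cAdvisor metrics are being scraped via the kubelet):

```yaml
# Alert rule sketch - fires when a container is throttled in more than 25% of CFS periods
- alert: HighCPUThrottling
  expr: |
    sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (namespace, pod, container)
      /
    sum(rate(container_cpu_cfs_periods_total[5m])) by (namespace, pod, container)
      > 0.25
  for: 15m
  labels:
    severity: warning
```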
Essential Troubleshooting Tools
CLI Tools
| Tool | Purpose |
|---|---|
| kubectl debug | Attach ephemeral containers to running pods or debug nodes directly — essential for distroless images |
| k9s | Terminal-based interactive dashboard with real-time resource viewing and keyboard-driven navigation |
| stern | Multi-pod, multi-container log tailing with colour-coded output per pod |
| crictl | Container runtime CLI for node-level debugging (crictl ps, crictl logs, crictl inspect) |
Network Debugging
| Tool | Purpose |
|---|---|
| netshoot | Swiss Army knife container image with tcpdump, netstat, iptables, curl, dig, and more |
| dnsutils | Lightweight pod image for DNS debugging with nslookup and dig |
Observability Stack
For production clusters, we always recommend a comprehensive monitoring stack. Our Prometheus monitoring guide for Kubernetes covers the full setup, but at a minimum you need:
- Prometheus for metrics collection and alerting (CPU throttling, OOM events, pod restarts)
- Grafana for visualisation dashboards
- A log aggregation solution (Loki, Fluentd, or Fluent Bit) for centralised log search
Real-World War Stories
These are drawn from our consulting engagements and from the excellent Kubernetes Failure Stories compilation maintained by the community.
War Story 1: The DNS Cascade
A client’s CoreDNS pods were evicted during a memory pressure event. With DNS down, every service-to-service call failed. Health checks that depended on downstream services then failed, triggering a wave of liveness probe restarts. The restarts increased DNS load on the one remaining CoreDNS pod, which also got evicted. Within four minutes, the entire namespace was down.
Root cause: CoreDNS had no PodDisruptionBudget, only two replicas, and no resource requests set (placing it in the BestEffort QoS class).
Fix: We set resource requests and limits on CoreDNS, added a PDB with minAvailable: 2, increased the replica count to three, and moved CoreDNS to a dedicated node pool with guaranteed resources.
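For reference, the PodDisruptionBudget follows the standard shape; the selector below assumes the k8s-app: kube-dns label that CoreDNS pods carry in most distributions:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: coredns-pdb
  namespace: kube-system
spec:
  minAvailable: 2
  selector:
    matchLabels:
      k8s-app: kube-dns   # verify the label your CoreDNS deployment actually uses
```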
War Story 2: The Invisible CPU Throttle
A team reported P99 latency of 800ms on an API that should respond in under 50ms. CPU utilisation graphs showed only 5 per cent average usage. Everything looked healthy.
The issue was CPU limits set to 200m. The CFS quota of 20ms per 100ms period meant that any request-processing burst exceeding 20ms of CPU time was throttled until the next period. The throttled-period ratio (container_cpu_cfs_throttled_periods_total divided by container_cpu_cfs_periods_total) was sitting at 60 per cent.
Fix: We removed CPU limits entirely (keeping CPU requests at 200m for scheduling) and P99 latency dropped to 35ms. The CNCF troubleshooting guide confirms this is one of the most under-diagnosed production issues.
War Story 3: Pod Priority Preemption Gone Wrong
Inspired by the documented Grafana Labs incident, a client deployed a high-priority batch processing job without understanding the implications. The batch job preempted production API pods. The evicted API pods attempted to reschedule, but the batch job had consumed all available node resources. The production API was down for 25 minutes.
Fix: We introduced PodDisruptionBudgets for all production workloads, created a dedicated node pool for batch processing with appropriate taints, and implemented a PriorityClass hierarchy that explicitly prevented batch jobs from preempting production services.
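A sketch of the kind of hierarchy we mean: batch jobs get a lower value and a non-preempting policy, so they queue for capacity instead of evicting production pods. Names and values are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 1000
preemptionPolicy: Never        # batch pods wait for capacity rather than preempt
globalDefault: false
description: "Low-priority batch workloads; never preempts production pods."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-high
value: 100000
globalDefault: false
description: "Production services; scheduled ahead of batch workloads."
```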
Proactive Monitoring Checklist
The best troubleshooting is the kind you never have to do. Here are the alerts we deploy on every client cluster from day one:
Pod Health:
- Pod restart count > 3 in 15 minutes
- Pod in CrashLoopBackOff for > 5 minutes
- OOMKilled events
- CPU throttling ratio > 25 per cent
Node Health:
- Node NotReady for > 2 minutes
- Node memory utilisation > 85 per cent
- Node disk utilisation > 80 per cent
- Node PID count approaching limit
Cluster Components:
- CoreDNS pod count < desired replicas
- CoreDNS latency > 500ms
- etcd database size > 50 per cent of quota
- API server request latency P99 > 1 second
- Certificate expiry < 30 days
Autoscaling:
- HPA at maximum replicas for > 10 minutes
- Cluster Autoscaler pending pods for > 5 minutes
- PVC utilisation > 80 per cent
Setting these up requires a working Prometheus and Grafana stack. We consider this non-negotiable for any production cluster.
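To make that concrete, here is a hedged sketch of two of the pod-health alerts as Prometheus rules; the metric names assume kube-state-metrics is installed, and the thresholds mirror the checklist above:

```yaml
groups:
  - name: pod-health
    rules:
      - alert: PodCrashLooping
        expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is in CrashLoopBackOff"
      - alert: PodRestartingFrequently
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in 15 minutes"
```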
Container Exit Code Reference
When a container terminates, the exit code tells you exactly what happened. This table saves hours of guesswork.
| Exit Code | Signal | Meaning |
|---|---|---|
| 0 | — | Clean exit (success) |
| 1 | — | Application error (generic) |
| 2 | — | Shell misuse or invalid argument |
| 125 | — | Container runtime error (failed to start) |
| 126 | — | Command invoked cannot execute (permission issue) |
| 127 | — | Command not found (wrong entrypoint or missing binary) |
| 128 | — | Invalid exit argument |
| 137 | SIGKILL (9) | OOMKilled or kubectl delete --force |
| 139 | SIGSEGV (11) | Segmentation fault (application crash) |
| 143 | SIGTERM (15) | Graceful termination requested (normal pod shutdown) |
Exit code 137 is by far the most common non-zero code we encounter. If you see it, check memory limits first.
Reduce Production Incidents With Expert Kubernetes Support
Troubleshooting production Kubernetes issues under pressure is stressful, and every minute of downtime costs money. The patterns in this guide are preventable with the right architecture, monitoring, and operational practices in place from the start.
Our team provides comprehensive Kubernetes consulting services to help you:
- Architect resilient clusters with proper resource management, PodDisruptionBudgets, and priority class hierarchies
- Implement proactive monitoring with Prometheus, Grafana, and custom alerting tailored to your workloads
- Train your on-call engineers with runbooks, triage frameworks, and hands-on incident simulation exercises
We have managed hundreds of production clusters across EKS, AKS, and GKE. Whether you need a one-time architecture review or ongoing managed support, we can help you move from reactive firefighting to proactive reliability engineering.