Kubernetes upgrades should be routine. In practice, they are anything but. With three minor releases per year, a 14-month support window, and extended support costs that can jump by 500%, falling behind on upgrades is one of the most expensive mistakes an organisation can make — and one of the most common.
In our consulting work at Tasrie IT Services, we have managed upgrades across more than 60 production clusters spanning EKS, AKS, and GKE. Along the way, we have tested every major upgrade strategy, from simple rolling updates to full blue-green cluster swaps. This guide distils what we have learnt into a practical, opinionated playbook that covers why upgrades matter, which strategy to choose, a step-by-step pre-upgrade runbook, provider-specific processes, and what to do when things go wrong.
Why Kubernetes Upgrades Cannot Wait
The Kubernetes project maintains only the three most recent minor versions under its N-2 support policy. Each version receives approximately 14 months of patch support — 12 months of active maintenance plus a 2-month upgrade buffer. Once a version leaves support, it no longer receives security patches, bug fixes, or CVE remediations.
That alone should motivate timely upgrades. But the financial penalties from managed providers add urgency.
The Real Cost of Delayed Upgrades
| Provider | Standard Support | Extended Support | Cost Increase | Extended Duration |
|---|---|---|---|---|
| EKS | $0.10/cluster/hr | $0.60/cluster/hr | 6x | 12 months |
| GKE | Standard pricing | 500% surcharge | 6x | Varies by channel |
| AKS | Free (Free tier) or $0.10/cluster/hr (Standard tier) | LTS requires Premium tier ($0.60/cluster/hr) | 6x vs Standard tier | 24 months (opt-in LTS) |
For a single EKS cluster, the jump from $0.10 to $0.60 per hour translates to an additional $4,380 per year. Multiply that across a fleet of 20 clusters and the organisation is paying an extra $87,600 annually — simply for not upgrading. That budget would be far better spent on engineering effort to keep clusters current.
Amazon EKS automatically enrols clusters into extended support once standard support ends, so the cost increase arrives silently. GKE applies its 500% surcharge across all release channels. Azure AKS offers the longest runway through its Long-Term Support channel, but LTS covers only select versions and requires the paid Premium tier.
The 2026 Version Landscape
As of February 2026, these are the active Kubernetes versions and their support timelines:
| Version | Upstream Release | Standard Support Ends | Status |
|---|---|---|---|
| 1.32 | December 2024 | February 2026 | EOL / Extended |
| 1.33 | April 2025 | June 2026 | Active |
| 1.34 | August 2025 | October 2026 | Active |
| 1.35 | December 2025 | February 2027 | Active (Current) |
If your clusters are still running 1.32 or earlier, they are already in extended support territory. The longer you wait, the more versions you need to traverse sequentially — and the version skew policy forbids skipping minor versions.
The Three Upgrade Strategies: What We Tested
Over the past 18 months, we have executed production upgrades using three distinct strategies. Each has clear trade-offs across risk, cost, speed, and operational complexity. Understanding these trade-offs is critical when building Kubernetes migration strategies for your organisation.
Strategy 1: Rolling In-Place Upgrade
The rolling strategy upgrades nodes one at a time within the existing cluster. Each node is cordoned, drained of workloads, upgraded to the target version, and uncordoned.
How it works:
- Upgrade the control plane to the target minor version
- Cordon the first worker node to prevent new pod scheduling
- Drain the node, evicting all pods (which reschedule onto remaining nodes)
- Upgrade the node’s kubelet and container runtime
- Uncordon the node
- Repeat for each remaining node
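On a kubeadm-managed node, the drain-upgrade-uncordon loop looks roughly like this (a minimal sketch: node-1 is a placeholder, the package version is an example, and managed providers handle the kubelet step for you):
# Cordon and drain the node (drain respects PDBs and skips DaemonSet pods)
kubectl cordon node-1
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
# On the node itself: upgrade the kubelet package (Debian/Ubuntu example)
sudo apt-mark unhold kubelet && sudo apt-get update && \
  sudo apt-get install -y kubelet='1.34.1-*' && sudo apt-mark hold kubelet
sudo systemctl daemon-reload && sudo systemctl restart kubelet
# Return the node to the scheduler and confirm it reports the new version
kubectl uncordon node-1
kubectl get node node-1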
Strengths:
- Lowest resource overhead — no additional nodes required beyond surge capacity
- Fastest for small clusters (fewer than 10 nodes)
- Native support in all managed providers (`az aks upgrade`, `eksctl upgrade nodegroup`)
Weaknesses:
- Each node drain introduces brief workload disruption if PDBs are misconfigured
- Rollback is difficult — you cannot downgrade kubelet to an older version than the control plane
- If the upgrade introduces a bug, you discover it progressively as nodes cycle through
Best for: Development and staging clusters, small production clusters with good PDB coverage, patch-level upgrades (e.g., 1.34.1 to 1.34.3).
Strategy 2: Blue-Green Node Pool Migration
Instead of upgrading nodes in place, this strategy creates an entirely new node pool running the target Kubernetes version alongside the existing pool. Workloads are migrated by cordoning old nodes and allowing the scheduler to place pods on the new pool.
How it works:
- Upgrade the control plane to the target minor version
- Create a new node pool with the target kubelet version (the “green” pool)
- Wait for all green nodes to reach `Ready` status
- Cordon all nodes in the old “blue” pool
- Drain blue nodes one at a time, allowing pods to schedule on green nodes
- Validate application health on the green pool
- Delete the old blue node pool
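For illustration, here is roughly what the migration looks like on EKS with eksctl (a hedged sketch: node group names, sizes, and the nodegroup label are placeholders; `az aks nodepool add` and `gcloud container node-pools create` follow the same pattern):
# Create the "green" managed node group; it picks up the upgraded control plane version
eksctl create nodegroup --cluster production-cluster --name workers-v1-34 --nodes 6
# Cordon the old "blue" group, then drain its nodes one at a time
kubectl cordon -l eks.amazonaws.com/nodegroup=workers-v1-33
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=workers-v1-33 \
    -o jsonpath='{.items[*].metadata.name}'); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done
# Once workloads are validated on the green pool, remove the blue group
eksctl delete nodegroup --cluster production-cluster --name workers-v1-33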
Strengths:
- Near-instant rollback — simply uncordon blue nodes and drain green if issues arise
- Workloads migrate to freshly provisioned nodes with clean OS images
- Full control over migration pace and timing
Weaknesses:
- Requires double the node capacity during the transition window
- More expensive for large clusters (you are paying for two pools simultaneously)
- Requires careful handling of node-local storage and DaemonSets
Best for: Production clusters where downtime risk must be minimised, minor version upgrades (e.g., 1.33 to 1.34), clusters with strict compliance or SLA requirements.
Strategy 3: Blue-Green Cluster Swap
The most conservative approach creates an entirely new cluster running the target version. Workloads are deployed fresh, validated, and traffic is switched at the load balancer or DNS level.
How it works:
- Provision a new cluster at the target Kubernetes version
- Deploy all workloads using GitOps (ArgoCD, Flux) or Helm
- Run smoke tests and conformance checks on the new cluster
- Shift traffic gradually (canary weight at the load balancer)
- Once validated, drain traffic from the old cluster
- Decommission the old cluster
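As one illustration of the gradual traffic shift, assuming DNS-level switching with Route 53 weighted records (hostnames, the hosted zone ID, and weights are placeholders; both the blue and green records need a SetIdentifier and Weight for weighted routing to apply):
# Send 10% of traffic to the green cluster's ingress; repeat with higher weights as checks pass
cat > shift-green-10.json <<'EOF'
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "app.example.com",
      "Type": "CNAME",
      "SetIdentifier": "green-cluster",
      "Weight": 10,
      "TTL": 60,
      "ResourceRecords": [{ "Value": "green-ingress.example.com" }]
    }
  }]
}
EOF
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789EXAMPLE \
  --change-batch file://shift-green-10.json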
Strengths:
- Complete isolation between old and new environments
- True rollback — the old cluster remains fully operational until decommissioned
- Opportunity to rebuild infrastructure-as-code definitions cleanly
- Ideal for major version jumps or clusters with significant configuration drift
Weaknesses:
- Highest cost — two full clusters running simultaneously
- Requires mature GitOps practices to ensure the new cluster matches the old
- Stateful workloads (databases, queues) need careful data migration planning
- DNS/load balancer switchover introduces its own failure modes
Best for: Major upgrades with multiple version jumps, compliance-sensitive environments (PCI DSS, HIPAA), organisations with mature GitOps and ArgoCD workflows.
The Decision Matrix
| Factor | Rolling | Blue-Green Node Pool | Blue-Green Cluster |
|---|---|---|---|
| Resource overhead | Low (surge only) | Medium (2x nodes) | High (2x cluster) |
| Rollback speed | Slow/difficult | Fast (uncordon old) | Instant (traffic switch) |
| Downtime risk | Medium | Low | Lowest |
| Operational complexity | Low | Medium | High |
| Best for patch upgrades | Yes | Overkill | Overkill |
| Best for minor upgrades | Sometimes | Yes | Sometimes |
| Best for multi-version jumps | No | No | Yes |
| GitOps maturity required | Low | Low | High |
Our recommendation: Use rolling upgrades for patches, blue-green node pools for minor version upgrades, and blue-green cluster swaps only when jumping multiple versions or operating under strict regulatory requirements. This layered approach balances safety with cost efficiency.
Pre-Upgrade Checklist: The Runbook
Every upgrade we execute follows this checklist. Each item has caught real issues in production environments. Before diving into the technical steps, ensure your Kubernetes security posture is solid — an upgrade is also an opportunity to audit and harden.
Step 1: Audit Deprecated APIs
Deprecated APIs are the single most common cause of upgrade failures. Use Pluto for static manifest scanning and kubent for live cluster analysis.
# Scan Helm releases for deprecated APIs
pluto detect-helm --target-versions k8s=v1.34
# Scan live cluster for deprecated API usage
kubent
# Scan specific manifest files
pluto detect-files -d ./manifests/ --target-versions k8s=v1.34
The Kubernetes Deprecated API Migration Guide lists every deprecated and removed API by version. Review it against your target version before proceeding.
If you find deprecated resources, use the kubectl convert plugin to update manifests:
# Install the kubectl-convert plugin (distributed as a binary with the Kubernetes release artifacts)
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl-convert"
chmod +x kubectl-convert && sudo mv kubectl-convert /usr/local/bin/
# Convert a manifest from a deprecated API version
kubectl convert -f old-ingress.yaml --output-version networking.k8s.io/v1
Step 2: Verify Add-On and CNI Compatibility
Before upgrading, confirm that every cluster add-on supports the target Kubernetes version. Add-ons do not update automatically during cluster upgrades.
# Check current add-on versions (EKS example)
aws eks describe-addon-versions \
--kubernetes-version 1.34 \
--query 'addons[].{Name:addonName,Versions:addonVersions[0].addonVersion}'
# For self-managed clusters, verify CoreDNS, kube-proxy,
# and CNI plugin compatibility against their release notes
kubectl get pods -n kube-system -o wide
Critical add-ons to verify: CoreDNS, kube-proxy, your CNI plugin (Calico, Cilium, AWS VPC CNI), CSI drivers, ingress controllers, and cert-manager.
Step 3: Back Up etcd (Self-Managed Clusters)
For self-managed clusters, an etcd snapshot is your last line of defence. Managed providers handle this internally, but if you run kubeadm or kOps, this step is non-negotiable.
# Take an etcd snapshot
sudo ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-pre-upgrade.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify the snapshot
sudo ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-pre-upgrade.db --write-out=table
Take backups during off-peak hours — the snapshot process has a high I/O cost and can briefly impact cluster performance.
Step 4: Review Pod Disruption Budgets
PDBs govern how many pods can be simultaneously unavailable during node drain. Misconfigured PDBs are the second most common cause of upgrade stalls.
# List all PDBs and their current status
kubectl get pdb --all-namespaces
# Check for PDBs that could block drain (maxUnavailable=0 or
# disruptionsAllowed=0)
kubectl get pdb --all-namespaces -o json | \
jq '.items[] | select(.status.disruptionsAllowed == 0) |
{namespace: .metadata.namespace, name: .metadata.name,
allowed: .status.disruptionsAllowed}'
If any PDB shows disruptionsAllowed: 0, investigate whether the underlying deployment has enough replicas. A deployment with replicas: 1 and maxUnavailable: 0 will block node drain indefinitely.
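A quick way to spot the usual culprit is to list every single-replica deployment and cross-reference it against the blocked PDBs (a minimal sketch using jq):
# List deployments running a single replica -- prime candidates for PDB deadlock
kubectl get deployments --all-namespaces -o json | \
  jq -r '.items[] | select(.spec.replicas == 1) |
    "\(.metadata.namespace)/\(.metadata.name)"'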
Step 5: Run Conformance Tests
Use Sonobuoy to validate that your cluster meets Kubernetes conformance requirements before introducing upgrade variables.
# Run a quick conformance check (takes ~10 minutes)
sonobuoy run --mode quick --wait
# Retrieve and inspect results
results_tarball=$(sonobuoy retrieve)
sonobuoy results "$results_tarball"
Step 6: Test in a Non-Production Environment
Create a staging cluster that mirrors your production configuration as closely as possible. Deploy the same workloads, apply the same network policies, and run the same monitoring stack. Upgrade this staging cluster first and soak-test for at least 24-48 hours before touching production.
Control Plane First, Data Plane Second
The Kubernetes version skew policy dictates a strict ordering: the control plane must always be upgraded before worker nodes. The kubelet must never be newer than the kube-apiserver.
Component Compatibility Rules
| Component | Allowed Skew from kube-apiserver |
|---|---|
| kube-controller-manager | Same version or 1 minor version older |
| kube-scheduler | Same version or 1 minor version older |
| kubelet | Up to 3 minor versions older |
| kube-proxy | Up to 3 minor versions older |
| kubectl | 1 minor version newer or older |
While the kubelet can theoretically run 3 minor versions behind the API server, we strongly advise keeping the gap to 1 minor version at most. A wider skew increases the surface area for subtle compatibility bugs and makes troubleshooting significantly harder.
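To see the current skew at a glance, compare the API server version with the kubelet version each node reports:
# API server (and client) version
kubectl version
# Kubelet version reported by every node
kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion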
The Upgrade Sequence
For both self-managed and managed clusters, follow this order:
- Control plane — upgrade kube-apiserver, kube-controller-manager, kube-scheduler, and cloud-controller-manager
- Add-ons — upgrade CoreDNS, kube-proxy, CNI plugin, and CSI drivers to compatible versions
- Worker nodes — upgrade kubelet and container runtime on each node (rolling or blue-green)
Managed providers automate much of this. On EKS, aws eks update-cluster-version handles the control plane; you then upgrade managed node groups separately. On GKE, automatic upgrades handle both planes by default. On AKS, az aks upgrade upgrades the control plane and node pools together or separately.
Configuring Pod Disruption Budgets for Zero Downtime
Pod Disruption Budgets are the mechanism that makes zero-downtime upgrades possible. Without them, a node drain can evict every replica of a service simultaneously, causing an outage.
PDB Configuration Patterns
# Pattern 1: Percentage-based (recommended for most services)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: web-frontend-pdb
spec:
minAvailable: "50%"
selector:
matchLabels:
app: web-frontend
---
# Pattern 2: Absolute count (for services with fixed replica counts)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-gateway-pdb
spec:
maxUnavailable: 1
selector:
matchLabels:
app: api-gateway
---
# Pattern 3: For stateful workloads (databases, message queues)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: postgres-pdb
spec:
minAvailable: 2
selector:
matchLabels:
app: postgres
PDB Anti-Patterns to Avoid
- `maxUnavailable: 0` — Blocks all voluntary disruptions, including node drain. The drain process will wait up to one hour per node before timing out (on GKE) or indefinitely (on EKS/AKS). This is the number one cause of stuck upgrades.
- Mismatched selectors — A PDB whose label selector matches no pods provides no protection at all. Verify selectors with `kubectl get pods -l app=your-app`.
- Single-replica deployments with PDBs — A deployment with `replicas: 1` and `minAvailable: 1` creates a deadlock: the pod cannot be evicted because it would violate the PDB, but it cannot be rescheduled because the node must be drained first. Either remove the PDB or increase replicas to at least 2.
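The second anti-pattern is easy to audit: a PDB whose selector matches nothing reports zero expected pods in its status (a minimal sketch using jq):
# List PDBs that currently match zero pods -- their selectors protect nothing
kubectl get pdb --all-namespaces -o json | \
  jq -r '.items[] | select(.status.expectedPods == 0) |
    "\(.metadata.namespace)/\(.metadata.name)"'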
Kubernetes 1.33: Control Plane Rollback Changes Everything
Historically, Kubernetes upgrades were a one-way street. Once the control plane was upgraded, the only path forward was to fix issues in the new version — rolling back to a previous minor version was not supported and could corrupt cluster state.
Kubernetes 1.33 introduced KEP-4330, which fundamentally changes this. The new two-step upgrade process introduces an “emulated version” concept, allowing the control plane to run newer binaries while behaving as though it is still on the previous version.
How Two-Step Upgrades Work
- Binary upgrade — Upgrade the control plane binaries to 1.33+ but set the emulated version to the previous release (e.g., 1.32). The API server exposes the older API surface.
- Soak period — Run the cluster with the new binaries for a configurable period, validating stability. During this phase, rollback is safe.
- Version activation — Once satisfied, remove the emulated version setting. The control plane now fully operates at the new version.
If issues emerge during the soak period, you can roll back the binary upgrade because no irreversible schema migrations have occurred. This safety net was previously unavailable and significantly reduces the risk of minor version upgrades.
Important limitation: This feature is available only for upgrades to version 1.33 or later and applies to control plane components only. Node rollback still follows the traditional approach.
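On a self-managed control plane, the emulation is driven by the compatibility-version flag from KEP-4330 (a hedged sketch: the flag name follows the KEP, but check your release's documentation for the exact value format):
# Confirm the new kube-apiserver binary supports compatibility versions (KEP-4330)
kube-apiserver --help | grep -i 'emulated-version'
# Step 1: restart the 1.33 binary with its existing flags plus --emulated-version=1.32
# Step 2: soak and monitor; rolling back to the 1.32 binary remains safe in this state
# Step 3: remove the flag (or raise it to 1.33) to activate the new version fully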
Provider-Specific Upgrade Processes
Each cloud provider implements Kubernetes upgrades differently. If you are evaluating providers, our detailed EKS vs AKS vs GKE comparison covers broader architectural differences.
Amazon EKS
EKS requires the most manual intervention of the three major providers. The control plane and data plane are upgraded separately.
# Step 1: Upgrade the control plane
aws eks update-cluster-version \
--region eu-west-1 \
--name production-cluster \
--kubernetes-version 1.34
# Wait for the update to complete (typically 20-40 minutes)
aws eks wait cluster-active --name production-cluster
# Step 2: Upgrade managed node groups
aws eks update-nodegroup-version \
--cluster-name production-cluster \
--nodegroup-name workers \
--kubernetes-version 1.34
# Alternative: Use eksctl for a streamlined experience
eksctl upgrade cluster --name production-cluster --version 1.34 --approve
EKS enrols clusters into extended support automatically unless the cluster's upgrade policy is set to standard support, in which case EKS force-upgrades the control plane when the window closes instead. Either way, drifting past the 14-month standard support window without a plan means either paying $0.60/cluster/hour or absorbing an unscheduled upgrade. Plan upgrades proactively to avoid both.
Azure AKS
AKS provides a middle ground with semi-automated upgrades and built-in surge node support.
# Upgrade control plane and node pools together
az aks upgrade \
--resource-group production-rg \
--name production-cluster \
--kubernetes-version 1.34 \
--yes
# Or upgrade control plane only first
az aks upgrade \
--resource-group production-rg \
--name production-cluster \
--kubernetes-version 1.34 \
--control-plane-only
AKS supports automatic upgrades through channels: none, patch, stable, rapid, and node-image. The stable channel is recommended for production — it applies minor version upgrades only after they have been validated in the rapid channel.
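Setting the channel is a single command (resource group and cluster name are placeholders):
# Enrol the cluster in the stable auto-upgrade channel
az aks update \
  --resource-group production-rg \
  --name production-cluster \
  --auto-upgrade-channel stable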
AKS also offers Long-Term Support for select versions (such as 1.27), providing 24 months of support. However, LTS is opt-in, covers only specific versions, and requires the cluster to run on the paid Premium tier.
Google GKE
GKE is the most automated of the three. Upgrades are enabled by default and managed through release channels.
# Manual upgrade (if auto-upgrade is disabled)
gcloud container clusters upgrade production-cluster \
--master \
--cluster-version 1.34 \
--region europe-west2
# Upgrade node pools
gcloud container clusters upgrade production-cluster \
--node-pool workers \
--cluster-version 1.34 \
--region europe-west2
GKE supports two node pool upgrade strategies: surge upgrades (the default, adding extra nodes during drain) and blue-green upgrades (creating a parallel pool). For production workloads, we recommend the blue-green strategy.
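Blue-green can be enabled per node pool (a hedged sketch: flag names follow the GKE node pool upgrade docs, and the soak duration is an example value):
# Switch an existing node pool to the blue-green upgrade strategy with a one-hour soak
gcloud container node-pools update workers \
  --cluster production-cluster \
  --region europe-west2 \
  --enable-blue-green-upgrade \
  --node-pool-soak-duration=3600s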
GKE’s release channels — Rapid, Regular, and Stable — control when new versions become available. The Stable channel typically lags upstream by a few months, giving the community time to surface issues before your production clusters are affected.
Common Upgrade Anti-Patterns
We have seen these mistakes repeatedly across client environments. Every one of them has caused production incidents or unnecessary cost. Avoiding these is just as important as understanding the common Kubernetes mistakes that affect day-to-day operations.
Anti-Pattern 1: The “We’ll Upgrade When It Breaks” Mentality
Community survey data shows that approximately 75% of organisations have no fixed upgrade cadence, upgrading only when forced. This approach guarantees that you will eventually face a multi-version jump under pressure, with cascading breaking changes, deprecated API removals, and extended support costs all compounding simultaneously.
Fix: Establish a quarterly upgrade cadence. Upgrade within 30 days of a new minor version reaching your provider’s stable channel.
Anti-Pattern 2: Skipping Minor Versions
Kubernetes explicitly does not support skipping minor versions. You cannot jump from 1.32 to 1.34; you must go through 1.33 first. Each skipped version compounds the risk because breaking changes, API deprecations, and behavioural differences accumulate.
Fix: Upgrade sequentially, one minor version at a time. Budget for each step in your upgrade plan.
Anti-Pattern 3: Upgrading Production Without Staging
Deploying an untested upgrade directly to production is a gamble. Deprecated APIs that only surface under load, add-on incompatibilities that manifest after hours of runtime, and CNI behavioural changes can all be caught in staging.
Fix: Maintain a staging cluster that mirrors production. Soak-test for at least 24 hours before promoting to production.
Anti-Pattern 4: Ignoring terminationGracePeriodSeconds
Pods with excessively long grace periods (300+ seconds) can dramatically extend upgrade windows. A cluster with 500 pods averaging a 5-minute grace period can add hours to the total drain time.
Fix: Audit grace periods across all deployments. Most applications can terminate gracefully within 30 seconds. Set the default to 30 and only increase for workloads that genuinely need longer shutdown windows.
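A minimal sketch of that audit, listing every deployment whose pod template sets a grace period above 30 seconds (the default of 30 applies when the field is unset):
# Find deployments with terminationGracePeriodSeconds longer than 30s
kubectl get deployments --all-namespaces -o json | \
  jq -r '.items[] |
    select((.spec.template.spec.terminationGracePeriodSeconds // 30) > 30) |
    "\(.metadata.namespace)/\(.metadata.name): \(.spec.template.spec.terminationGracePeriodSeconds)s"'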
Anti-Pattern 5: Fleet Version Fragmentation
When different teams manage their own clusters without a central version policy, version drift is inevitable. One team runs 1.35 while another lags at 1.32, creating security vulnerabilities, operational inconsistencies, and training overhead.
Fix: Implement a fleet-wide version policy. All clusters must be within one minor version of each other. Use cost management tooling to track which clusters are incurring extended support charges.
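For an EKS fleet, a short loop produces the version report to check against that policy (a minimal sketch; `az aks list` and `gcloud container clusters list` give the same information on the other providers):
# Report the Kubernetes version of every EKS cluster in the current account and region
for cluster in $(aws eks list-clusters --query 'clusters[]' --output text); do
  version=$(aws eks describe-cluster --name "$cluster" \
    --query 'cluster.version' --output text)
  echo "$cluster: $version"
done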
What to Do When an Upgrade Fails
Even with thorough preparation, upgrades can fail. Having a recovery plan is essential.
Control Plane Failure
If the control plane upgrade fails (common symptoms: API server unreachable, etcd leader election failures):
- Check provider status — Managed providers sometimes experience control plane upgrade failures due to capacity issues. Check your provider’s status page.
- Review events — `kubectl get events --sort-by='.lastTimestamp'` (if the API server is reachable).
- For self-managed clusters — Restore from the etcd snapshot taken before the upgrade. Stop the kube-apiserver and etcd before restoring.
# Restore etcd from snapshot (self-managed)
sudo ETCDCTL_API=3 etcdutl snapshot restore /backup/etcd-pre-upgrade.db \
--data-dir=/var/lib/etcd-restored
- For managed providers — Open a support case immediately. EKS, AKS, and GKE all provide control plane SLAs.
Node Drain Stuck
The most common runtime failure is a node drain that never completes, typically caused by PDB violations or pods that refuse to terminate.
# Identify which pods are blocking drain
kubectl get pods --field-selector spec.nodeName=<stuck-node> \
-o wide --all-namespaces
# Check PDB status
kubectl get pdb --all-namespaces
# Force drain (last resort -- causes downtime for affected pods)
kubectl drain <stuck-node> \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--grace-period=30
Application Failures After Upgrade
If applications fail after the upgrade completes:
- Check for removed APIs — `kubectl get events | grep "no matches for kind"` indicates a removed API version.
- Validate webhook configurations — Admission webhooks compiled against older API versions may reject new resource formats.
- Review RBAC changes — Some versions introduce new default RBAC rules. Kubernetes 1.32, for example, moved `AuthorizeNodeWithSelectors` to beta (enabled by default), which broke some existing RBAC configurations.
- Roll back if on 1.33+ — If you used the two-step upgrade process, roll back the control plane to the emulated version while you investigate.
For comprehensive recovery planning, our Kubernetes disaster recovery playbook covers etcd backup strategies, GitOps-driven recovery, and gameday testing in detail.
Building an Upgrade Cadence That Sticks
The most successful teams we work with treat Kubernetes upgrades not as a project but as a continuous process. Here is the cadence we recommend:
Monthly: Patch Upgrades
Apply patch releases within 2 weeks of availability. These contain security fixes and bug patches with no API changes. Use rolling in-place upgrades.
Quarterly: Minor Version Upgrades
Upgrade to the latest stable minor version each quarter. Use blue-green node pool migration for production clusters. Budget 1-2 days for the full cycle (staging soak + production upgrade + validation).
Continuously: Automated Dependency Tracking
Use Renovate or Dependabot to track container image updates, Helm chart versions, and add-on compatibility. These tools can automatically open pull requests when dependencies have newer versions available, keeping your manifests current between cluster upgrades.
Pre-Upgrade Automation
Codify your pre-upgrade checklist into a CI pipeline. The workflow below assumes pluto and kubent are installed on the runner (for example, in an earlier setup step) and that kubent has access to the target cluster:
# Example GitHub Actions workflow for pre-upgrade validation
name: Pre-Upgrade Checks
on:
workflow_dispatch:
inputs:
target_version:
description: 'Target Kubernetes version'
required: true
jobs:
validate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Scan for deprecated APIs
run: |
pluto detect-files -d ./manifests/ \
--target-versions k8s=v${{ github.event.inputs.target_version }}
- name: Check Helm releases
run: |
pluto detect-helm \
--target-versions k8s=v${{ github.event.inputs.target_version }}
- name: Validate add-on compatibility
run: |
kubent --target-version ${{ github.event.inputs.target_version }}
Key Takeaways
Kubernetes upgrades are not optional. With a three-release-per-year cadence and a 14-month support window, every organisation running Kubernetes needs a repeatable, tested upgrade process. The cost of inaction — both financial (6x extended support pricing) and operational (accumulated technical debt, security exposure) — far outweighs the effort of staying current.
Choose your strategy based on risk tolerance: rolling for patches, blue-green node pools for minor versions, blue-green clusters for major leaps. Invest in pre-upgrade tooling (Pluto, kubent, Sonobuoy) to catch issues before they reach production. And take advantage of Kubernetes 1.33’s control plane rollback capability to reduce the risk of every upgrade going forward.
Keep Your Clusters Current Without the Risk
Falling behind on Kubernetes versions creates compounding technical debt, security exposure, and escalating cloud costs. But executing upgrades across production clusters — especially at scale — requires deep expertise in version skew policies, PDB configuration, provider-specific processes, and failure recovery.
Our team provides comprehensive Kubernetes consulting services to help you:
- Build a repeatable upgrade runbook tailored to your cluster architecture, provider, and compliance requirements
- Execute zero-downtime upgrades using blue-green node pool strategies with validated rollback procedures
- Automate pre-upgrade validation with CI/CD pipelines that catch deprecated APIs and add-on incompatibilities before they reach production
We have managed upgrades across 60+ production clusters on EKS, AKS, and GKE — and we bring that experience to every engagement.