Kubernetes was designed for resilience. Self-healing pods, replica sets, rolling deployments — these features create a comfortable illusion that your cluster can survive anything. But that illusion breaks the moment you face a corrupted etcd database, a misconfigured Helm release that cascades across namespaces, or an entire cloud region going dark.
With enterprise downtime costing upwards of $300,000 per hour on average, and nearly two-thirds of businesses reporting data loss incidents in the past year, a robust disaster recovery strategy is not optional — it is a survival requirement. Yet in our consulting work, we consistently find that organisations running Kubernetes treat DR as an afterthought, if they address it at all.
Over the past several years, we have helped dozens of organisations build and test disaster recovery playbooks for production Kubernetes clusters across AWS, Azure, and GCP. This post distils everything we have learnt into a practical, opinionated guide. We cover what to back up, which DR patterns to choose, how GitOps transforms recovery, which tools to use, and how to run gameday exercises that actually prove your plan works.
Understanding RTO and RPO for Kubernetes Workloads
Before selecting a DR strategy, you need to define two foundational metrics: Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
RTO is the maximum acceptable duration of downtime. If your RTO is 15 minutes, your systems must be operational within 15 minutes of a failure event.
RPO is the maximum acceptable amount of data loss measured in time. If your RPO is one hour, you can tolerate losing up to one hour of data.
These values are not one-size-fits-all. We recommend classifying your workloads into three tiers:
Tier 1: Mission-Critical
These are revenue-generating, customer-facing applications where any downtime directly impacts the business. Examples include payment processing services, real-time APIs, and primary databases.
- Target RTO: Under 5 minutes
- Target RPO: Near-zero (seconds)
- DR pattern: Active-Active or Warm Standby
Tier 2: Business-Important
Internal tools, batch processing systems, and secondary services that support operations but can tolerate brief outages. Examples include CI/CD pipelines, internal dashboards, and staging environments.
- Target RTO: 30 minutes to 2 hours
- Target RPO: 15 minutes to 1 hour
- DR pattern: Warm Standby or Pilot Light
Tier 3: Non-Critical
Development environments, sandbox clusters, and experimental workloads that can tolerate extended downtime with minimal business impact.
- Target RTO: 4 to 24 hours
- Target RPO: 24 hours
- DR pattern: Backup-Restore
This tiered framework prevents over-engineering. Running Active-Active for a development cluster wastes money, while relying on Backup-Restore for payment processing is negligent. Every organisation we work with starts by classifying workloads before touching a single tool.
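One lightweight way to operationalise this classification is to label namespaces with their tier, so that backup schedules, restore drills, and chaos experiments can select workloads by tier rather than by name. A minimal sketch, assuming an illustrative label key and namespace names:

```bash
# Tag namespaces with their DR tier (label key and namespace names are illustrative)
kubectl label namespace payments       dr-tier=1 --overwrite
kubectl label namespace internal-tools dr-tier=2 --overwrite
kubectl label namespace sandbox        dr-tier=3 --overwrite

# Review everything classified as Tier 1 before designing its backup policy
kubectl get namespaces -l dr-tier=1
```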
What to Back Up in Kubernetes
A common misconception is that Kubernetes backups are straightforward because “everything is declarative.” In reality, a production cluster contains far more state than what lives in your Git repository. Here is a comprehensive inventory of what needs protection.
etcd
The etcd datastore is the brain of your Kubernetes cluster. It stores every object — deployments, services, config maps, secrets, RBAC policies, and custom resource definitions. Losing etcd without a backup means losing the entire cluster state.
We back up etcd using etcdctl snapshot save on a scheduled basis, typically every 15 minutes for Tier 1 clusters. The official etcd disaster recovery documentation details the snapshot and restore process. For managed Kubernetes services where you do not have direct etcd access, the cloud provider manages this, but you should still understand the provider’s backup retention and recovery procedures.
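On self-managed control planes, a scheduled snapshot script along these lines is a reasonable starting point. The endpoint, certificate paths, and bucket below follow kubeadm conventions and are assumptions to adapt for your environment:

```bash
#!/usr/bin/env bash
# Snapshot etcd and ship the file off-cluster. Paths follow kubeadm defaults;
# adjust endpoints, certificate locations, and the upload target to match your setup.
set -euo pipefail

SNAPSHOT="/var/backups/etcd/etcd-$(date +%Y%m%d-%H%M%S).db"

ETCDCTL_API=3 etcdctl snapshot save "${SNAPSHOT}" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot is readable before trusting it
ETCDCTL_API=3 etcdctl snapshot status "${SNAPSHOT}" --write-out=table

# Copy to a bucket in a different region (bucket name is illustrative)
aws s3 cp "${SNAPSHOT}" "s3://example-dr-etcd-snapshots/$(hostname)/"
```

Run it from a systemd timer or cron job on a control-plane node at your chosen cadence, and restore with etcdctl snapshot restore (or etcdutl on newer etcd releases) following the official procedure.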
Persistent Volumes (PVs)
Stateful workloads — databases, message queues, file storage — store data on persistent volumes. Kubernetes does not back these up automatically. You need VolumeSnapshot resources or a tool like Velero to capture PV data consistently.
The challenge is application-consistent snapshots. A naive volume snapshot of a running PostgreSQL database may capture data in a corrupted state. Application-aware backup tools quiesce the application before snapshotting, ensuring data integrity upon restore.
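As a sketch of both halves of that problem: Velero's documented backup-hook annotations can run a command inside the pod before the snapshot, and a CSI VolumeSnapshot captures the underlying volume. The namespace, pod, container, PVC, and snapshot class names are placeholders, and the CHECKPOINT command is only illustrative; a real quiescing strategy depends on the database:

```bash
# Ask Velero to flush PostgreSQL before snapshotting its volume
# (the annotation keys are Velero's backup hooks; all names are placeholders)
kubectl -n databases annotate pod postgres-0 \
  pre.hook.backup.velero.io/container=postgres \
  pre.hook.backup.velero.io/command='["/bin/bash","-c","psql -U postgres -c CHECKPOINT"]' \
  --overwrite

# A CSI VolumeSnapshot of the same PVC, independent of Velero
kubectl apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snap
  namespace: databases
spec:
  volumeSnapshotClassName: csi-snapclass   # must exist in your cluster
  source:
    persistentVolumeClaimName: postgres-data
EOF
```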
Manifests and Configuration
Deployments, services, ingress rules, config maps, network policies, resource quotas, and namespace definitions all need to be recoverable. If you practise GitOps, most of this lives in version control. But we frequently encounter clusters where engineers have made ad-hoc kubectl apply changes that never made it back to Git. These drift silently until a disaster reveals the gap.
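A cheap way to surface that drift before a disaster does is to diff the live cluster against the manifests you believe are authoritative, for example as a scheduled CI job. A minimal sketch, assuming your rendered manifests live in a local manifests/ directory (ArgoCD users get the same signal from argocd app diff):

```bash
# Compare the live cluster against what Git says should be running.
# kubectl diff exits 1 when differences are found, so this is easy to wire into CI.
if ! kubectl diff -f manifests/ --recursive; then
  echo "Drift detected between Git and the cluster" >&2
fi
```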
Custom Resource Definitions (CRDs)
Operators and custom controllers rely on CRDs to extend the Kubernetes API. Cert-manager certificates, Istio virtual services, Prometheus monitoring rules — these are all defined through CRDs. If your backup solution only captures core Kubernetes objects, you will lose these during recovery.
Secrets
Kubernetes secrets often contain database credentials, API keys, TLS certificates, and service account tokens. They must be backed up, but they also require careful handling. Secrets should be encrypted at rest in your backup storage and ideally managed through an external secrets store. For a deep dive into protecting sensitive data, see our guide on Kubernetes secrets management best practices.
Helm Release Metadata
If you deploy applications with Helm, the release history is stored as secrets in the cluster. Without this metadata, helm upgrade and helm rollback commands will fail because Helm cannot find prior release state. Include Helm release secrets in your backup scope.
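To confirm those release records are actually in scope, list them directly; Helm 3 stores each revision as a Secret of type helm.sh/release.v1. The backup name in the second command is illustrative:

```bash
# List Helm release secrets across all namespaces
kubectl get secrets --all-namespaces --field-selector type=helm.sh/release.v1

# Spot-check that a Velero backup captured them (backup name is illustrative)
velero backup describe nightly-full --details | grep -i 'sh.helm.release'
```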
DR Patterns Compared
There is no single correct DR architecture for Kubernetes. The right choice depends on your RTO/RPO requirements, budget, and operational maturity. We categorise DR approaches into four patterns.
Backup-Restore
The simplest pattern. You take periodic backups of cluster state and persistent data, store them in a remote location, and restore to a new cluster when disaster strikes.
How it works: A tool like Velero runs on a schedule, capturing Kubernetes objects and PV snapshots. Backups are stored in object storage (S3, GCS, Azure Blob). During recovery, you provision a new cluster and restore from the latest backup.
Strengths: Low cost, simple to implement, suitable for non-critical workloads.
Weaknesses: Highest RTO (hours), data loss between last backup and failure, manual intervention required.
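In practice this pattern reduces to a pair of Velero commands: a schedule that runs continuously on the production cluster, and a restore you run against a freshly provisioned cluster. The schedule name, namespaces, retention, and timestamped backup name below are placeholders:

```bash
# On the production cluster: back up selected namespaces every 6 hours,
# keep 30 days of history (720h), and snapshot PVs where the plugin supports it.
velero schedule create tier3-every-6h \
  --schedule="0 */6 * * *" \
  --include-namespaces dev,staging \
  --ttl 720h

# After a disaster: install Velero on the new cluster against the same
# object storage bucket, then restore from the most recent backup.
velero backup get
velero restore create --from-backup tier3-every-6h-20250101020000
```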
Pilot Light
A minimal version of your production environment runs continuously in a secondary region. Core infrastructure (networking, DNS, identity) is pre-provisioned, but application workloads are scaled to zero or minimal replicas.
How it works: Terraform or Pulumi maintains the standby infrastructure. During a disaster, you scale up the workloads, restore data from the latest backup, and redirect traffic.
Strengths: Faster recovery than Backup-Restore, moderate cost, infrastructure is pre-validated.
Weaknesses: Still requires data restoration, RTO measured in tens of minutes, some manual steps.
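The “scale up the workloads” step often reduces to a few commands against the standby cluster once Terraform confirms the infrastructure is healthy. A minimal sketch, with the kubectl context and namespace names as placeholders:

```bash
# Bring the standby cluster's workloads up from zero replicas
kubectl --context standby -n payments scale deployment --all --replicas=3
kubectl --context standby -n payments scale statefulset --all --replicas=3

# Wait for rollouts to settle before restoring data and shifting traffic
for d in $(kubectl --context standby -n payments get deployments -o name); do
  kubectl --context standby -n payments rollout status "$d" --timeout=10m
done
```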
Warm Standby (Active-Passive)
A fully running replica of your production environment exists in a secondary region, but it does not serve production traffic. Data is continuously replicated.
How it works: Both clusters run identical workloads. A database replication mechanism (such as PostgreSQL streaming replication or cloud-native replication) keeps data synchronised. DNS failover or a global load balancer redirects traffic during failure.
Strengths: Low RTO (minutes), near-zero RPO with synchronous replication, automated failover possible.
Weaknesses: Higher cost (running two clusters), replication lag management, complexity in testing failover.
Active-Active
Both clusters serve production traffic simultaneously. A global load balancer distributes requests across regions.
How it works: Applications run in multiple regions with data synchronised bi-directionally. Each region can handle the full production load independently.
Strengths: Near-zero RTO and RPO, no failover needed (traffic redistributes automatically), highest availability.
Weaknesses: Highest cost, significant complexity in data consistency (especially for stateful workloads), requires application-level design for multi-region operation.
Pattern Comparison
| Pattern | Cost | Complexity | RPO | RTO | Best For |
|---|---|---|---|---|---|
| Backup-Restore | Low | Low | Hours | Hours | Tier 3, dev/staging |
| Pilot Light | Medium-Low | Medium | Minutes-Hours | 15-60 min | Tier 2/3 workloads |
| Warm Standby | Medium-High | High | Seconds-Minutes | 1-5 min | Tier 1/2 workloads |
| Active-Active | High | Very High | Near-zero | Near-zero | Tier 1 only |
Most organisations we advise adopt a mixed approach: Active-Active or Warm Standby for Tier 1 workloads, Pilot Light for Tier 2, and Backup-Restore for Tier 3. This balances cost against business risk.
GitOps-Driven Disaster Recovery
GitOps fundamentally changes the DR equation. When your entire cluster configuration is declared in Git repositories and reconciled by a GitOps controller, recovery becomes a matter of pointing a new cluster at the same Git source. You are, in effect, practising disaster recovery every time you deploy.
For organisations already using GitOps workflows with ArgoCD and Helm, the recovery process simplifies dramatically.
The GitOps Recovery Workflow
Here is the recovery workflow we standardise across our client engagements:
Step 1: Provision Infrastructure with Terraform
Terraform provisions the new cluster — VPC, subnets, node pools, IAM roles, and the Kubernetes control plane itself. Because Terraform state is stored remotely (in S3 or equivalent), the same infrastructure can be recreated identically in any region. For organisations building production-ready EKS clusters on AWS, this step is already codified.
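The essential property is that nothing about the primary region is hard-coded: the same root module, driven by a remote state backend, can be applied into the recovery region. A minimal sketch, with the module path, backend bucket, and variable names as assumptions:

```bash
# Recreate the cluster in the recovery region from the same Terraform root module.
# Bucket, state key, variable names, and cluster name are illustrative.
cd infrastructure/eks-cluster

terraform init \
  -backend-config="bucket=example-terraform-state" \
  -backend-config="key=dr/eks-recovery.tfstate" \
  -backend-config="region=eu-west-1"

terraform apply \
  -var="region=eu-west-1" \
  -var="cluster_name=prod-recovery" \
  -auto-approve

# Fetch kubeconfig for the new cluster (EKS shown; AKS and GKE have equivalents)
aws eks update-kubeconfig --name prod-recovery --region eu-west-1
```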
Step 2: Bootstrap ArgoCD
ArgoCD is installed on the new cluster and configured to point at the Git repositories containing your application definitions. ArgoCD’s disaster recovery documentation recommends exporting ArgoCD application definitions and storing them in Git alongside your application manifests.
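Bootstrapping is typically two steps: install ArgoCD from its published manifests, then apply a single “app of apps” Application pointing at the repository that defines every other Application. The repository URL and path below are placeholders:

```bash
# Install ArgoCD on the recovery cluster (pin a specific version in real use)
kubectl create namespace argocd
kubectl apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Bootstrap everything else from Git via an app-of-apps Application
kubectl apply -n argocd -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: bootstrap
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/platform-gitops.git   # placeholder
    path: clusters/production
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF
```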
Step 3: ArgoCD Reconciles Application State
Once connected to Git, ArgoCD automatically deploys all applications, services, config maps, CRDs, and their dependencies in the correct order. This is the power of declarative configuration: the desired state is the source of truth, not the running cluster.
Step 4: Velero Restores Persistent Data
While ArgoCD handles stateless application configuration, Velero restores persistent volume data from cross-region backups. This covers databases, file storage, and any other stateful workloads.
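Assuming the primary cluster's Velero has been writing to a bucket that is replicated cross-region, the recovery cluster only needs Velero installed against that bucket before restoring. The provider, plugin version, bucket, backup name, and namespaces below are placeholders:

```bash
# Install Velero on the recovery cluster against the replicated backup bucket
# (plugin version is illustrative; match it to your Velero release)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket example-dr-backups \
  --backup-location-config region=eu-west-1 \
  --snapshot-location-config region=eu-west-1 \
  --secret-file ./velero-credentials

# Restore only the stateful namespaces; ArgoCD owns everything stateless
velero backup get
velero restore create tier1-data-restore \
  --from-backup tier1-hourly-20250101020000 \
  --include-namespaces databases,queues
```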
Step 5: DNS Failover and Validation
After applications are running and data is restored, DNS records are updated to point to the new cluster. Health checks confirm that services are responding correctly before traffic is fully shifted.
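Validation should be scripted rather than eyeballed, so the traffic cut-over is gated on the same checks your monitoring uses. A minimal sketch, with hostnames and endpoints as placeholders:

```bash
# Gate the DNS cut-over on the recovery cluster actually being healthy
ENDPOINTS=(
  "https://api.recovery.example.com/healthz"
  "https://payments.recovery.example.com/healthz"
)

for url in "${ENDPOINTS[@]}"; do
  if ! curl --fail --silent --max-time 5 "$url" > /dev/null; then
    echo "NOT READY: ${url}" >&2
    exit 1
  fi
  echo "OK: ${url}"
done

echo "All health checks passed - safe to update DNS or the global load balancer"
```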
What GitOps Cannot Recover
GitOps recovers declarative state beautifully, but it has blind spots:
- Persistent volume data — Git stores configuration, not database rows. You still need Velero or equivalent.
- In-flight transactions — Requests that were being processed at the moment of failure are lost unless your application implements idempotency.
- External state — DNS records, cloud IAM policies outside Terraform, third-party SaaS configurations, and manual network routes are not captured in your Git repository.
- Runtime drift — Any changes applied with kubectl that were not committed to Git will be lost. This is why enforcing GitOps discipline is a DR concern, not just a workflow preference.
Kubernetes Backup Tools Compared
The tooling landscape for Kubernetes backup and DR has matured significantly. Here is our assessment of the leading options based on production deployments we have managed.
Velero
Velero is the CNCF open-source standard for Kubernetes backup and restore. It captures Kubernetes objects and persistent volumes, storing them in cloud object storage. Velero supports scheduled backups, cross-region restores, and resource filtering.
Strengths: Free, widely adopted, strong community, integrates with all major cloud providers.
Limitations: Only cluster administrators can perform backups (no delegated RBAC), limited application-awareness, single encryption key for all backups, no built-in immutable backup support.
Kasten K10 (Veeam)
Kasten K10 is an enterprise-grade backup platform acquired by Veeam. It holds the largest market mindshare at 36.3% among container backup solutions. Kasten provides a web-based dashboard, granular RBAC, application-aware backups, and ransomware protection through immutable backups.
Strengths: Enterprise features, delegated backup/restore to application teams, envelope encryption with unique keys per backup, policy-driven automation, strong compliance support.
Limitations: Commercial licensing, more complex deployment, higher resource overhead.
Portworx PX-Backup
Portworx, from Pure Storage, offers zero-RPO disaster recovery with synchronous replication and application-aware backups. It is tightly integrated with the Portworx storage platform.
Strengths: Zero-RPO capability, application-aware snapshots, database-as-a-service features, multi-cloud migration support.
Limitations: Requires Portworx storage layer, higher cost, vendor lock-in risk.
TrilioVault for Kubernetes
Trilio provides application-centric backup and recovery with a focus on capturing the complete application state, including metadata, configuration, and data.
Strengths: Application-centric approach, Helm-aware backups, continuous restore capability, no storage lock-in.
Limitations: Smaller community than Velero or Kasten, newer entrant in the market.
CloudCasa
CloudCasa (from Catalogic Software) offers a SaaS-based approach to Kubernetes backup with a free tier for smaller environments.
Strengths: SaaS model reduces operational overhead, free tier available, agentless discovery, cross-cluster restore.
Limitations: Data leaves your environment (potential compliance concern), feature limitations on the free tier.
Tool Comparison
| Feature | Velero | Kasten K10 | Portworx | TrilioVault | CloudCasa |
|---|---|---|---|---|---|
| Licence | Open Source | Commercial | Commercial | Commercial | Freemium |
| Mindshare | 21.1% | 36.3% | — | — | — |
| App-Aware | Limited | Yes | Yes | Yes | Partial |
| Immutable Backups | No | Yes | Yes | Yes | Yes |
| Delegated RBAC | No | Yes | Yes | Partial | Yes |
| Multi-Cluster | Manual | Yes | Yes | Yes | Yes |
| Encryption | Single key | Per-backup keys | Per-backup keys | Per-backup keys | Managed |
For most organisations, we recommend starting with Velero for non-critical workloads and evaluating Kasten K10 for production Tier 1 workloads that require enterprise features, compliance, and delegated access.
Disaster Recovery on Managed Kubernetes
Each major cloud provider offers native backup capabilities for their managed Kubernetes service. Understanding these is essential for building a provider-aligned DR strategy.
Amazon EKS
AWS Backup now provides native support for Amazon EKS, eliminating the need for custom scripts or third-party tooling for basic backup scenarios. AWS Backup can protect EKS cluster resources (deployments, services, config maps) and persistent volumes backed by EBS.
Key capabilities:
- Policy-driven backup schedules with retention rules
- Cross-region and cross-account backup copies for DR
- Integration with AWS Organizations for centralised backup governance
- Point-in-time recovery for EBS-backed persistent volumes
For organisations running EKS, we typically layer AWS Backup for infrastructure-level protection with Velero or Kasten for application-level granularity. Our guide on EKS architecture best practices covers the networking and IAM prerequisites that underpin a solid DR foundation.
Azure AKS
Azure Backup for AKS provides a managed backup experience integrated with the Azure Backup vault. It supports backup and restore of both cluster state and persistent volumes.
Key capabilities:
- Scheduled and on-demand backups through Azure Backup policies
- Granular restore at the namespace or workload level
- Azure Disk and Azure Files snapshot integration
- Vault-tier storage with geo-redundancy for cross-region DR
Google GKE
Backup for GKE provides a native, fully managed backup and restore service for Google Kubernetes Engine. It captures both configuration and volume data.
Key capabilities:
- Backup plans with configurable schedules and retention
- Application-level consistency through custom hooks (pre/post backup)
- Cross-region restore for multi-region DR architectures
- Integration with Google Cloud’s IAM and encryption services
Cross-Cloud Considerations
If your organisation runs Kubernetes across multiple cloud providers, avoid relying solely on provider-native backup tools. A cross-cloud DR strategy requires a provider-agnostic tool like Velero or Kasten that can back up from EKS and restore to GKE, or vice versa.
Building a DR Gameday Playbook
A disaster recovery plan that has never been tested is not a plan — it is a hypothesis. We run quarterly DR gamedays with every client engagement, and we have seen supposedly robust plans fail spectacularly during their first real test.
Here is the gameday framework we use.
Step 1: Define the Scenario
Choose a realistic failure scenario based on your risk assessment. Examples include:
- Total loss of the primary cluster (region outage)
- etcd corruption or deletion
- Accidental namespace deletion by an engineer
- Ransomware encrypting persistent volumes
- Control plane failure (API server unreachable)
Step 2: Establish Success Criteria
Before running the exercise, define what “success” means:
- Cluster restored within the defined RTO
- Data loss within the defined RPO
- All Tier 1 services passing health checks
- Monitoring and alerting operational on the recovered cluster
- No customer-visible errors (for Active-Active or Warm Standby tests)
Step 3: Inject the Failure
Use chaos engineering tools to simulate the disaster in a controlled manner:
Chaos Mesh supports pod failures, network partitions, I/O faults, and time skew — all declaratively configured through Kubernetes CRDs.
LitmusChaos provides a hub of pre-built chaos experiments, including pod deletion, node drain, disk fill, and DNS errors.
For a namespace deletion scenario, you might use Chaos Mesh to kill all pods in a namespace while simultaneously deleting the namespace itself, simulating an accidental kubectl delete namespace production event.
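A Chaos Mesh experiment for the pod-kill half of that scenario looks roughly like this; the target namespace is a placeholder and should obviously be one designated for the gameday:

```bash
# Kill every pod in the target namespace, then delete the namespace itself.
kubectl apply -f - <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: gameday-namespace-loss
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: all
  selector:
    namespaces:
      - gameday-target
EOF

kubectl delete namespace gameday-target --wait=false
```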
Step 4: Execute the Recovery
The on-call team follows the documented runbook to recover. Critical observations during this phase:
- Time every step. Where are the bottlenecks?
- Note every deviation. Did the team need to improvise? That reveals a gap in the runbook.
- Track dependencies. Did the recovery stall waiting for a DNS change, a secrets rotation, or a manual approval?
Step 5: Conduct a Blameless Post-Mortem
After the exercise, the team reviews what happened:
- Did recovery meet the RTO and RPO targets?
- What steps took longer than expected?
- Were any backup artifacts missing, expired, or corrupted?
- Did the runbook have gaps or ambiguities?
- Were the right people available and informed?
Document every finding and update the runbook before the next gameday.
Automation Goals
Over successive gamedays, aim to automate as much of the recovery as possible. The ideal end state is a single command (or automated trigger) that provisions infrastructure, deploys applications, restores data, and validates health — reducing human error and shaving minutes off your RTO.
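The shape of that end state is a thin orchestration script that calls the same tools described above in sequence. Every path, name, and context in this sketch is a placeholder; the value is in having the sequence scripted and versioned at all:

```bash
#!/usr/bin/env bash
# One-command recovery sketch: provision, bootstrap GitOps, restore data, validate.
set -euo pipefail

LATEST_BACKUP="tier1-hourly-20250101020000"   # select the newest verified backup here

echo "[1/4] Provisioning recovery cluster with Terraform"
terraform -chdir=infrastructure/eks-cluster apply -var="region=eu-west-1" -auto-approve
aws eks update-kubeconfig --name prod-recovery --region eu-west-1

echo "[2/4] Bootstrapping ArgoCD"
kubectl create namespace argocd --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
kubectl apply -n argocd -f bootstrap/app-of-apps.yaml

echo "[3/4] Restoring persistent data with Velero"
velero restore create "dr-$(date +%s)" --from-backup "${LATEST_BACKUP}" --wait

echo "[4/4] Validating health endpoints"
./scripts/validate-health.sh   # the curl loop from the GitOps section

echo "Recovery complete - review before shifting DNS"
```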
Common Kubernetes DR Mistakes
In our years of consulting, we have seen the same mistakes repeated across organisations of every size. Avoiding these pitfalls will save you from painful lessons during an actual incident.
1. Relying on Kubernetes Self-Healing as a DR Strategy
Kubernetes can restart a crashed pod and reschedule workloads off a failed node. It cannot recover from etcd corruption, region-wide outages, or the deletion of an entire namespace. Self-healing is a resilience feature, not a DR strategy. These are fundamentally different concerns.
2. Not Backing Up etcd
On self-managed clusters, we have encountered teams with no etcd backup strategy whatsoever. The etcd datastore is the single most critical component. Without it, your cluster cannot function, and rebuilding from scratch without a snapshot is extraordinarily painful.
3. Ignoring Persistent Volume Data
Backing up Kubernetes objects (deployments, services, config maps) without backing up the data on persistent volumes is like backing up a database schema without the data. When you restore, your applications start, but they have no data to serve.
4. Never Testing the Recovery Process
This is the most dangerous and most common mistake. Teams configure Velero, see green status indicators, and assume DR is handled. But untested backups are Schrödinger’s backups — you do not know if they work until you try to restore. We have seen corrupt backups, expired cloud credentials, and misconfigured IAM policies all surface during the first real test. For a broader view of operational pitfalls, see our post on Kubernetes mistakes to avoid in production.
5. Storing Backups in the Same Region as Production
If your backups reside in the same region as your production cluster, a regional outage takes out both. Always replicate backups to a geographically separate region, and verify that cross-region restore actually works.
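With Velero, cross-region protection can be as simple as a second backup storage location in another region (or bucket-level replication configured on the provider side), plus a periodic drill that proves the copy restores. Bucket names, regions, and the backup name below are placeholders:

```bash
# Add a secondary backup location in a different region and confirm it is reachable
velero backup-location create secondary-eu \
  --provider aws \
  --bucket example-dr-backups-eu \
  --config region=eu-west-1

velero backup-location get

# Periodically prove the cross-region copy restores, not just that it exists
velero restore create "restore-drill-$(date +%Y%m%d)" \
  --from-backup tier1-hourly-20250101020000 \
  --namespace-mappings databases:databases-restore-drill
```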
6. No Documented Runbook
When a disaster strikes at 3 AM, the on-call engineer should not be improvising. A clear, step-by-step runbook with commands, expected outputs, and escalation procedures is essential. We create runbooks as living documents that are updated after every gameday exercise.
7. Forgetting About Secrets and Certificates
Secrets that are not backed up or that are backed up without encryption are a dual risk. You either lose access to external services during recovery, or you expose sensitive credentials in your backup storage. Manage secrets through an external secrets manager and ensure TLS certificates (including their private keys) are recoverable. Our guide on Kubernetes secrets management covers this in depth.
8. Neglecting CRDs and Operator State
Custom Resource Definitions power the operator ecosystem — cert-manager, Istio, Prometheus Operator, and many others. If your backup does not include CRDs and their associated custom resources, restoring the cluster will leave these operators broken.
9. Treating DR as a One-Time Project
Infrastructure changes. New services are deployed, new databases are provisioned, backup retention policies expire. A DR plan that was valid six months ago may have critical gaps today. We schedule quarterly reviews of DR plans alongside quarterly gamedays. For organisations strengthening their overall security posture, our Kubernetes security best practices guide provides complementary hardening strategies.
10. No Communication Plan
DR is not purely a technical exercise. Stakeholders need to know what happened, what the impact is, and when services will be restored. Define communication templates, escalation chains, and status page update procedures as part of your DR playbook.
Putting It All Together
Kubernetes disaster recovery is not a single tool or a single document. It is a practice — a combination of architecture decisions, tooling, automation, and regular testing that together ensure your organisation can survive and recover from the worst.
Here is our recommended approach, summarised:
- Classify workloads into Tiers 1, 2, and 3 based on business impact
- Define RTO and RPO for each tier
- Select the appropriate DR pattern (Active-Active through Backup-Restore) per tier
- Implement GitOps to make cluster state declarative and recoverable from Git
- Deploy backup tooling (Velero for open source, Kasten for enterprise) to protect persistent data and cluster state
- Replicate backups cross-region and validate restore regularly
- Document everything in a runbook with specific commands, expected outputs, and escalation paths
- Run quarterly gamedays with chaos engineering tools to validate the plan
- Iterate — update the plan after every gameday, every infrastructure change, and every incident
With 90% of containerised deployments now running on Kubernetes, the stakes have never been higher. The organisations that invest in DR today are the ones that will still be operating tomorrow.
Build a Kubernetes DR Strategy That Survives Real Disasters
A disaster recovery plan is only as good as its last successful test. Too many organisations discover gaps in their DR strategy during an actual incident — when the cost of failure is measured in lost revenue, damaged reputation, and engineering hours spent in crisis mode.
Our team provides comprehensive Kubernetes consulting services to help you:
- Design and implement tiered DR architectures with GitOps-driven recovery, cross-region backup replication, and automated failover aligned to your RTO and RPO targets
- Build and run DR gameday programmes using Chaos Mesh and LitmusChaos to validate your recovery procedures quarterly, with detailed runbooks your on-call team can execute under pressure
- Select, deploy, and optimise backup tooling across EKS, AKS, and GKE, integrating Velero, Kasten, or cloud-native backup services into your existing platform engineering workflows
We have built and tested DR playbooks for production Kubernetes clusters across industries including finance, healthcare, and SaaS. Every engagement starts with a risk assessment and ends with a proven, tested recovery plan.