Kubernetes was designed for resilience. Self-healing pods, replica sets, rolling deployments — these features create a comfortable illusion that your cluster can survive anything. But that illusion breaks the moment you face a corrupted etcd database, a misconfigured Helm release that cascades across namespaces, or an entire cloud region going dark.
With enterprise downtime costing upwards of $300,000 per hour on average, and nearly two-thirds of businesses reporting data loss incidents in the past year, a robust disaster recovery strategy is not optional — it is a survival requirement. Yet in our consulting work, we consistently find that organisations running Kubernetes treat DR as an afterthought, if they address it at all.
Over the past several years, we have helped dozens of organisations build and test disaster recovery playbooks for production Kubernetes clusters across AWS, Azure, and GCP. This post distils everything we have learnt into a practical, opinionated guide. We cover what to back up, which DR patterns to choose, how GitOps transforms recovery, which tools to use, and how to run gameday exercises that actually prove your plan works.
Understanding RTO and RPO for Kubernetes Workloads
Before selecting a DR strategy, you need to define two foundational metrics: Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
RTO is the maximum acceptable duration of downtime. If your RTO is 15 minutes, your systems must be operational within 15 minutes of a failure event.
RPO is the maximum acceptable amount of data loss measured in time. If your RPO is one hour, you can tolerate losing up to one hour of data.
These values are not one-size-fits-all. We recommend classifying your workloads into three tiers:
Tier 1: Mission-Critical
These are revenue-generating, customer-facing applications where any downtime directly impacts the business. Examples include payment processing services, real-time APIs, and primary databases.
- Target RTO: Under 5 minutes
- Target RPO: Near-zero (seconds)
- DR pattern: Active-Active or Warm Standby
Tier 2: Business-Important
Internal tools, batch processing systems, and secondary services that support operations but can tolerate brief outages. Examples include CI/CD pipelines, internal dashboards, and staging environments.
- Target RTO: 30 minutes to 2 hours
- Target RPO: 15 minutes to 1 hour
- DR pattern: Warm Standby or Pilot Light
Tier 3: Non-Critical
Development environments, sandbox clusters, and experimental workloads that can tolerate extended downtime with minimal business impact.
- Target RTO: 4 to 24 hours
- Target RPO: 24 hours
- DR pattern: Backup-Restore
This tiered framework prevents over-engineering. Running Active-Active for a development cluster wastes money, while relying on Backup-Restore for payment processing is negligent. Every organisation we work with starts by classifying workloads before touching a single tool.
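One lightweight way to operationalise this classification is to label namespaces with their tier, so that backup schedules, restore drills, and chaos experiments can select workloads by tier rather than by name. A minimal sketch, assuming an illustrative label key and namespace names:

```bash
# Tag namespaces with their DR tier (label key and namespace names are illustrative)
kubectl label namespace payments       dr-tier=1 --overwrite
kubectl label namespace internal-tools dr-tier=2 --overwrite
kubectl label namespace sandbox        dr-tier=3 --overwrite

# Review everything classified as Tier 1 before designing its backup policy
kubectl get namespaces -l dr-tier=1
```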
What to Back Up in Kubernetes
A common misconception is that Kubernetes backups are straightforward because “everything is declarative.” In reality, a production cluster contains far more state than what lives in your Git repository. Here is a comprehensive inventory of what needs protection.
etcd
The etcd datastore is the brain of your Kubernetes cluster. It stores every object — deployments, services, config maps, secrets, RBAC policies, and custom resource definitions. Losing etcd without a backup means losing the entire cluster state.
We back up etcd using etcdctl snapshot save on a scheduled basis, typically every 15 minutes for Tier 1 clusters. The official etcd disaster recovery documentation details the snapshot and restore process. For managed Kubernetes services where you do not have direct etcd access, the cloud provider manages this, but you should still understand the provider’s backup retention and recovery procedures.
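On self-managed control planes, a scheduled snapshot script along these lines is a reasonable starting point. The endpoint, certificate paths, and bucket below follow kubeadm conventions and are assumptions to adapt for your environment:

```bash
#!/usr/bin/env bash
# Snapshot etcd and ship the file off-cluster. Paths follow kubeadm defaults;
# adjust endpoints, certificate locations, and the upload target to match your setup.
set -euo pipefail

SNAPSHOT="/var/backups/etcd/etcd-$(date +%Y%m%d-%H%M%S).db"

ETCDCTL_API=3 etcdctl snapshot save "${SNAPSHOT}" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot is readable before trusting it
ETCDCTL_API=3 etcdctl snapshot status "${SNAPSHOT}" --write-out=table

# Copy to a bucket in a different region (bucket name is illustrative)
aws s3 cp "${SNAPSHOT}" "s3://example-dr-etcd-snapshots/$(hostname)/"
```

Run it from a systemd timer or cron job on a control-plane node at your chosen cadence, and restore with etcdctl snapshot restore (or etcdutl on newer etcd releases) following the official procedure.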
Persistent Volumes (PVs)
Stateful workloads — databases, message queues, file storage — store data on persistent volumes. Kubernetes does not back these up automatically. You need VolumeSnapshot resources or a tool like Velero to capture PV data consistently.
The challenge is application-consistent snapshots. A naive volume snapshot of a running PostgreSQL database may capture data in a corrupted state. Application-aware backup tools quiesce the application before snapshotting, ensuring data integrity upon restore.
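As a sketch of both halves of that problem: Velero's documented backup-hook annotations can run a command inside the pod before the snapshot, and a CSI VolumeSnapshot captures the underlying volume. The namespace, pod, container, PVC, and snapshot class names are placeholders, and the CHECKPOINT command is only illustrative; a real quiescing strategy depends on the database:

```bash
# Ask Velero to flush PostgreSQL before snapshotting its volume
# (the annotation keys are Velero's backup hooks; all names are placeholders)
kubectl -n databases annotate pod postgres-0 \
  pre.hook.backup.velero.io/container=postgres \
  pre.hook.backup.velero.io/command='["/bin/bash","-c","psql -U postgres -c CHECKPOINT"]' \
  --overwrite

# A CSI VolumeSnapshot of the same PVC, independent of Velero
kubectl apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snap
  namespace: databases
spec:
  volumeSnapshotClassName: csi-snapclass   # must exist in your cluster
  source:
    persistentVolumeClaimName: postgres-data
EOF
```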
Manifests and Configuration
Deployments, services, ingress rules, config maps, network policies, resource quotas, and namespace definitions all need to be recoverable. If you practise GitOps, most of this lives in version control. But we frequently encounter clusters where engineers have made ad-hoc kubectl apply changes that never made it back to Git. These drift silently until a disaster reveals the gap.
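A cheap way to surface that drift before a disaster does is to diff the live cluster against the manifests you believe are authoritative, for example as a scheduled CI job. A minimal sketch, assuming your rendered manifests live in a local manifests/ directory (ArgoCD users get the same signal from argocd app diff):

```bash
# Compare the live cluster against what Git says should be running.
# kubectl diff exits 1 when differences are found, so this is easy to wire into CI.
if ! kubectl diff -f manifests/ --recursive; then
  echo "Drift detected between Git and the cluster" >&2
fi
```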
Custom Resource Definitions (CRDs)
Operators and custom controllers rely on CRDs to extend the Kubernetes API. Cert-manager certificates, Istio virtual services, Prometheus monitoring rules — these are all defined through CRDs. If your backup solution only captures core Kubernetes objects, you will lose these during recovery.
Secrets
Kubernetes secrets often contain database credentials, API keys, TLS certificates, and service account tokens. They must be backed up, but they also require careful handling. Secrets should be encrypted at rest in your backup storage and ideally managed through an external secrets store. For a deep dive into protecting sensitive data, see our guide on Kubernetes secrets management best practices.
Helm Release Metadata
If you deploy applications with Helm, the release history is stored as secrets in the cluster. Without this metadata, helm upgrade and helm rollback commands will fail because Helm cannot find prior release state. Include Helm release secrets in your backup scope.
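To confirm those release records are actually in scope, list them directly; Helm 3 stores each revision as a Secret of type helm.sh/release.v1. The backup name in the second command is illustrative:

```bash
# List Helm release secrets across all namespaces
kubectl get secrets --all-namespaces --field-selector type=helm.sh/release.v1

# Spot-check that a Velero backup captured them (backup name is illustrative)
velero backup describe nightly-full --details | grep -i 'sh.helm.release'
```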
DR Patterns Compared
There is no single correct DR architecture for Kubernetes. The right choice depends on your RTO/RPO requirements, budget, and operational maturity. We categorise DR approaches into four patterns.
Backup-Restore
The simplest pattern. You take periodic backups of cluster state and persistent data, store them in a remote location, and restore to a new cluster when disaster strikes.
How it works: A tool like Velero runs on a schedule, capturing Kubernetes objects and PV snapshots. Backups are stored in object storage (S3, GCS, Azure Blob). During recovery, you provision a new cluster and restore from the latest backup.
Strengths: Low cost, simple to implement, suitable for non-critical workloads.
Weaknesses: Highest RTO (hours), data loss between last backup and failure, manual intervention required.
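In practice this pattern reduces to a pair of Velero commands: a schedule that runs continuously on the production cluster, and a restore you run against a freshly provisioned cluster. The schedule name, namespaces, retention, and timestamped backup name below are placeholders:

```bash
# On the production cluster: back up selected namespaces every 6 hours,
# keep 30 days of history (720h), and snapshot PVs where the plugin supports it.
velero schedule create tier3-every-6h \
  --schedule="0 */6 * * *" \
  --include-namespaces dev,staging \
  --ttl 720h

# After a disaster: install Velero on the new cluster against the same
# object storage bucket, then restore from the most recent backup.
velero backup get
velero restore create --from-backup tier3-every-6h-20250101020000
```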
Pilot Light
A minimal version of your production environment runs continuously in a secondary region. Core infrastructure (networking, DNS, identity) is pre-provisioned, but application workloads are scaled to zero or minimal replicas.
How it works: Terraform or Pulumi maintains the standby infrastructure. During a disaster, you scale up the workloads, restore data from the latest backup, and redirect traffic.
Strengths: Faster recovery than Backup-Restore, moderate cost, infrastructure is pre-validated.
Weaknesses: Still requires data restoration, RTO measured in tens of minutes, some manual steps.
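The “scale up the workloads” step often reduces to a few commands against the standby cluster once Terraform confirms the infrastructure is healthy. A minimal sketch, with the kubectl context and namespace names as placeholders:

```bash
# Bring the standby cluster's workloads up from zero replicas
kubectl --context standby -n payments scale deployment --all --replicas=3
kubectl --context standby -n payments scale statefulset --all --replicas=3

# Wait for rollouts to settle before restoring data and shifting traffic
for d in $(kubectl --context standby -n payments get deployments -o name); do
  kubectl --context standby -n payments rollout status "$d" --timeout=10m
done
```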
Warm Standby (Active-Passive)
A fully running replica of your production environment exists in a secondary region, but it does not serve production traffic. Data is continuously replicated.
How it works: Both clusters run identical workloads. A database replication mechanism (such as PostgreSQL streaming replication or cloud-native replication) keeps data synchronised. DNS failover or a global load balancer redirects traffic during failure.
Strengths: Low RTO (minutes), near-zero RPO with synchronous replication, automated failover possible.
Weaknesses: Higher cost (running two clusters), replication lag management, complexity in testing failover.
Active-Active
Both clusters serve production traffic simultaneously. A global load balancer distributes requests across regions.
How it works: Applications run in multiple regions with data synchronised bi-directionally. Each region can handle the full production load independently.
Strengths: Near-zero RTO and RPO, no failover needed (traffic redistributes automatically), highest availability.
Weaknesses: Highest cost, significant complexity in data consistency (especially for stateful workloads), requires application-level design for multi-region operation.
Pattern Comparison
| Pattern | Cost | Complexity | RPO | RTO | Best For |
|---|---|---|---|---|---|
| Backup-Restore | Low | Low | Hours | Hours | Tier 3, dev/staging |
| Pilot Light | Medium-Low | Medium | Minutes-Hours | 15-60 min | Tier 2/3 workloads |
| Warm Standby | Medium-High | High | Seconds-Minutes | 1-5 min | Tier 1/2 workloads |
| Active-Active | High | Very High | Near-zero | Near-zero | Tier 1 only |
Most organisations we advise adopt a mixed approach: Active-Active or Warm Standby for Tier 1 workloads, Pilot Light for Tier 2, and Backup-Restore for Tier 3. This balances cost against business risk.
GitOps-Driven Disaster Recovery
GitOps fundamentally changes the DR equation. When your entire cluster configuration is declared in Git repositories and reconciled by a GitOps controller, recovery becomes a matter of pointing a new cluster at the same Git source. You are, in effect, practising disaster recovery every time you deploy.
For organisations already using GitOps workflows with ArgoCD and Helm, the recovery process simplifies dramatically.
The GitOps Recovery Workflow
Here is the recovery workflow we standardise across our client engagements:
Step 1: Provision Infrastructure with Terraform
Terraform provisions the new cluster — VPC, subnets, node pools, IAM roles, and the Kubernetes control plane itself. Because Terraform state is stored remotely (in S3 or equivalent), the same infrastructure can be recreated identically in any region. For organisations building production-ready EKS clusters on AWS, this step is already codified.
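The essential property is that nothing about the primary region is hard-coded: the same root module, driven by a remote state backend, can be applied into the recovery region. A minimal sketch, with the module path, backend bucket, and variable names as assumptions:

```bash
# Recreate the cluster in the recovery region from the same Terraform root module.
# Bucket, state key, variable names, and cluster name are illustrative.
cd infrastructure/eks-cluster

terraform init \
  -backend-config="bucket=example-terraform-state" \
  -backend-config="key=dr/eks-recovery.tfstate" \
  -backend-config="region=eu-west-1"

terraform apply \
  -var="region=eu-west-1" \
  -var="cluster_name=prod-recovery" \
  -auto-approve

# Fetch kubeconfig for the new cluster (EKS shown; AKS and GKE have equivalents)
aws eks update-kubeconfig --name prod-recovery --region eu-west-1
```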
Step 2: Bootstrap ArgoCD
ArgoCD is installed on the new cluster and configured to point at the Git repositories containing your application definitions. ArgoCD’s disaster recovery documentation recommends exporting ArgoCD application definitions and storing them in Git alongside your application manifests.
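Bootstrapping is typically two steps: install ArgoCD from its published manifests, then apply a single “app of apps” Application pointing at the repository that defines every other Application. The repository URL and path below are placeholders:

```bash
# Install ArgoCD on the recovery cluster (pin a specific version in real use)
kubectl create namespace argocd
kubectl apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Bootstrap everything else from Git via an app-of-apps Application
kubectl apply -n argocd -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: bootstrap
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/platform-gitops.git   # placeholder
    path: clusters/production
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF
```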
Step 3: ArgoCD Reconciles Application State
Once connected to Git, ArgoCD automatically deploys all applications, services, config maps, CRDs, and their dependencies in the correct order. This is the power of declarative configuration: the desired state is the source of truth, not the running cluster.
Step 4: Velero Restores Persistent Data
While ArgoCD handles stateless application configuration, Velero restores persistent volume data from cross-region backups. This covers databases, file storage, and any other stateful workloads.
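Assuming the primary cluster's Velero has been writing to a bucket that is replicated cross-region, the recovery cluster only needs Velero installed against that bucket before restoring. The provider, plugin version, bucket, backup name, and namespaces below are placeholders:

```bash
# Install Velero on the recovery cluster against the replicated backup bucket
# (plugin version is illustrative; match it to your Velero release)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket example-dr-backups \
  --backup-location-config region=eu-west-1 \
  --snapshot-location-config region=eu-west-1 \
  --secret-file ./velero-credentials

# Restore only the stateful namespaces; ArgoCD owns everything stateless
velero backup get
velero restore create tier1-data-restore \
  --from-backup tier1-hourly-20250101020000 \
  --include-namespaces databases,queues
```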
Step 5: DNS Failover and Validation
After applications are running and data is restored, DNS records are updated to point to the new cluster. Health checks confirm that services are responding correctly before traffic is fully shifted.
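Validation should be scripted rather than eyeballed, so the traffic cut-over is gated on the same checks your monitoring uses. A minimal sketch, with hostnames and endpoints as placeholders:

```bash
# Gate the DNS cut-over on the recovery cluster actually being healthy
ENDPOINTS=(
  "https://api.recovery.example.com/healthz"
  "https://payments.recovery.example.com/healthz"
)

for url in "${ENDPOINTS[@]}"; do
  if ! curl --fail --silent --max-time 5 "$url" > /dev/null; then
    echo "NOT READY: ${url}" >&2
    exit 1
  fi
  echo "OK: ${url}"
done

echo "All health checks passed - safe to update DNS or the global load balancer"
```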
What GitOps Cannot Recover
GitOps recovers declarative state beautifully, but it has blind spots:
- Persistent volume data — Git stores configuration, not database rows. You still need Velero or equivalent.
- In-flight transactions — Requests that were being processed at the moment of failure are lost unless your application implements idempotency.
- External state — DNS records, cloud IAM policies outside Terraform, third-party SaaS configurations, and manual network routes are not captured in your Git repository.
- Runtime drift — Any changes applied with kubectl that were not committed to Git will be lost. This is why enforcing GitOps discipline is a DR concern, not just a workflow preference.
Kubernetes Backup Tools Compared
The tooling landscape for Kubernetes backup and DR has matured significantly. Here is our assessment of the leading options based on production deployments we have managed.
Velero
Velero is the CNCF open-source standard for Kubernetes backup and restore. It captures Kubernetes objects and persistent volumes, storing them in cloud object storage. Velero supports scheduled backups, cross-region restores, and resource filtering.
Strengths: Free, widely adopted, strong community, integrates with all major cloud providers.
Limitations: Only cluster administrators can perform backups (no delegated RBAC), limited application-awareness, single encryption key for all backups, no built-in immutable backup support.
Kasten K10 (Veeam)
Kasten K10 is an enterprise-grade backup platform acquired by Veeam. It holds the largest market mindshare at 36.3% among container backup solutions. Kasten provides a web-based dashboard, granular RBAC, application-aware backups, and ransomware protection through immutable backups.
Strengths: Enterprise features, delegated backup/restore to application teams, envelope encryption with unique keys per backup, policy-driven automation, strong compliance support.
Limitations: Commercial licensing, more complex deployment, higher resource overhead.
Portworx PX-Backup
Portworx, from Pure Storage, offers zero-RPO disaster recovery with synchronous replication and application-aware backups. It is tightly integrated with the Portworx storage platform.
Strengths: Zero-RPO capability, application-aware snapshots, database-as-a-service features, multi-cloud migration support.
Limitations: Requires Portworx storage layer, higher cost, vendor lock-in risk.
TrilioVault for Kubernetes
Trilio provides application-centric backup and recovery with a focus on capturing the complete application state, including metadata, configuration, and data.
Strengths: Application-centric approach, Helm-aware backups, continuous restore capability, no storage lock-in.
Limitations: Smaller community than Velero or Kasten, newer entrant in the market.
CloudCasa
CloudCasa (from Catalogic Software) offers a SaaS-based approach to Kubernetes backup with a free tier for smaller environments.
Strengths: SaaS model reduces operational overhead, free tier available, agentless discovery, cross-cluster restore.
Limitations: Data leaves your environment (potential compliance concern), feature limitations on the free tier.
Tool Comparison
| Feature | Velero | Kasten K10 | Portworx | TrilioVault | CloudCasa |
|---|---|---|---|---|---|
| Licence | Open Source | Commercial | Commercial | Commercial | Freemium |
| Mindshare | 21.1% | 36.3% | — | — | — |
| App-Aware | Limited | Yes | Yes | Yes | Partial |
| Immutable Backups | No | Yes | Yes | Yes | Yes |
| Delegated RBAC | No | Yes | Yes | Partial | Yes |
| Multi-Cluster | Manual | Yes | Yes | Yes | Yes |
| Encryption | Single key | Per-backup keys | Per-backup keys | Per-backup keys | Managed |
For most organisations, we recommend starting with Velero for non-critical workloads and evaluating Kasten K10 for production Tier 1 workloads that require enterprise features, compliance, and delegated access.
Disaster Recovery on Managed Kubernetes
Each major cloud provider offers native backup capabilities for their managed Kubernetes service. Understanding these is essential for building a provider-aligned DR strategy.
Amazon EKS
AWS Backup now provides native support for Amazon EKS, eliminating the need for custom scripts or third-party tooling for basic backup scenarios. AWS Backup can protect EKS cluster resources (deployments, services, config maps) and persistent volumes backed by EBS.
Key capabilities:
- Policy-driven backup schedules with retention rules
- Cross-region and cross-account backup copies for DR
- Integration with AWS Organizations for centralised backup governance
- Point-in-time recovery for EBS-backed persistent volumes
For organisations running EKS, we typically layer AWS Backup for infrastructure-level protection with Velero or Kasten for application-level granularity. Our guide on EKS architecture best practices covers the networking and IAM prerequisites that underpin a solid DR foundation.
Azure AKS
Azure Backup for AKS provides a managed backup experience integrated with the Azure Backup vault. It supports backup and restore of both cluster state and persistent volumes.
Key capabilities:
- Scheduled and on-demand backups through Azure Backup policies
- Granular restore at the namespace or workload level
- Azure Disk and Azure Files snapshot integration
- Vault-tier storage with geo-redundancy for cross-region DR
Google GKE
Backup for GKE provides a native, fully managed backup and restore service for Google Kubernetes Engine. It captures both configuration and volume data.
Key capabilities:
- Backup plans with configurable schedules and retention
- Application-level consistency through custom hooks (pre/post backup)
- Cross-region restore for multi-region DR architectures
- Integration with Google Cloud’s IAM and encryption services
Cross-Cloud Considerations
If your organisation runs Kubernetes across multiple cloud providers, avoid relying solely on provider-native backup tools. A cross-cloud DR strategy requires a provider-agnostic tool like Velero or Kasten that can back up from EKS and restore to GKE, or vice versa.
Building a DR Gameday Playbook
A disaster recovery plan that has never been tested is not a plan — it is a hypothesis. We run quarterly DR gamedays with every client engagement, and we have seen supposedly robust plans fail spectacularly during their first real test.
Here is the gameday framework we use.
Step 1: Define the Scenario
Choose a realistic failure scenario based on your risk assessment. Examples include:
- Total loss of the primary cluster (region outage)
- etcd corruption or deletion
- Accidental namespace deletion by an engineer
- Ransomware encrypting persistent volumes
- Control plane failure (API server unreachable)
Step 2: Establish Success Criteria
Before running the exercise, define what “success” means:
- Cluster restored within the defined RTO
- Data loss within the defined RPO
- All Tier 1 services passing health checks
- Monitoring and alerting operational on the recovered cluster
- No customer-visible errors (for Active-Active or Warm Standby tests)
Step 3: Inject the Failure
Use chaos engineering tools to simulate the disaster in a controlled manner:
Chaos Mesh supports pod failures, network partitions, I/O faults, and time skew — all declaratively configured through Kubernetes CRDs.
LitmusChaos provides a hub of pre-built chaos experiments, including pod deletion, node drain, disk fill, and DNS errors.
For a namespace deletion scenario, you might use Chaos Mesh to kill all pods in a namespace while simultaneously deleting the namespace itself, simulating an accidental kubectl delete namespace production event.
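A Chaos Mesh experiment for the pod-kill half of that scenario looks roughly like this; the target namespace is a placeholder and should obviously be one designated for the gameday:

```bash
# Kill every pod in the target namespace, then delete the namespace itself.
kubectl apply -f - <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: gameday-namespace-loss
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: all
  selector:
    namespaces:
      - gameday-target
EOF

kubectl delete namespace gameday-target --wait=false
```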
Step 4: Execute the Recovery
The on-call team follows the documented runbook to recover. Critical observations during this phase:
- Time every step. Where are the bottlenecks?
- Note every deviation. Did the team need to improvise? That reveals a gap in the runbook.
- Track dependencies. Did the recovery stall waiting for a DNS change, a secrets rotation, or a manual approval?
Step 5: Conduct a Blameless Post-Mortem
After the exercise, the team reviews what happened:
- Did recovery meet the RTO and RPO targets?
- What steps took longer than expected?
- Were any backup artifacts missing, expired, or corrupted?
- Did the runbook have gaps or ambiguities?
- Were the right people available and informed?
Document every finding and update the runbook before the next gameday.
Automation Goals
Over successive gamedays, aim to automate as much of the recovery as possible. The ideal end state is a single command (or automated trigger) that provisions infrastructure, deploys applications, restores data, and validates health — reducing human error and shaving minutes off your RTO.
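The shape of that end state is a thin orchestration script that calls the same tools described above in sequence. Every path, name, and context in this sketch is a placeholder; the value is in having the sequence scripted and versioned at all:

```bash
#!/usr/bin/env bash
# One-command recovery sketch: provision, bootstrap GitOps, restore data, validate.
set -euo pipefail

LATEST_BACKUP="tier1-hourly-20250101020000"   # select the newest verified backup here

echo "[1/4] Provisioning recovery cluster with Terraform"
terraform -chdir=infrastructure/eks-cluster apply -var="region=eu-west-1" -auto-approve
aws eks update-kubeconfig --name prod-recovery --region eu-west-1

echo "[2/4] Bootstrapping ArgoCD"
kubectl create namespace argocd --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
kubectl apply -n argocd -f bootstrap/app-of-apps.yaml

echo "[3/4] Restoring persistent data with Velero"
velero restore create "dr-$(date +%s)" --from-backup "${LATEST_BACKUP}" --wait

echo "[4/4] Validating health endpoints"
./scripts/validate-health.sh   # the curl loop from the GitOps section

echo "Recovery complete - review before shifting DNS"
```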
Common Kubernetes DR Mistakes
In our years of consulting, we have seen the same mistakes repeated across organisations of every size. Avoiding these pitfalls will save you from painful lessons during an actual incident.
1. Relying on Kubernetes Self-Healing as a DR Strategy
Kubernetes can restart a crashed pod and reschedule workloads off a failed node. It cannot recover from etcd corruption, region-wide outages, or the deletion of an entire namespace. Self-healing is a resilience feature, not a DR strategy. These are fundamentally different concerns.
2. Not Backing Up etcd
On self-managed clusters, we have encountered teams with no etcd backup strategy whatsoever. The etcd datastore is the single most critical component. Without it, your cluster cannot function, and rebuilding from scratch without a snapshot is extraordinarily painful.
3. Ignoring Persistent Volume Data
Backing up Kubernetes objects (deployments, services, config maps) without backing up the data on persistent volumes is like backing up a database schema without the data. When you restore, your applications start, but they have no data to serve.
4. Never Testing the Recovery Process
This is the most dangerous and most common mistake. Teams configure Velero, see green status indicators, and assume DR is handled. But untested backups are Schrödinger’s backups — you do not know if they work until you try to restore. We have seen corrupt backups, expired cloud credentials, and misconfigured IAM policies all surface during the first real test. For a broader view of operational pitfalls, see our post on Kubernetes mistakes to avoid in production.
5. Storing Backups in the Same Region as Production
If your backups reside in the same region as your production cluster, a regional outage takes out both. Always replicate backups to a geographically separate region, and verify that cross-region restore actually works.
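With Velero, cross-region protection can be as simple as a second backup storage location in another region (or bucket-level replication configured on the provider side), plus a periodic drill that proves the copy restores. Bucket names, regions, and the backup name below are placeholders:

```bash
# Add a secondary backup location in a different region and confirm it is reachable
velero backup-location create secondary-eu \
  --provider aws \
  --bucket example-dr-backups-eu \
  --config region=eu-west-1

velero backup-location get

# Periodically prove the cross-region copy restores, not just that it exists
velero restore create "restore-drill-$(date +%Y%m%d)" \
  --from-backup tier1-hourly-20250101020000 \
  --namespace-mappings databases:databases-restore-drill
```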
6. No Documented Runbook
When a disaster strikes at 3 AM, the on-call engineer should not be improvising. A clear, step-by-step runbook with commands, expected outputs, and escalation procedures is essential. We create runbooks as living documents that are updated after every gameday exercise.
7. Forgetting About Secrets and Certificates
Secrets that are not backed up or that are backed up without encryption are a dual risk. You either lose access to external services during recovery, or you expose sensitive credentials in your backup storage. Manage secrets through an external secrets manager and ensure TLS certificates (including their private keys) are recoverable. Our guide on Kubernetes secrets management covers this in depth.
8. Neglecting CRDs and Operator State
Custom Resource Definitions power the operator ecosystem — cert-manager, Istio, Prometheus Operator, and many others. If your backup does not include CRDs and their associated custom resources, restoring the cluster will leave these operators broken.
9. Treating DR as a One-Time Project
Infrastructure changes. New services are deployed, new databases are provisioned, backup retention policies expire. A DR plan that was valid six months ago may have critical gaps today. We schedule quarterly reviews of DR plans alongside quarterly gamedays. For organisations strengthening their overall security posture, our Kubernetes security best practices guide provides complementary hardening strategies.
10. No Communication Plan
DR is not purely a technical exercise. Stakeholders need to know what happened, what the impact is, and when services will be restored. Define communication templates, escalation chains, and status page update procedures as part of your DR playbook.
Putting It All Together
Kubernetes disaster recovery is not a single tool or a single document. It is a practice — a combination of architecture decisions, tooling, automation, and regular testing that together ensure your organisation can survive and recover from the worst.
Here is our recommended approach, summarised:
- Classify workloads into Tiers 1, 2, and 3 based on business impact
- Define RTO and RPO for each tier
- Select the appropriate DR pattern (Active-Active through Backup-Restore) per tier
- Implement GitOps to make cluster state declarative and recoverable from Git
- Deploy backup tooling (Velero for open source, Kasten for enterprise) to protect persistent data and cluster state
- Replicate backups cross-region and validate restore regularly
- Document everything in a runbook with specific commands, expected outputs, and escalation paths
- Run quarterly gamedays with chaos engineering tools to validate the plan
- Iterate — update the plan after every gameday, every infrastructure change, and every incident
With 90% of containerised deployments now running on Kubernetes, the stakes have never been higher. The organisations that invest in DR today are the ones that will still be operating tomorrow.
Build a Kubernetes DR Strategy That Survives Real Disasters
A disaster recovery plan is only as good as its last successful test. Too many organisations discover gaps in their DR strategy during an actual incident — when the cost of failure is measured in lost revenue, damaged reputation, and engineering hours spent in crisis mode.
Our team provides comprehensive Kubernetes consulting services to help you:
- Design and implement tiered DR architectures with GitOps-driven recovery, cross-region backup replication, and automated failover aligned to your RTO and RPO targets
- Build and run DR gameday programmes using Chaos Mesh and LitmusChaos to validate your recovery procedures quarterly, with detailed runbooks your on-call team can execute under pressure
- Select, deploy, and optimise backup tooling across EKS, AKS, and GKE, integrating Velero, Kasten, or cloud-native backup services into your existing platform engineering workflows
We have built and tested DR playbooks for production Kubernetes clusters across industries including finance, healthcare, and SaaS. Every engagement starts with a risk assessment and ends with a proven, tested recovery plan.