The debate about whether databases belong on Kubernetes is over. According to the Data on Kubernetes 2024 Report, 72% of organisations now use Kubernetes for database management, up from a minority just a few years ago. Meanwhile, 98% of respondents run data-intensive workloads on cloud-native platforms. The question has shifted from “should we?” to “how do we do it properly?”
At Tasrie IT Services, we have deployed and managed databases on over 200 Kubernetes clusters for clients across financial services, healthcare, e-commerce, and SaaS. We have seen what works in production and what causes 3am pages. This guide distils our hands-on experience into a practical reference covering StatefulSets, database operators, storage architecture, cost analysis, and a decision framework for choosing between managed services and Kubernetes-native databases.
StatefulSet Fundamentals
A StatefulSet is the Kubernetes workload controller purpose-built for stateful applications. Unlike Deployments, which treat pods as interchangeable cattle, StatefulSets treat each pod as a distinct entity with a stable identity that persists across restarts and rescheduling.
Stable Network Identities
Every pod in a StatefulSet receives a predictable name following the pattern $(statefulset-name)-$(ordinal). A PostgreSQL StatefulSet named pg-cluster produces pods named pg-cluster-0, pg-cluster-1, and pg-cluster-2. This predictability is critical for databases: your application can always connect to the primary at pg-cluster-0 and read replicas at pg-cluster-1 or pg-cluster-2.
StatefulSets require a Headless Service (clusterIP: None) that provides DNS entries for each pod. The resulting DNS pattern is:
$(pod-name).$(service-name).$(namespace).svc.cluster.local
This means pg-cluster-0.pg-service.databases.svc.cluster.local always resolves to the same logical pod, regardless of which node it runs on.
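For reference, a minimal headless Service matching this example might look as follows — a sketch in which the `databases` namespace, the `app: pg-cluster` selector, and the port are assumptions to adapt to your own labels:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: pg-service
  namespace: databases
spec:
  clusterIP: None        # headless: per-pod DNS records instead of a virtual IP
  selector:
    app: pg-cluster
  ports:
  - name: postgres
    port: 5432
```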
Ordered Deployment and Scaling
By default, StatefulSets use the OrderedReady pod management policy. Pods are created sequentially — pod-0 must be running and ready before pod-1 starts. Termination happens in reverse order. This matters for databases where the primary must initialise before replicas attempt to join the cluster.
For databases with built-in cluster membership protocols (CockroachDB, TiDB), you can use the Parallel policy instead, allowing all pods to start simultaneously and reducing scale-up time considerably.
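Switching policies is a one-line change in the StatefulSet spec:

```yaml
# StatefulSet spec fragment: create and delete all pods in parallel.
# Safe only for databases that manage their own cluster membership.
spec:
  podManagementPolicy: Parallel
```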
VolumeClaimTemplates
The defining storage feature of StatefulSets is volumeClaimTemplates. Rather than sharing a single PersistentVolumeClaim across all pods, each pod gets its own dedicated PVC:
```yaml
volumeClaimTemplates:
- metadata:
    name: data
  spec:
    accessModes: ["ReadWriteOnce"]
    storageClassName: gp3-encrypted
    resources:
      requests:
        storage: 100Gi
```
When pg-cluster-0 is created, Kubernetes provisions a PVC named data-pg-cluster-0. If the pod is deleted and recreated, it reattaches to the same PVC, preserving all data. This per-pod storage isolation is non-negotiable for databases where each replica maintains its own data directory.
One important caveat: Kubernetes does not allow you to modify volumeClaimTemplates after creation. If you need to resize volumes, the workaround is to manually patch each PVC’s spec.resources.requests.storage, delete the StatefulSet with --cascade=orphan (pods keep running with zero downtime), update the YAML with the new volume size, and reapply. It is awkward, but it works reliably in production.
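The workaround can be sketched as a short kubectl sequence. All names here (`pg-cluster`, the `data` claim, the 200Gi target, `pg-cluster.yaml`) are illustrative, and the StorageClass must have `allowVolumeExpansion: true`:

```shell
# 1. Grow each PVC in place (the CSI driver expands the underlying volume)
for i in 0 1 2; do
  kubectl patch pvc "data-pg-cluster-$i" \
    -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
done

# 2. Delete only the StatefulSet object; pods and PVCs keep running
kubectl delete statefulset pg-cluster --cascade=orphan

# 3. Re-apply the manifest with the updated volumeClaimTemplates size
kubectl apply -f pg-cluster.yaml
```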
StatefulSet vs Deployment: When to Use Which
We regularly encounter teams using Deployments for stateful workloads because they are more familiar. This table clarifies when each controller is appropriate:
| Feature | Deployment | StatefulSet |
|---|---|---|
| Pod naming | Random hash (e.g., app-7f8d9c) | Ordinal index (e.g., app-0, app-1) |
| Storage | Shared PVC across all pods | Dedicated PVC per pod |
| Scaling order | Simultaneous | Sequential (ordered) |
| Network identity | Ephemeral, changes on restart | Stable DNS per pod |
| Use case | Stateless APIs, web servers | Databases, message queues, consensus systems |
| Pod replacement | New pod, new identity | Same ordinal, same PVC, same DNS |
Use a Deployment when your application stores no local state, all pods are identical, and any pod can handle any request. Use a StatefulSet when pods need stable identities, ordered startup or shutdown, or per-pod persistent storage.
That said, for production database workloads, raw StatefulSets alone are rarely sufficient. That is where operators come in.
Why Operators Are Better Than Raw StatefulSets for Databases
A StatefulSet gives you stable identities and persistent storage. It does not give you automated failover, backup scheduling, point-in-time recovery, connection pooling, or rolling upgrades that respect replication lag. These are all essential for running databases in production.
Database operators extend the Kubernetes API with Custom Resource Definitions (CRDs) that encode operational knowledge. Instead of writing shell scripts to handle primary promotion when a pod fails, the operator watches the cluster state and acts autonomously. Here is what a mature database operator handles that raw StatefulSets cannot:
- Automated failover: Detects primary failure and promotes a replica within seconds
- Backup and PITR: Schedules base backups and continuous WAL archiving to object storage
- Replica management: Adds and removes read replicas declaratively
- Configuration management: Applies PostgreSQL/MySQL configuration changes safely with rolling restarts
- Version upgrades: Performs minor and major version upgrades with automated pre-flight checks
- Connection pooling: Integrates PgBouncer or ProxySQL as sidecar containers
- Monitoring integration: Exposes Prometheus metrics endpoints automatically
In our experience, teams that attempt to run production databases with raw StatefulSets inevitably build a bespoke operator over time — just one that is untested, undocumented, and maintained by a single engineer. Use an established operator from the start.
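To make this concrete, here is a sketch of how one of these concerns — connection pooling — is expressed declaratively with CloudNativePG's Pooler resource. The cluster name, instance count, and PgBouncer parameters are illustrative:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: pg-cluster-pooler-rw
spec:
  cluster:
    name: pg-cluster          # existing CloudNativePG Cluster to front
  instances: 2
  type: rw                    # route pooled connections to the primary
  pgbouncer:
    poolMode: transaction
    parameters:
      max_client_conn: "500"
```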
The Database Operator Landscape in 2026
The operator ecosystem has matured considerably. Here is our assessment of the leading options across database engines, based on what we deploy for clients.
PostgreSQL Operators
PostgreSQL has the richest operator ecosystem. Four operators dominate production deployments:
| Operator | Maintainer | HA Approach | Key Differentiator |
|---|---|---|---|
| CloudNativePG | EDB / CNCF | Built-in Instance Manager | CNCF Sandbox project, no Patroni dependency, fastest-growing community (4,300+ GitHub stars) |
| Zalando Postgres Operator | Zalando | Patroni-based | Battle-tested at scale since 2017, but community activity is declining |
| CrunchyData PGO | Crunchy Data | Patroni-based | Enterprise-focused, comprehensive monitoring integration |
| Percona Operator for PostgreSQL | Percona | Patroni-based | Multi-database vendor with unified tooling for PostgreSQL, MySQL, and MongoDB |
Our recommendation: For new deployments in 2026, we default to CloudNativePG. Its architecture eliminates the Patroni dependency, reducing moving parts. It is a CNCF project with strong community momentum, and its declarative backup configuration with native PITR support is production-ready. For organisations already invested in the Percona ecosystem across multiple database engines, the Percona Operator provides a consistent management experience.
MySQL Operators
| Operator | Key Features |
|---|---|
| Percona Operator for MySQL | Synchronous replication, PITR, zero-downtime upgrades, automated backups |
| MySQL Operator for Kubernetes (Oracle) | Official Oracle operator, InnoDB Cluster management |
| Vitess | CNCF Graduated project, horizontal sharding for MySQL, used by YouTube and Slack at massive scale |
For MySQL workloads that need horizontal sharding, Vitess remains the gold standard. For standard MySQL with replication, the Percona Operator offers the most comprehensive feature set.
MongoDB Operators
| Operator | Key Features |
|---|---|
| MongoDB Controllers for Kubernetes (MCK) | Unified replacement for Community + Enterprise operators (launched 2025), sharding support, integrated backups |
| Percona Operator for MongoDB | Incremental physical backups, hidden nodes, PITR, multi-storage support, automated user management |
Note that MongoDB archived its Community Operator in December 2025, replacing it with the unified MCK. If you are still using the legacy operator, plan your migration now.
Redis and Caching
For Redis-compatible workloads, Dragonfly Operator has emerged as a compelling option with automated replication and failover. Redis Enterprise Operator remains available for organisations with existing Redis Enterprise licences.
Distributed SQL (Kubernetes-Native)
Databases designed from the ground up for Kubernetes deserve special mention:
- CockroachDB: Distributed SQL with built-in replication, designed for horizontal scaling by adding pods
- TiDB: NewSQL, MySQL-compatible, proven at scale (Ninja Van case study)
- YugabyteDB: Distributed SQL, PostgreSQL-compatible, strong Kubernetes integration
These databases handle sharding, replication, and failover internally, so their operators can be far simpler — making them excellent fits for Kubernetes.
Storage Considerations for Databases on Kubernetes
Storage is where database-on-Kubernetes deployments succeed or fail. Getting this right is more important than choosing the right operator.
CSI Drivers and StorageClasses
Every major cloud provider offers Container Storage Interface (CSI) drivers for their block storage:
- AWS: EBS CSI Driver (gp3 for general purpose, io2 for high IOPS)
- Azure: Azure Disk CSI Driver (Premium SSD v2 for databases)
- GCP: GCE Persistent Disk CSI Driver (pd-ssd for production)
Define explicit StorageClasses rather than relying on the default. A production database StorageClass should specify the volume type, enable encryption, and allow volume expansion:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted-expandable
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  iops: "6000"
  throughput: "250"
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
```
Set reclaimPolicy: Retain for database volumes. The default Delete policy destroys the underlying volume when a PVC is removed, which is appropriate for ephemeral workloads but catastrophic for databases. Also note that StatefulSets do not delete PVCs on scale-down or deletion — stale PVCs accumulate silently and incur cost. Build automation to audit and clean up orphaned PVCs.
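One way to audit for orphans is to compare the PVCs in a namespace against the claims actually mounted by running pods. A sketch, assuming a `databases` namespace and requiring kubectl and jq:

```shell
NS=databases

# Claims referenced by any pod's volumes
MOUNTED=$(kubectl get pods -n "$NS" -o json \
  | jq -r '.items[].spec.volumes[]?
           | select(.persistentVolumeClaim)
           | .persistentVolumeClaim.claimName' | sort -u)

# Any PVC not in that list is a cleanup candidate
for pvc in $(kubectl get pvc -n "$NS" -o jsonpath='{.items[*].metadata.name}'); do
  echo "$MOUNTED" | grep -qx "$pvc" || echo "orphaned: $pvc"
done
```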
For organisations requiring third-party storage solutions, the landscape includes Portworx (enterprise-grade with application-aware snapshots), Longhorn (open-source simplicity from SUSE), Rook-Ceph (distributed storage for large-scale clusters), and OpenEBS (container-attached storage with the Mayastor engine for high-performance needs).
Backup Strategies and PITR
A persistent volume is not a backup. Volumes can be corrupted, accidentally deleted, or affected by storage-layer failures. Every database on Kubernetes needs a backup strategy that includes:
- Operator-native backups: CloudNativePG and Percona operators support scheduled base backups with continuous WAL/binlog archiving to S3, GCS, or Azure Blob Storage. This enables point-in-time recovery (PITR) to any second within your retention window.
- CSI volume snapshots: Fast, storage-native snapshots via CSI drivers. Useful for quick clones and pre-upgrade checkpoints, but not a substitute for application-consistent backups.
- Velero: Cluster-level backup of Kubernetes resources and PVC snapshots. Essential for disaster recovery at the cluster level, but database-aware operators provide finer-grained recovery.
- Application-consistent hooks: Use pre-snapshot and post-snapshot hooks to quiesce databases (flush writes, create consistent checkpoints) before taking volume snapshots.
We recommend a layered approach: operator-managed PITR as the primary recovery mechanism, CSI snapshots for quick rollbacks, and Velero for full cluster DR. Test your restores monthly — a backup you have never restored is not a backup.
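For the Velero layer, a cluster-level backup can itself be declared as a resource. A minimal sketch, in which the `databases` namespace, the 02:00/03:00-style schedule, and the 30-day TTL are assumptions:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: databases-nightly
  namespace: velero
spec:
  schedule: "0 3 * * *"        # standard five-field cron: daily at 03:00
  template:
    includedNamespaces:
    - databases
    snapshotVolumes: true      # also snapshot PVCs via the CSI/volume plugin
    ttl: 720h                  # retain each backup for 30 days
```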
Decision Framework: Managed Services vs Kubernetes-Native
Not every database belongs on Kubernetes. Here is the framework we use with clients to make this decision.
Choose Managed Services (RDS, Cloud SQL, Atlas) When
- Your team has limited Kubernetes or DBA expertise and cannot invest in building it
- You have simple, predictable database requirements with moderate scale
- You are an early-stage startup where operational simplicity outweighs cost
- Strict compliance requirements are easier to satisfy with managed service certifications
- You are committed to a single cloud provider with no portability needs
Choose Kubernetes-Native Databases When
- Cost optimisation is critical: Managed services typically cost 2x or more compared to equivalent Kubernetes deployments (Percona analysis)
- You run at scale with many database instances: Operator automation pays for itself when managing tens or hundreds of databases
- Multi-cloud or hybrid-cloud portability is a strategic requirement
- Your team has strong Kubernetes and DBA expertise (or is willing to invest in it)
- You need fine-grained control over configuration, upgrade timing, and data locality
- You are already running everything else on Kubernetes and want a unified platform
Databases Well-Suited for Kubernetes
Databases with built-in clustering, sharding, and replication work best: CockroachDB, TiDB, Vitess, MongoDB, and Cassandra handle node membership natively. Caching layers like Redis and Memcached are straightforward. PostgreSQL and MySQL are production-ready with mature operators like CloudNativePG and Percona.
Databases That Remain Challenging
Legacy single-node databases without clustering support, workloads with extreme IOPS requirements where cloud block storage becomes a bottleneck, and databases requiring exotic storage configurations (direct-attached NVMe with shared-nothing architectures) still warrant careful evaluation. For a deeper exploration of cloud-native database selection criteria, see our cloud native database guide.
Cost Comparison: Real Numbers
Cost is often the primary driver for moving databases to Kubernetes. Let us look at concrete figures.
RDS vs Self-Managed PostgreSQL on EKS
Consider a PostgreSQL deployment with 8 vCPUs, 32 GB RAM, 3 TB storage, and high availability:
| Component | AWS RDS (Multi-AZ) | Self-Managed on EKS |
|---|---|---|
| Compute | ~$1,200/month (db.r6g.2xlarge) | ~$550/month (r6g.2xlarge reserved) |
| Storage (3 TB gp3) | ~$900/month | ~$240/month (EBS gp3) |
| Backup storage | ~$200/month | ~$60/month (S3) |
| Operator/tooling | Included | $0 (open-source operators) |
| Monthly total | ~$2,500/month | ~$990/month |
| Annual total | ~$30,000/year | ~$11,880/year |
That is roughly a 60% cost reduction. Simplyblock’s analysis arrives at similar figures, and Percona’s cost calculator shows that clients typically achieve 50% savings within the first year.
However, these numbers do not include the human cost. Self-managed databases require engineering time for setup, monitoring, incident response, and upgrades. At smaller scale (one to three databases), the operational overhead can negate the infrastructure savings. The economics become compelling at scale — once you have the operator expertise and automation in place, adding the tenth or fiftieth database instance is nearly free from an operational perspective.
For organisations looking to optimise Kubernetes spending holistically, our Kubernetes cost optimisation guide covers strategies beyond database workloads.
StatefulSet Patterns and Anti-Patterns
Over hundreds of deployments, we have catalogued the practices that correlate with stable database operations and the mistakes that lead to incidents.
Patterns (Do This)
Set Pod Disruption Budgets (PDBs): Protect database quorum during node maintenance. For a three-node PostgreSQL cluster, set minAvailable: 2 to ensure Kubernetes never evicts more than one pod simultaneously during rolling updates or node drains.
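A minimal PDB for the three-node cluster described above might look like this — the `app: pg-cluster` label is an assumption; match whatever labels your operator applies to database pods:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pg-cluster-pdb
spec:
  minAvailable: 2              # never allow voluntary eviction below quorum
  selector:
    matchLabels:
      app: pg-cluster
```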
Use pod anti-affinity: Spread database replicas across nodes and availability zones. Without this, all three replicas can land on the same node, creating a single point of failure:
```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: pg-cluster
      topologyKey: kubernetes.io/hostname
```
Use topology spread constraints: For even distribution across failure domains, combine anti-affinity with topologySpreadConstraints targeting availability zones.
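A pod-template fragment sketching zone-level spreading for the same cluster (the `app: pg-cluster` label is again an assumption):

```yaml
topologySpreadConstraints:
- maxSkew: 1                                 # zones may differ by at most one replica
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: pg-cluster
```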
Choose the right update strategy: For databases, prefer OnDelete update strategy, which requires you to manually delete pods to trigger updates. This gives you control to verify each replica after upgrade before proceeding. Alternatively, use RollingUpdate with partition set to N-1 for canary upgrades — only the highest-ordinal pod updates first.
Set generous terminationGracePeriodSeconds: Database processes need time to flush writes, complete WAL archiving, and shut down cleanly. The default 30 seconds is often insufficient. We typically set 300-600 seconds for production databases.
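A StatefulSet fragment sketching the two recommendations above — a partitioned canary rollout and a longer grace period (the values are illustrative, not prescriptive):

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2             # with 3 replicas, only pod-2 updates first (canary)
  template:
    spec:
      terminationGracePeriodSeconds: 600   # time to flush writes and archive WAL
```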
Use init containers for initialisation: Schema setup, configuration rendering, and permission fixes belong in init containers, not in the main container’s entrypoint.
Anti-Patterns (Avoid This)
Using Deployments for databases: Pods lose their identity on restart, making it impossible to distinguish primary from replica. This is the most common mistake we encounter. If your team is making this error, our common Kubernetes mistakes guide covers this and other frequent pitfalls.
Storing data in pod-local storage: Without PVCs, all data is lost when a pod restarts. We have seen production databases running on emptyDir volumes — do not let this happen to you.
Ignoring PVC cleanup: StatefulSets do not delete PVCs on scale-down. Scale a cluster from five replicas to three, and you still pay for five volumes. Build automation to identify and remove orphaned PVCs.
Using RollingUpdate without partition: All replicas updating simultaneously can cause split-brain conditions or temporary loss of quorum. Always use partitioned rolling updates for databases.
Skipping backup configuration: PVC persistence is not a backup strategy. Volumes can be corrupted or accidentally deleted. Every database must have operator-managed backups with tested restore procedures.
Insufficient resource requests: Not setting CPU and memory requests (or setting them too low) invites noisy-neighbour problems. Database pods compete with other workloads for resources, leading to unpredictable latency. Always set both requests and limits based on actual workload profiling.
Day-2 Operations: Backup, Monitoring, and Upgrades
Deploying a database is day one. Keeping it running reliably for months and years is where the real work begins.
Backup and Recovery
Configure operator-native backups from day one. With CloudNativePG, a backup schedule looks like this:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-production
spec:
  instances: 3
  backup:
    barmanObjectStore:
      destinationPath: s3://pg-backups/production/
      s3Credentials:
        accessKeyID:
          name: s3-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: s3-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "30d"
```
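The Cluster configuration defines where backups go; a companion ScheduledBackup resource triggers the base backups themselves. A sketch of a nightly schedule, assuming the `pg-production` cluster above:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: pg-production-nightly
spec:
  schedule: "0 0 2 * * *"      # six-field cron (seconds first): daily at 02:00
  backupOwnerReference: self
  cluster:
    name: pg-production
```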
Run restore drills quarterly. Document the restore procedure, measure the time it takes, and ensure it meets your RTO and RPO targets. Our Kubernetes disaster recovery playbook provides a structured approach to DR testing.
Monitoring
Every database operator exposes Prometheus metrics. At minimum, monitor:
- Replication lag: Alert when replicas fall behind the primary
- Connection pool saturation: Alert before connections are exhausted
- Storage usage and growth rate: Alert at 80% capacity with projected time to full
- WAL archiving status: Alert on archiving failures (they indicate backup gaps)
- Query latency (p99): Alert on latency regressions
Integrate these with your existing Kubernetes monitoring stack rather than building separate dashboards for each database.
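As one example of wiring an alert into that stack, here is a sketch of a replication-lag rule using the Prometheus Operator's PrometheusRule CRD. It assumes CloudNativePG's metric naming, and the 30-second threshold is illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pg-replication-alerts
spec:
  groups:
  - name: postgres
    rules:
    - alert: PostgresReplicationLagHigh
      # cnpg_pg_replication_lag is CloudNativePG's lag metric (seconds);
      # substitute your operator's equivalent metric name
      expr: cnpg_pg_replication_lag > 30
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Replica lagging more than 30s behind the primary"
```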
Upgrades
Minor version upgrades (e.g., PostgreSQL 16.2 to 16.3) are typically automated by operators with rolling restarts. Major version upgrades (e.g., PostgreSQL 15 to 16) require more care:
- Take a full backup before starting
- Test the upgrade on a clone of the production database
- Use the operator's built-in upgrade mechanism (CloudNativePG supports in-place major upgrades)
- Monitor replication lag and application errors during the rollout
- Keep the previous backup available for rollback
Plan maintenance windows for major upgrades even if the operator promises zero downtime. We have seen edge cases where application connection handling does not cope gracefully with primary failover during upgrades.
Adoption Statistics: The Industry Has Moved
The CNCF survey data and DoK 2024 Report paint a clear picture of where the industry stands:
- 93% of organisations now use, pilot, or evaluate Kubernetes
- 80% have deployed Kubernetes in production (up from 66% in 2023)
- 72% use Kubernetes for database management
- 67% use Kubernetes for analytics workloads
- 54% run AI/ML workloads on Kubernetes
- 35% cite technical complexity as the top remaining barrier
The shift is not hypothetical. Major organisations — financial institutions, healthcare providers, and technology companies — are running production databases on Kubernetes today. The tooling has caught up with the ambition. CNCF-backed projects like CloudNativePG and Vitess, combined with mature commercial operators from Percona and Crunchy Data, have closed the operational gap that made databases on Kubernetes risky just a few years ago.
As The New Stack reported, Kubernetes has “finally solved its biggest problem” — managing databases. The combination of mature operators, reliable CSI storage, and battle-tested patterns means organisations that avoid running databases on Kubernetes are increasingly the outliers, not the norm.
Getting Started: A Pragmatic Path
If your organisation is considering databases on Kubernetes, here is the path we recommend:
- Start with non-production: Deploy a staging database using CloudNativePG or Percona Operator. Learn the operator's CRDs, backup configuration, and failover behaviour.
- Invest in storage: Define production StorageClasses with encryption, expansion, and appropriate reclaim policies. Benchmark IOPS to ensure your storage tier meets database requirements.
- Build observability first: Set up Prometheus metrics collection and Grafana dashboards for database-specific metrics before going to production.
- Run gameday exercises: Simulate node failures, pod evictions, and storage issues. Verify that the operator handles failover correctly and that your backups restore successfully.
- Migrate incrementally: Start with lower-risk databases (development tools, internal services) before migrating customer-facing production databases.
- Document everything: Runbooks for common operations (scaling, backup restore, major upgrades) should exist before production go-live.
Run Databases on Kubernetes With Confidence
Running databases on Kubernetes is no longer experimental — it is a proven approach used by 72% of organisations managing data workloads. But the difference between a successful deployment and an operational nightmare comes down to architecture decisions made on day one: choosing the right operator, configuring storage correctly, implementing backup and recovery from the start, and building the team expertise to manage it all.
Our team provides comprehensive Kubernetes consulting services to help you:
- Design database architectures on Kubernetes with the right operators, storage tiers, and high-availability configurations
- Migrate from managed services like RDS and Cloud SQL to self-managed Kubernetes-native databases, cutting infrastructure costs by 50% or more
- Implement day-2 operations including automated backups with PITR, monitoring integration, and tested disaster recovery playbooks
- Train your engineering team to operate databases on Kubernetes confidently and independently
We have deployed databases on Kubernetes across AWS EKS, Azure AKS, and Google GKE for organisations ranging from early-stage startups to enterprises managing petabytes of data.