The 10-Layer Monitoring Framework That Saved Our Clients From 3am Pages

Amjad Syed - Founder & CEO

I have been called at 3am more times than I would like to admit. A payment system down during Black Friday. A database silently filling up until it crashed. A certificate that expired on a Sunday morning. Each incident taught me something about what we should have been watching.

After a decade of production incidents and setting up monitoring for dozens of clients across startups and enterprises, I have developed a framework that actually works. Most guides tell you to use Prometheus and Grafana. That is not wrong, but it does not tell you what to actually monitor.

This is the 10-layer monitoring framework we implement for our clients. It comes from years of learning what breaks in production and what warning signs to watch for. Every layer here exists because we missed it once and paid the price.

Every environment is different. But these layers cover the fundamentals that apply to most Kubernetes setups, and to non-Kubernetes environments like VMs.

The Layers

We break down monitoring into layers. Each layer answers a different question. If you skip a layer, you will have blind spots.

Here is how we think about it:

  1. System and Infrastructure
  2. Application Performance
  3. HTTP, API, and Real User Monitoring
  4. Database
  5. Cache
  6. Message Queues
  7. Tracing Infrastructure
  8. SSL and Certificates
  9. External Dependencies
  10. Log Patterns and Errors

Let me walk through each one.

See It in Action

We recently used this exact framework to replace Nagios with Prometheus for a B2B company running 400+ servers. They went from delayed periodic checks to real-time monitoring with auto-discovery. Issues that used to take hours to detect now trigger alerts in seconds.

Layer 1: System and Infrastructure

This is the foundation. If the underlying infrastructure is unhealthy, nothing else matters.

We monitor two levels here: the nodes where pods run, and the pods themselves.

Node Level

Your pods run on nodes. If a node is struggling, your pods will struggle too.

We use Prometheus with Node Exporter to collect these metrics:

  • CPU usage and load average
  • Memory usage and available memory
  • Disk usage and disk I/O
  • Network I/O
  • Node up or down status
  • Kubelet health

A common mistake is only watching pod metrics. I remember a 2am call where a client’s e-commerce app kept crashing during a flash sale. Pod metrics looked perfectly fine: CPU normal, memory normal. We spent two hours chasing application bugs that did not exist. Turns out the node had 98% disk usage from accumulated container logs that nobody was rotating, and the app was failing because it could not write temp files. Two hours chasing the wrong problem because we were not watching the node. Never again.
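
To make that concrete, here is a minimal sketch of the kind of node-level disk check that catches this early. It assumes a Prometheus server at http://localhost:9090 scraping Node Exporter; the URL and the 90% threshold are placeholders, not production values.

```python
import requests

# Assumed Prometheus address; adjust for your environment.
PROMETHEUS_URL = "http://localhost:9090"

# Standard Node Exporter filesystem metrics: percentage of root disk used per node.
QUERY = (
    '100 * (1 - node_filesystem_avail_bytes{mountpoint="/"} '
    '/ node_filesystem_size_bytes{mountpoint="/"})'
)

def check_node_disk(threshold_pct: float = 90.0) -> None:
    """Print any node whose root filesystem usage exceeds the threshold."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        instance = result["metric"].get("instance", "unknown")
        used_pct = float(result["value"][1])
        if used_pct >= threshold_pct:
            print(f"WARNING: {instance} root disk at {used_pct:.1f}%")

if __name__ == "__main__":
    check_node_disk()
```

In practice this lives in a Prometheus alert rule, but the same query is handy interactively when you are debugging a node.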

Pod and Container Level

For pods, we track:

  • Pod up or down status
  • Container restart counts
  • Resource requests vs actual usage
  • Whether pods are hitting their memory or CPU limits

Kubernetes Error States

Kubernetes has specific error states that tell you something is wrong. We alert on these:

  • CrashLoopBackOff - Container keeps crashing and restarting. Something is broken in the app or config.
  • ImagePullBackOff - Cannot pull the container image. Registry issue or wrong image name.
  • OOMKilled - Container ran out of memory and got killed. Need to increase limits or fix a memory leak.
  • Pending pods - Pod is stuck and cannot be scheduled. Usually a resource or node selector issue.
  • Evicted pods - Node ran out of resources and kicked out the pod.
  • Failed liveness or readiness probes - App is not responding to health checks.

If you see CrashLoopBackOff in production, you want to know immediately. Not when a user complains.
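
If you want a quick way to see these states outside of a dashboard, here is a minimal sketch using the official Kubernetes Python client. It assumes a local kubeconfig and only inspects container waiting reasons; a script like this is no substitute for proper alert rules, but it shows what the check boils down to.

```python
from kubernetes import client, config

# Waiting reasons we treat as page-worthy; adjust to taste.
BAD_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}

def find_unhealthy_pods() -> None:
    """List pods whose containers are stuck in a known-bad waiting state."""
    config.load_kube_config()  # local kubeconfig; in-cluster would use load_incluster_config()
    v1 = client.CoreV1Api()
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        for status in pod.status.container_statuses or []:
            waiting = status.state.waiting
            if waiting and waiting.reason in BAD_REASONS:
                print(
                    f"{pod.metadata.namespace}/{pod.metadata.name}: "
                    f"{status.name} is {waiting.reason} "
                    f"(restarts: {status.restart_count})"
                )

if __name__ == "__main__":
    find_unhealthy_pods()
```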

Layer 2: Application Performance

System metrics tell you if the infrastructure is healthy. Application metrics tell you if the code is behaving.

This is where APM tools shine. We implement different tools based on client needs and budget:

Full APM platforms (metrics, traces, logs, errors):

  • New Relic - Great free tier with 100GB/month. Good for startups. Shows code-level details, transaction traces, and error tracking out of the box.
  • Datadog APM - Full stack visibility if you are already using Datadog for infrastructure. Gets expensive at scale.
  • Dynatrace - Enterprise grade with AI-powered root cause analysis. Higher price point but less manual setup.
  • SigNoz - Open source full APM alternative. Metrics, traces, and logs in one tool. Self-hosted.
  • Elastic APM - Part of the Elastic stack. Good if you already use Elasticsearch.

Distributed tracing only:

  • Jaeger - Tracing for microservices. Does not do metrics or logs. Pairs well with Prometheus for a complete setup.
  • Zipkin - Similar to Jaeger. Lightweight tracing.
  • Grafana Tempo - Tracing backend that integrates with Grafana. Works with Loki and Prometheus.

Instrumentation:

  • OpenTelemetry - Vendor neutral. Instrument once, send data to any backend.

What we track with APM:

  • Response times for each endpoint
  • Error rates
  • Transaction traces through the system
  • Slow database queries
  • Slow external API calls

The goal is to see what the application is actually doing. When a request is slow, you want to know if it is the code, the database, or an external service.

A trace showing a 3-second database query is more useful than a generic “high latency” alert.
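
To show what instrumentation looks like in code, here is a minimal OpenTelemetry sketch in Python. The service and span names are hypothetical and the console exporter is a stand-in; a real setup would export to whichever backend from the list above you actually use.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider. In a real setup you would use an OTLP exporter
# pointing at your backend instead of printing spans to the console.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def handle_checkout(order_id: str) -> None:
    """Wrap a request in spans so slow steps show up in the trace."""
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("db.load_order"):
            pass  # database call goes here
        with tracer.start_as_current_span("payment.charge"):
            pass  # external payment API call goes here

handle_checkout("order-123")
```

Once the spans exist, the slow database query or slow external call stands out in the trace instead of hiding inside a single “request took 3 seconds” number.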

Layer 3: HTTP, API, and Real User Monitoring

This layer is about understanding what users actually experience. There are three parts to this: synthetic monitoring, API monitoring, and real user monitoring. Each tells you something different.

Synthetic Monitoring (Blackbox Probes)

This is blackbox monitoring with Prometheus. We do not care how the app works internally. We just check if it responds correctly from the outside.

What we probe:

  • Health check endpoints
  • Critical user flows like login, checkout, or search
  • Response status codes
  • Response latency

We run these probes from multiple locations. Your app might work fine from inside the cluster but be unreachable from the internet. I once debugged an issue where the app worked perfectly from our monitoring in AWS but users in Europe could not connect. Turned out a CDN edge node was misconfigured. Multi-region probing caught what single-location monitoring would have missed.

If the health endpoint returns 200, great. If it returns 500 or times out, something is wrong.
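
The idea fits in a few lines. Here is a hedged sketch of a standalone probe; the URL and timeout are placeholders for whatever your blackbox tooling actually records.

```python
import time
import requests

# Hypothetical health endpoint; replace with the flows you actually care about.
TARGET = "https://example.com/healthz"

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers 200 within the timeout."""
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=timeout)
        latency = time.monotonic() - start
        print(f"{url} -> {resp.status_code} in {latency:.2f}s")
        return resp.status_code == 200
    except requests.RequestException as exc:
        print(f"{url} -> probe failed: {exc}")
        return False

if __name__ == "__main__":
    probe(TARGET)
```

Run the same probe from several regions and you get the multi-location view described above.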

API Monitoring and Testing

Blackbox probes tell you if an endpoint is up. API monitoring tells you if it is working correctly. There is a difference between “the endpoint responded” and “the endpoint returned the right data.”

Tools we use:

  • Checkly - Our go-to. Playwright-based synthetic monitoring with API checks. Can test full user flows and API sequences. Great Terraform integration for monitoring-as-code.
  • Runscope (now part of BlazeMeter) - Dedicated API testing and monitoring. Good for complex API workflows with assertions on response data.
  • Postman Monitors - If your team already uses Postman for API development, monitors let you schedule your collections as automated tests.
  • Assertible - Focused on API testing with strong assertion capabilities. Good for contract testing.

What we test:

  • API response structure matches expected schema
  • Authentication flows work end to end
  • Data returned is valid, not just status 200 with empty response
  • API sequences work in order (create, read, update, delete)
  • Error responses return proper error codes and messages

We had a client whose health endpoint returned 200 while the actual API was returning empty arrays for all queries. Synthetic probes said everything was fine. API monitoring with data assertions caught the issue in minutes.
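
A check that catches this kind of failure looks roughly like the sketch below. The /api/products endpoint and its schema are hypothetical; the point is asserting on the body, not just the status code.

```python
import requests

API_URL = "https://example.com/api/products"  # hypothetical endpoint

def check_products_api() -> None:
    """Assert on the response body, not just the status code."""
    resp = requests.get(API_URL, timeout=10)
    assert resp.status_code == 200, f"unexpected status {resp.status_code}"

    data = resp.json()
    # "200 with an empty payload" is exactly the failure mode status probes miss.
    assert isinstance(data, list), f"expected a list, got {type(data).__name__}"
    assert len(data) > 0, "API returned 200 but an empty result set"
    # Spot-check the schema of the first item (fields are placeholders).
    assert {"id", "name", "price"} <= data[0].keys(), "response schema changed"

if __name__ == "__main__":
    check_products_api()
    print("API check passed")
```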

Real User Monitoring (RUM)

Synthetic monitoring tests from servers. RUM shows what actual users experience in their browsers. This is the layer most teams skip, and they should not.

Your API might respond in 50ms, but if the browser takes 4 seconds to render the page because of heavy JavaScript, users are frustrated and you would never know it from server-side metrics.

Tools we recommend:

  • Google Analytics 4 - Free and you probably already have it. GA4 tracks Core Web Vitals out of the box. Not as detailed as dedicated RUM tools but a solid starting point.
  • Datadog RUM - Full featured. Session replay, Core Web Vitals, error tracking. Integrates well if you already use Datadog.
  • New Relic Browser - Similar to Datadog RUM. Good free tier makes it accessible for startups.
  • LogRocket - Session replay with frontend error tracking. See exactly what users saw when something went wrong.
  • Sentry - More error tracking than RUM, but excellent for catching frontend JavaScript errors with full stack traces.
  • SpeedCurve - Focused on performance budgets and Core Web Vitals. Good for teams serious about frontend performance.

What we track:

  • Core Web Vitals (LCP, FID, CLS)
  • Page load times by geography and device
  • JavaScript errors in production
  • User session flows and drop-off points
  • Time to interactive for critical pages

A slow backend shows up in APM. A slow frontend only shows up in RUM. You need both to see the full picture.

Layer 4: Database

Databases cause a lot of production issues. They deserve their own monitoring layer.

What we track:

  • Active connections vs connection pool size
  • Query latency
  • Slow queries
  • Replication lag (if you have replicas)
  • Lock waits and deadlocks
  • Disk and memory usage
  • Database up or down status

Connection pool exhaustion is a classic issue and I have seen it take down production more times than any other database problem. One client had a checkout flow that would randomly fail. No pattern we could find. Turned out a slow third-party API call was holding database connections open for 30 seconds at a time. During traffic spikes, the pool would drain completely. Users would see “unable to process payment” errors while the app waited for a connection that never came. If we had been tracking connection usage, we would have seen the pool creeping toward exhaustion hours before the first user complained.
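
Here is a rough sketch of the kind of headroom check that flags this early, assuming PostgreSQL and the psycopg2 driver. The connection string and the 80% threshold are placeholders; MySQL exposes equivalent counters.

```python
import psycopg2

# Hypothetical connection string; point it at your database with a read-only user.
DSN = "dbname=app user=monitor host=localhost"

def check_connection_headroom(warn_pct: float = 80.0) -> None:
    """Compare current connections against max_connections."""
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM pg_stat_activity;")
            current = cur.fetchone()[0]
            cur.execute("SHOW max_connections;")
            maximum = int(cur.fetchone()[0])
    finally:
        conn.close()

    used_pct = 100.0 * current / maximum
    print(f"{current}/{maximum} connections ({used_pct:.1f}%)")
    if used_pct >= warn_pct:
        print("WARNING: connection headroom is shrinking")

if __name__ == "__main__":
    check_connection_headroom()
```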

Replication lag matters if you read from replicas. A replica that is 30 seconds behind will serve stale data.

Layer 5: Cache

We use Redis heavily. When Redis has issues, the app slows down or fails. Same applies to Memcached or managed services like ElastiCache.

Tools we use:

  • Prometheus Redis Exporter - Exposes all the Redis metrics you need. Memory, connections, hit rates, replication status.
  • Redis INFO command - Built in. We scrape this directly in some setups.
  • Prometheus Memcached Exporter - If you use Memcached instead of Redis.
  • CloudWatch metrics for ElastiCache - If you run managed Redis on AWS, CloudWatch gives you the basics without extra exporters.

What we track:

  • Hit and miss ratio
  • Memory usage
  • Eviction rate
  • Connection count
  • Redis up or down status

A dropping hit ratio means your cache is not working well. Either keys are expiring too fast or you are caching the wrong things.

High eviction rate means Redis is running out of memory and throwing away data. You need more memory or better cache policies.
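
Both numbers come straight out of Redis INFO. A minimal sketch with redis-py, assuming Redis on localhost:6379 and placeholder thresholds:

```python
import redis

def check_cache_health(host: str = "localhost", port: int = 6379) -> None:
    """Compute hit ratio and report evictions from the INFO stats section."""
    r = redis.Redis(host=host, port=port)
    stats = r.info("stats")

    hits = stats["keyspace_hits"]
    misses = stats["keyspace_misses"]
    total = hits + misses
    hit_ratio = hits / total if total else 1.0

    print(f"hit ratio: {hit_ratio:.2%}")
    print(f"evicted keys: {stats['evicted_keys']}")

    if hit_ratio < 0.8:  # placeholder threshold
        print("WARNING: cache hit ratio is low")
    if stats["evicted_keys"] > 0:
        print("NOTE: Redis has evicted keys; check maxmemory and eviction policy")

if __name__ == "__main__":
    check_cache_health()
```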

Layer 6: Message Queues

We use message queues for async processing. If the queue backs up, work is not getting done.

Tools we use:

  • Prometheus Kafka Exporter - Consumer lag, partition offsets, broker status. Essential for Kafka.
  • Burrow - LinkedIn’s Kafka consumer lag checker. More sophisticated lag analysis than the basic exporter.
  • RabbitMQ Prometheus Plugin - Built into RabbitMQ 3.8+. Exposes queue depth, message rates, and connection counts.
  • Prometheus SQS Exporter - For AWS SQS queues. Queue depth and message age.
  • CloudWatch for managed queues - SQS, SNS, and Amazon MQ all expose metrics through CloudWatch.

What we track:

  • Queue depth and lag
  • Consumer lag
  • Messages per second in and out
  • Dead letter queue size
  • Queue up or down status

Consumer lag is the big one. If your consumers cannot keep up with producers, the lag grows. Eventually you have a backlog of thousands of messages and users wondering why their jobs are not processing.

Dead letter queues catch failed messages. If that queue is growing, something is failing repeatedly.
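
For SQS, both signals are one API call per queue. A hedged sketch with boto3; the queue URLs and the backlog threshold are placeholders, and credentials come from your normal AWS configuration.

```python
import boto3

# Hypothetical queue URLs; substitute your own.
MAIN_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs-dlq"

def queue_depth(sqs, queue_url: str) -> int:
    """Return the approximate number of visible messages in a queue."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

def check_queues() -> None:
    sqs = boto3.client("sqs")
    backlog = queue_depth(sqs, MAIN_QUEUE_URL)
    dead = queue_depth(sqs, DLQ_URL)
    print(f"backlog: {backlog}, dead letter: {dead}")
    if backlog > 1000:  # placeholder threshold
        print("WARNING: consumers are falling behind")
    if dead > 0:
        print("WARNING: messages are landing in the dead letter queue")

if __name__ == "__main__":
    check_queues()
```

For Kafka, the equivalent signal is consumer group lag from the Kafka exporter or Burrow.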

Need a second opinion on your setup?

We audit Kubernetes monitoring for free. 30 minutes, no pitch, just honest feedback on what you're missing.

Book a Free Review

Layer 7: Tracing Infrastructure

We run Jaeger for distributed tracing. But Jaeger itself needs monitoring.

What we track:

  • Collector health and up or down status
  • Span ingestion rate
  • Storage backend health
  • Dropped spans

If Jaeger is down, you lose visibility into your traces. You will not know until you need a trace and it is not there.

Dropped spans mean Jaeger cannot keep up. You are losing data.
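
Here is a rough sketch of a check on the collector itself. It assumes the collector’s Prometheus metrics endpoint is reachable; the port and the exact metric names vary between Jaeger versions, so treat both as placeholders.

```python
import requests

# Assumed collector admin/metrics endpoint; port and path depend on your deployment.
COLLECTOR_METRICS_URL = "http://jaeger-collector:14269/metrics"

def check_collector() -> None:
    """Confirm the collector answers and surface any dropped-span counters."""
    try:
        resp = requests.get(COLLECTOR_METRICS_URL, timeout=5)
        resp.raise_for_status()
    except requests.RequestException as exc:
        print(f"collector unreachable: {exc}")
        return

    # Metric names differ across versions; just surface anything that
    # reports dropped spans so a human can look at it.
    for line in resp.text.splitlines():
        if "dropped" in line and "span" in line and not line.startswith("#"):
            print(f"dropped-span metric: {line}")

if __name__ == "__main__":
    check_collector()
```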

Layer 8: SSL and Certificates

Expired certificates cause outages, and every single time it is embarrassing because it was entirely preventable.

What we track:

  • Certificate expiry dates
  • Days until expiry (alert at 30, 14, and 7 days)
  • Domain validation status
  • TLS version

The first alert fires 30 days before expiry, which leaves plenty of time to renew; 14 and 7 days are the escalation points. Some teams wait until 7 days before doing anything. That is asking for trouble if someone is on vacation.
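
This check fits in the Python standard library. A minimal sketch that reports days to expiry for a host; the hostname list and alert thresholds are placeholders.

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(host: str, port: int = 443) -> int:
    """Return the number of days until the host's certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter looks like "Jun  1 12:00:00 2026 GMT"
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    for hostname in ["example.com"]:  # placeholder list of domains
        days = days_until_expiry(hostname)
        level = "PAGE" if days <= 7 else "WARN" if days <= 30 else "OK"
        print(f"{hostname}: {days} days left [{level}]")
```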

Layer 9: External Dependencies

Your app probably depends on services you do not control. Payment providers. Auth services. Third party APIs. CDNs. DNS providers. The list gets long.

I have lost count of how many times a client called us thinking their app was broken when it was actually Stripe having a bad day or Cloudflare having a regional outage. Knowing instantly that the problem is external saves hours of debugging.

Tools we use:

  • StatusGator - Aggregates status pages from hundreds of services into one dashboard. Get Slack alerts when your dependencies have incidents.
  • Instatus - Similar to StatusGator. Clean UI and good integrations.
  • Hyperping - Status page monitoring with incident tracking.
  • Your own monitoring - We also probe external APIs ourselves because status pages are not always accurate or fast to update.

What we track:

  • Response times from external APIs
  • Error rates from external calls
  • Availability of critical third party services
  • Status page updates from dependencies (via aggregators)
  • SSL certificate expiry on third party endpoints we call

When Stripe or Auth0 has issues, you want to know before your users do. Sometimes the problem is not your code. It is a dependency.

We keep a simple Grafana dashboard showing the health of all external services we depend on. One glance tells us if the problem is inside or outside our control.
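
The probes feeding a dashboard like that can be very simple. Here is a hedged sketch using the prometheus_client library; the dependency names, URLs, and port are hypothetical, and a real version would run on a schedule with alert rules behind it.

```python
import time
import requests
from prometheus_client import Gauge, start_http_server

# Hypothetical external dependencies; replace with the services you actually call.
DEPENDENCIES = {
    "payments": "https://payments.example.com/health",
    "auth": "https://auth.example.com/health",
}

LATENCY = Gauge(
    "external_dependency_latency_seconds", "Latency of the last probe", ["dependency"]
)
UP = Gauge("external_dependency_up", "1 if the last probe succeeded", ["dependency"])

def probe_all() -> None:
    """Probe each dependency and record latency and availability."""
    for name, url in DEPENDENCIES.items():
        start = time.monotonic()
        try:
            resp = requests.get(url, timeout=5)
            UP.labels(dependency=name).set(1 if resp.ok else 0)
        except requests.RequestException:
            UP.labels(dependency=name).set(0)
        LATENCY.labels(dependency=name).set(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes these gauges from this port
    while True:
        probe_all()
        time.sleep(30)
```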

Layer 10: Log Patterns and Errors

Metrics tell you something is wrong. Logs tell you what is wrong.

We use centralized log management to monitor:

Error rate changes

  • Sudden spike in 5xx errors
  • Unusual increase in 4xx errors
  • New error types appearing

Specific error patterns

We search for patterns that indicate real problems:

  • “timeout” - Something is taking too long
  • “connection refused” - Cannot connect to a service
  • “deadlock” - Database contention issue
  • “out of memory” - Memory pressure
  • “disk full” - Storage issue
  • “connection pool exhausted” - Need more connections
  • “circuit breaker open” - Downstream service is failing

When we see a spike in timeout errors, we know to look at network or downstream services. When we see deadlock patterns, we check the database.

Pattern matching on logs catches issues that metrics miss.
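
Most log platforms do this with saved queries and alerts, but the core is just pattern counting. A minimal sketch, assuming plain-text log lines on stdin; the patterns mirror the list above.

```python
import re
import sys
from collections import Counter

# Patterns that usually mean a real problem, mirroring the list above.
PATTERNS = {
    "timeout": re.compile(r"\btimeout\b", re.IGNORECASE),
    "connection_refused": re.compile(r"connection refused", re.IGNORECASE),
    "deadlock": re.compile(r"\bdeadlock\b", re.IGNORECASE),
    "out_of_memory": re.compile(r"out of memory", re.IGNORECASE),
    "disk_full": re.compile(r"disk full", re.IGNORECASE),
    "pool_exhausted": re.compile(r"connection pool exhausted", re.IGNORECASE),
    "circuit_breaker": re.compile(r"circuit breaker open", re.IGNORECASE),
}

def count_patterns(lines) -> Counter:
    """Count how often each known-bad pattern appears in the log stream."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

if __name__ == "__main__":
    for name, count in count_patterns(sys.stdin).most_common():
        print(f"{name}: {count}")
```

Pipe a slice of logs into it during an incident and you get a quick tally of which failure mode you are looking at.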

What We Do Not Monitor

Being honest here. We do not monitor everything.

We skip:

  • Per-request logging on high-traffic endpoints (too expensive)
  • Debug level logs in production (too noisy)
  • Metrics with high cardinality that blow up storage costs

There are trade-offs. We accept some blind spots to keep costs reasonable and dashboards usable.

Alerting Philosophy

Not everything needs an alert. We follow a simple rule:

Alert on symptoms, not causes.

High CPU does not always mean a problem. Users getting errors is always a problem.

We have three levels:

  1. Page someone - Users are affected right now. 5xx errors, service down, data loss risk.
  2. Slack notification - Something needs attention today. Certificate expiring in 14 days, disk at 80%.
  3. Just log it - Interesting but not urgent. High CPU that did not cause user impact.

If an alert fires and we do nothing about it, we delete the alert. Alert fatigue is real. Every alert should mean something.

Wrapping Up

This is the framework we have refined over years of implementing monitoring for clients. Ten layers covering infrastructure, applications, dependencies, and errors.

The key is not having fancy tools. It is knowing what questions to ask:

  • Is the infrastructure healthy?
  • Is the application behaving?
  • Can users actually use it?
  • Are the dependencies working?
  • What errors are happening?

If you can answer those questions from your monitoring, you are in good shape.

Happy to answer questions if you have them.


Get the Monitoring Checklist

I have turned this framework into a practical checklist you can use during your next monitoring audit. It includes:

  • All 10 layers with specific metrics to track
  • Tool recommendations for each layer (with free and paid options)
  • Alert threshold suggestions based on what we use in production
  • Common mistakes to avoid at each layer

Get the 10-Layer Monitoring Checklist - No email required.


Need Help Setting This Up?

We implement this monitoring framework for teams running Kubernetes in production. Startups to enterprises, cost-effective to full observability.

Two ways to get started:

  1. Free monitoring audit - Book a 30-minute call and we review your current setup live.

  2. Full implementation - We design and deploy the whole stack: Prometheus, Grafana, alerting, dashboards. Get in touch.
