Seeing the Unseen: Mastering Observability, Monitoring, and Alerting
With just 3 days left until Christmas, it’s time to focus on the core of system reliability — Monitoring, Alerting, and Observability. These three pillars are essential for Site Reliability Engineers (SREs) to ensure systems are available, performant, and error-free during the most critical times of the year.
At first glance, monitoring and observability may seem like the same thing. However, understanding the difference between them — and how to build effective alerting systems — can make the difference between a calm holiday season and a nightmare of outages. In this article, we’ll break down the differences, explore essential metrics to track, review the most important tools, and provide best practices for alerting without causing “alert fatigue.”
1. Difference Between Monitoring and Observability
The terms “monitoring” and “observability” are often used interchangeably, but they are distinct concepts. Monitoring focuses on tracking known failure modes with pre-defined metrics, while observability is about understanding unknown, emergent problems.
Monitoring
Monitoring is the practice of collecting data on known metrics to track system health and performance. It answers questions you know to ask in advance and tracks things like uptime, memory usage, and CPU utilization.
Key Characteristics of Monitoring:
- Relies on pre-defined metrics (e.g., latency, error rates, throughput).
- Provides “what happened” information.
- Uses thresholds and alerts to identify issues.
Example: Setting up a monitoring system to alert you if CPU usage exceeds 90%.
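As a toy illustration, here is a minimal Python sketch of such a check (it assumes the psutil library and a made-up notification step; in practice you would let a system like Prometheus evaluate the threshold and route the alert for you):

```python
import time

import psutil  # assumed dependency for reading host CPU usage

CPU_ALERT_THRESHOLD = 90.0  # percent
CHECK_INTERVAL_SECONDS = 60

def check_cpu_and_alert() -> None:
    """Sample CPU usage and raise an alert when it crosses the threshold."""
    usage = psutil.cpu_percent(interval=1)  # average over a one-second sample
    if usage > CPU_ALERT_THRESHOLD:
        # Stand-in for paging via PagerDuty, Opsgenie, Slack, etc.
        print(f"ALERT: CPU usage at {usage:.1f}% exceeds {CPU_ALERT_THRESHOLD}%")

if __name__ == "__main__":
    while True:
        check_cpu_and_alert()
        time.sleep(CHECK_INTERVAL_SECONDS)
```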
Observability
Observability, on the other hand, is about understanding why something is happening. It’s the ability to ask new questions about the system, even if the questions were not anticipated beforehand. Observability relies on three key pillars — logs, metrics, and traces.
Key Characteristics of Observability:
- Enables exploration of unknown issues.
- Relies on logs, metrics, and distributed traces.
- Focuses on “why it happened” rather than “what happened.”
Example: Using distributed tracing to understand which specific service in a microservice architecture is causing increased latency.
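A minimal sketch of what that can look like with the OpenTelemetry Python SDK (the service and span names are made up for illustration, and the console exporter stands in for a real backend such as Jaeger):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure a tracer that prints spans to the console; in production you would
# export to a tracing backend (Jaeger, Datadog, etc.) instead.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def handle_checkout(order_id: str) -> None:
    # Each span records how long one step took; nested spans reveal which
    # downstream call is responsible for the overall latency.
    with tracer.start_as_current_span("handle_checkout"):
        with tracer.start_as_current_span("query_inventory"):
            ...  # call the inventory service
        with tracer.start_as_current_span("charge_payment"):
            ...  # call the payment service

handle_checkout("order-123")
```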
Summary of Differences:
| Concept | Focus | When Used | Data Sources |
| --- | --- | --- | --- |
| Monitoring | Known issues | Detect & respond to issues | Metrics & alerts |
| Observability | Unknown issues | Investigate root causes | Logs, metrics, traces |
Both observability and monitoring are critical for SREs. While monitoring tells you when something is wrong, observability helps you understand why it’s wrong and how to fix it.
2. Key Metrics SREs Need to Track
The next question is: what should you monitor and observe? Google’s SRE book introduces the “Four Golden Signals” that every SRE should track:
1. Latency
- What it is: The time it takes for a request to be completed.
- Why it matters: High latency affects user experience and signals underlying issues like slow database queries.
- How to measure it: Track average, p95, and p99 latencies for API calls, database queries, and page loads.
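Here is a minimal sketch of instrumenting request latency with the prometheus_client library (the metric name and bucket choices are illustrative); the commented PromQL shows how p95 is then derived from the histogram:

```python
import random
import time

from prometheus_client import Histogram

# Histogram buckets let Prometheus compute p95/p99 with histogram_quantile().
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Time spent handling HTTP requests",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

@REQUEST_LATENCY.time()  # records the duration of every call
def handle_request():
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work

# Example PromQL for p95 latency over the last 5 minutes:
# histogram_quantile(0.95,
#   sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```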
2. Throughput
- What it is: The number of requests the system serves in a given period of time.
- Why it matters: Throughput tells you how well the system handles demand and scales.
- How to measure it: Measure requests per second (RPS) or queries per second (QPS) to track the load the system is serving.
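A minimal sketch of counting requests with prometheus_client so that Prometheus can derive RPS (the metric name is illustrative):

```python
from prometheus_client import Counter

# One increment per handled request; Prometheus derives the rate from it.
HTTP_REQUESTS = Counter("http_requests_total", "Total HTTP requests handled")

def handle_request() -> None:
    HTTP_REQUESTS.inc()
    ...  # real request handling goes here

# Example PromQL for requests per second, averaged over the last 5 minutes:
# sum(rate(http_requests_total[5m]))
```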
3. Errors
- What it is: The percentage or count of failed requests.
- Why it matters: If error rates spike, users experience broken pages, failed logins, or incomplete transactions.
- How to measure it: Track the percentage of failed requests using HTTP response codes (like 500 errors) or business-specific logic (like failed payments).
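For business-specific errors such as failed payments, a sketch along these lines works (the metric names and helper function are hypothetical):

```python
from prometheus_client import Counter

# Count payment attempts and failures separately so a failure ratio
# can be derived in PromQL.
PAYMENTS_TOTAL = Counter("payments_total", "Payment attempts")
PAYMENT_FAILURES = Counter("payment_failures_total", "Failed payment attempts")

def charge_card(order_id: str) -> None:
    PAYMENTS_TOTAL.inc()
    try:
        ...  # call the payment provider
    except Exception:
        PAYMENT_FAILURES.inc()
        raise

# Example PromQL for the payment failure ratio over the last 5 minutes:
# sum(rate(payment_failures_total[5m])) / sum(rate(payments_total[5m]))
```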
4. Saturation
- What it is: The degree to which system resources (CPU, memory, disk) are used.
- Why it matters: Systems close to full capacity become unstable or unresponsive.
- How to measure it: Track CPU utilization, memory usage, disk I/O, and database connection pool saturation.
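A rough sketch of exporting saturation gauges with prometheus_client and psutil; in practice you would usually rely on node_exporter or your cloud provider’s agent rather than hand-rolling this:

```python
import psutil  # assumed dependency for host-level resource stats
from prometheus_client import Gauge

CPU_USAGE = Gauge("host_cpu_usage_percent", "Host CPU utilization in percent")
MEMORY_USAGE = Gauge("host_memory_usage_percent", "Host memory utilization in percent")

def collect_saturation_metrics() -> None:
    """Refresh the gauges; call this periodically from the exporting process."""
    CPU_USAGE.set(psutil.cpu_percent(interval=1))
    MEMORY_USAGE.set(psutil.virtual_memory().percent)
```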
These four signals provide a comprehensive view of system health, helping SREs pinpoint issues quickly.
3. Tools for Monitoring and Observability
To track the four golden signals and achieve observability, you need the right tools. Here are some of the most popular options:
1. Prometheus
- Use Case: Metrics collection and alerting.
- Why It’s Great: Prometheus scrapes data from services and allows you to set alerts based on defined thresholds.
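For example, an application typically exposes a /metrics endpoint that Prometheus scrapes on its own schedule; with the prometheus_client library that can be as small as this (the metric name and port are illustrative):

```python
import time

from prometheus_client import Counter, start_http_server

HEARTBEATS = Counter("app_heartbeats_total", "Number of heartbeat loops completed")

if __name__ == "__main__":
    # Expose all registered metrics at http://localhost:8000/metrics so a
    # Prometheus server can scrape them.
    start_http_server(8000)
    while True:
        HEARTBEATS.inc()
        time.sleep(15)
```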
2. Grafana
- Use Case: Visualization and dashboards.
- Why It’s Great: Works with Prometheus, Datadog, InfluxDB, and many other data sources to visualize metrics in real time.
3. Datadog
- Use Case: Full-stack observability, including metrics, traces, and logs.
- Why It’s Great: A comprehensive SaaS platform that brings metrics, logs, and traces together in one place, with built-in dashboards and alerting.
4. Jaeger
- Use Case: Distributed tracing.
- Why It’s Great: It’s open-source and provides detailed tracing data, essential for microservices.
4. How to Create Effective Alerting Systems
Alerting is essential for incident response, but too many alerts can overwhelm on-call engineers. Here’s how to create an effective alerting strategy.
1. Prioritize Critical Alerts
Not every alert requires immediate action. Use a severity-based system, like “P1” for critical outages and “P3” for non-urgent issues.
2. Avoid Alert Fatigue
Alert fatigue occurs when engineers are bombarded with low-priority alerts. To avoid it:
- Set alert thresholds carefully (don’t alert on minor blips).
- Use deduplication to avoid multiple alerts for the same issue (see the sketch after this list).
- Send alerts only to the people who can resolve them.
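As a toy illustration of deduplication, here is a minimal in-memory sketch (real alerting stacks such as Alertmanager or PagerDuty handle grouping and cooldowns for you):

```python
import time

# Minimal in-memory deduplication: the same alert key is re-sent only after a
# cooldown window, so a flapping check doesn't page the on-call repeatedly.
COOLDOWN_SECONDS = 15 * 60
_last_sent = {}  # alert key -> timestamp of the last notification

def should_notify(alert_key, now=None):
    """Return True if this alert has not fired within the cooldown window."""
    now = time.time() if now is None else now
    last = _last_sent.get(alert_key)
    if last is not None and now - last < COOLDOWN_SECONDS:
        return False
    _last_sent[alert_key] = now
    return True

# Deduplicate on a stable key such as "<alert name>/<host>".
if should_notify("HighCPU/db-prod-01"):
    print("page the on-call engineer")
```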
3. Use Contextual Alerts
Alerts without context are confusing. Instead of “CPU usage high,” add details like “CPU usage on db-prod-01 has been above 95% for the last 10 minutes.”
4. Escalate When Necessary
If a P1 alert goes unanswered, escalate it to a manager or another on-call engineer.
Conclusion: Monitoring, Alerting, and Observability are the Backbone of SRE
With 3 days left until Christmas, it’s essential to make sure your monitoring, alerting, and observability systems are ready to handle holiday traffic. These three pillars give SREs the insight they need to detect, diagnose, and resolve system issues before they become outages.
By tracking the four golden signals (latency, throughput, errors, and saturation), using tools like Prometheus, Grafana, and Datadog, and avoiding alert fatigue with a well-designed alerting strategy, you’ll ensure a reliable, joyful holiday experience for your users.
Tomorrow, we’ll unwrap another critical SRE concept: Capacity Planning and Scaling Systems Reliably. Stay tuned, and may your alerts be actionable, your dashboards clear, and your holidays outage-free.