DevOps Monitoring Guide: Observability and Monitoring Tools Implementation [2025]
![DevOps Monitoring Guide: Observability and Monitoring Tools Implementation [2025]](/images/blog/devops/devops-monitoring-guide-hero.webp)
The system goes down, and you only find out when users call to complain. That is the worst-case scenario. A good monitoring system notifies you before users notice anything, so you can start fixing the problem proactively.
Monitoring is a critical component of DevOps: no matter how fast your CI/CD pipeline is, it means little if you can't tell whether the system is healthy after a deployment. If you're not familiar with CI/CD yet, start with our CI/CD Introduction Tutorial. This article takes you from basic concepts to implementation, building a complete monitoring and observability system.
Monitoring vs Observability
These two terms are often used interchangeably, but there are subtle differences.
Traditional Monitoring
Traditional monitoring is about "known knowns"—you pre-define what to monitor, set alert conditions, and wait for triggers.
Characteristics:
- Pre-defined monitoring metrics
- Dashboard displays key data
- Alerts trigger on anomalies
Limitations:
- Can only monitor problems you've anticipated
- Difficult to troubleshoot unknown issues
Observability
Observability is about "unknown unknowns"—the system produces enough data that you can analyze any problem after the fact.
Characteristics:
- System produces complete Metrics, Logs, Traces
- Can answer "why did this problem happen"
- Supports exploratory troubleshooting
Analogy:
- Monitoring is like a car dashboard, showing only preset information
- Observability is like an OBD diagnostic system, allowing you to query any data
To understand where monitoring fits in the overall DevOps process, refer to our DevOps Complete Guide.
Three Pillars of Observability
Observability is built on three types of data: Metrics, Logs, and Traces.
Metrics
Metrics are numerical time-series data that represent system state.
Common Metrics Types:
| Type | Description | Example |
|---|---|---|
| Counter | Only increases, never decreases | Total requests, error count |
| Gauge | Can increase or decrease | CPU usage, memory usage |
| Histogram | Value distribution | Request latency distribution |
| Summary | Percentiles | P99 latency |
Metrics Example (Prometheus format):
```text
# Total requests
http_requests_total{method="GET", path="/api/users"} 12345

# Request latency histogram
http_request_duration_seconds_bucket{le="0.1"} 1000
http_request_duration_seconds_bucket{le="0.5"} 1200
http_request_duration_seconds_bucket{le="1.0"} 1250
```
Logs
Logs are text-based event records that document what happened in the system.
Log Levels:
| Level | Purpose | Example |
|---|---|---|
| DEBUG | Development debugging | Function parameter values |
| INFO | General information | User login successful |
| WARN | Warning but not affecting operation | API response slow |
| ERROR | Error but system still running | Database connection failed (retry successful) |
| FATAL | Severe error, system stops | Unable to start service |
Structured Log Example (JSON):
```json
{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "user-service",
  "trace_id": "abc123",
  "message": "Database connection failed",
  "error": "connection timeout after 30s",
  "db_host": "db.example.com"
}
```
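An entry like this takes only a few lines of plain Node.js to produce. A minimal sketch; the `formatLog` helper and its field set are illustrative, not a standard:

```javascript
// Build one JSON log line; extra fields (trace_id, error, ...) are merged in.
function formatLog(level, service, message, fields = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    service,
    ...fields,
    message,
  });
}

// One log line per event, written to stdout for a log shipper to collect
console.log(formatLog('ERROR', 'user-service', 'Database connection failed', {
  trace_id: 'abc123',
  error: 'connection timeout after 30s',
}));
```

In production, a dedicated library such as pino or winston adds levels, redaction, and performance on top of the same idea.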
Traces
Traces record the complete journey of a request through the system, step by step, from the moment it arrives until the response is returned.
Trace Structure:
```text
Trace ID: abc123
│
├── Span: API Gateway (50ms)
│     └── Span: User Service (30ms)
│           ├── Span: DB Query (15ms)
│           └── Span: Cache Lookup (5ms)
│
└── Total Duration: 50ms
```
Value of Traces:
- Identify where latency occurs across services
- Understand call relationships between microservices
- Locate problems in distributed systems
Illustration: Observability Three Pillars diagram, showing Metrics (chart lines), Logs (text lines), and Traces (connected nodes) supporting an Observability layer.
What Should DevOps Monitor?
Four Golden Signals
The four most important monitoring signals, proposed in Google's SRE book:
| Metric | Description | Health Standard |
|---|---|---|
| Latency | Request latency | P99 < expected value |
| Traffic | Traffic / QPS | Within normal range |
| Errors | Error rate | < 1% |
| Saturation | Resource saturation | < 80% |
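The four signals above can be sketched as PromQL queries. The HTTP metric names match the ones used later in this article; the memory metrics for Saturation assume node_exporter is running:

```promql
# Latency: P99 over the last 5 minutes
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: fraction of memory in use
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
```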
USE Method (Resource Monitoring)
A resource monitoring method proposed by Brendan Gregg:
| Aspect | What to Monitor |
|---|---|
| Utilization | Resource usage rate (CPU, Memory, Disk) |
| Saturation | Queued workload |
| Errors | Number of error events |
RED Method (Service Monitoring)
A method well suited to monitoring request-driven microservices:
| Aspect | What to Monitor |
|---|---|
| Rate | Requests per second |
| Errors | Errors per second |
| Duration | Request latency distribution |
DORA Metrics
Four key metrics defined by DORA (DevOps Research and Assessment):
| Metric | Description | High-Performing Team Standard |
|---|---|---|
| Deployment Frequency | How often to deploy to production | Multiple times per day |
| Lead Time for Changes | Time from commit to deployment | Less than one day |
| Time to Restore Service | Time from failure to recovery | Less than one hour |
| Change Failure Rate | Percentage of deployments causing issues | 0-15% |
These four metrics allow you to quantify the effectiveness of DevOps practices.
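Two of these metrics reduce to simple arithmetic over your deployment history. A toy sketch in Node.js; the record shape is an assumption for illustration, not a DORA-mandated format:

```javascript
// Compute Deployment Frequency and Change Failure Rate from deployment records.
function doraMetrics(deployments, periodDays) {
  const total = deployments.length;
  const failures = deployments.filter((d) => d.causedIncident).length;
  return {
    deploymentsPerDay: total / periodDays,
    changeFailureRate: total === 0 ? 0 : failures / total,
  };
}

const lastWeek = [
  { deployedAt: '2025-01-10', causedIncident: false },
  { deployedAt: '2025-01-11', causedIncident: true },
  { deployedAt: '2025-01-12', causedIncident: false },
  { deployedAt: '2025-01-13', causedIncident: false },
];
console.log(doraMetrics(lastWeek, 7)); // changeFailureRate: 0.25
```

Lead Time and Time to Restore are measured the same way, as durations between timestamps your CI/CD and incident tooling already record.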
Not sure which metrics to monitor? Book an architecture consultation and let us help you design a complete monitoring strategy.
Monitoring Tool Introduction
Tool Landscape
| Category | Tool | Features |
|---|---|---|
| Metrics | Prometheus | Open source standard, Pull-based |
| Metrics | InfluxDB | Time-series database |
| Visualization | Grafana | De facto standard for dashboards |
| Logs | Elasticsearch | Full-text search |
| Logs | Loki | Lightweight, Grafana integration |
| Traces | Jaeger | Uber open source, CNCF project |
| Traces | Zipkin | Twitter open source |
| APM | Datadog | Comprehensive commercial solution |
| APM | New Relic | Comprehensive commercial solution |
Prometheus + Grafana Implementation
This is the most popular open-source monitoring combination.
Architecture Overview
```text
┌─────────────┐      ┌─────────────┐
│    App +    │─────>│ Prometheus  │
│  Exporter   │      │  (Metrics)  │
└─────────────┘      └──────┬──────┘
                            │
                            ▼
                     ┌─────────────┐
                     │   Grafana   │
                     │ (Dashboard) │
                     └─────────────┘
```
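For local experiments, the stack above can be started with Docker Compose. A minimal sketch; image tags, ports, and the config path are assumptions to adapt to your environment:

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
```

Once running, Grafana is reachable on port 3000 and can add Prometheus (http://prometheus:9090) as a data source.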
Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'app'
    static_configs:
      - targets: ['app:8080']
```
Adding Metrics to Applications
Using Node.js as an example:
```javascript
const express = require('express');
const promClient = require('prom-client');

const app = express();

// Create a Metrics registry
const register = new promClient.Registry();

// Collect default process metrics (CPU, memory, event loop, ...)
promClient.collectDefaultMetrics({ register });

// Custom Metrics
const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [register],
});

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.1, 0.3, 0.5, 1, 3, 5, 10],
  registers: [register],
});

// Middleware to record metrics for every request
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer({
    method: req.method,
    path: req.path,
  });
  res.on('finish', () => {
    httpRequestsTotal.inc({
      method: req.method,
      path: req.path,
      status: res.statusCode,
    });
    end(); // observe the elapsed request duration
  });
  next();
});

// Metrics endpoint scraped by Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(8080);
```
PromQL Query Examples
```promql
# Requests per second
rate(http_requests_total[5m])

# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# P99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# CPU usage (%)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```
Illustration: Prometheus + Grafana architecture diagram, showing applications with Exporters feeding Prometheus, which connects to Alertmanager and to Grafana dashboards viewed by users.
Log Management Tools
ELK Stack vs Loki
| Aspect | ELK Stack | Loki |
|---|---|---|
| Full Name | Elasticsearch + Logstash + Kibana | Grafana Loki |
| Architecture | Full-text indexing | Index only Labels |
| Resource Requirements | High | Low |
| Query Capability | Powerful complex queries | Simpler |
| Learning Curve | Steep | Gentle |
| Suitable Scenario | Need complex search | Grafana ecosystem |
Loki Configuration Example
```yaml
# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
  filesystem:
    directory: /loki/chunks
```
LogQL Query Examples
```logql
# Query error logs for a specific service
{service="user-api"} |= "error"

# Using regular expressions
{service="user-api"} |~ "status=[45].."

# Count errors per minute
count_over_time({service="user-api"} |= "error" [1m])
```
Distributed Tracing
In microservices architectures, a single request may pass through multiple services. Distributed tracing helps you understand the complete path of a request.
OpenTelemetry
OpenTelemetry is a CNCF project that provides a unified standard for collecting Metrics, Logs, and Traces.
Node.js Integration Example:
```javascript
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger:4318/v1/traces', // Jaeger's OTLP/HTTP endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

// Start the SDK before loading application code so auto-instrumentation can patch modules
sdk.start();
```
Jaeger Architecture
```text
┌──────────┐
│  App 1   │──┐
└──────────┘  │   ┌──────────────┐     ┌─────────┐     ┌───────────┐
              ├──>│    Jaeger    │────>│ Storage │────>│ Jaeger UI │
┌──────────┐  │   │  Collector   │     └─────────┘     └───────────┘
│  App 2   │──┘   └──────────────┘
└──────────┘
```
For more tool selection and comparisons, refer to DevOps Tools Complete Guide.
Cloud Monitoring Services
Mainstream Cloud Monitoring
| Cloud | Service Name | Features |
|---|---|---|
| AWS | CloudWatch | Deep AWS service integration |
| GCP | Cloud Monitoring | Formerly Stackdriver |
| Azure | Azure Monitor | Application Insights integration |
Comprehensive APM Services
| Service | Features | Suitable Team |
|---|---|---|
| Datadog | Most comprehensive features, widest integration | Medium to large teams |
| New Relic | APM origins, AI alerting | Need deep APM |
| Dynatrace | AI-driven, automation | Enterprise requirements |
Alerting Design Best Practices
Poorly designed alerting leads to "alert fatigue"—too many meaningless alerts cause teams to become numb.
Alerting Design Principles
1. Only Alert on Actionable Events
Every alert should have clear handling steps. If nothing needs to be done after receiving an alert, it shouldn't be an alert.
2. Tiered Alerting
| Level | Condition | Notification Method |
|---|---|---|
| Critical | Affects users, needs immediate action | Phone + Slack |
| Warning | May deteriorate, needs attention | Slack |
| Info | For recording only | Dashboard |
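This tiering maps directly onto Alertmanager routing. A sketch; the receiver names and the Slack webhook are placeholders:

```yaml
# alertmanager.yml (sketch)
route:
  receiver: slack-default            # fallback for anything unmatched
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall               # pages the on-call engineer
    - matchers:
        - severity="warning"
      receiver: slack-default

receivers:
  - name: oncall
    # pagerduty_configs / phone escalation would go here
  - name: slack-default
    slack_configs:
      - channel: "#alerts"
        api_url: "https://hooks.slack.com/services/placeholder"
```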
3. Set Reasonable Thresholds
```yaml
# Prometheus Alert Rules
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P99 latency is {{ $value | humanizeDuration }}"
```
4. Avoid Alert Fatigue
- Aggregate similar alerts
- Set silence periods
- Regularly review and adjust alerts
Monitoring Maturity Model
Five Stages
| Stage | Characteristics | Capability |
|---|---|---|
| Level 1 | Reactive monitoring | Only know about problems from user reports |
| Level 2 | Basic monitoring | Have dashboards, basic alerts |
| Level 3 | Proactive monitoring | Complete Metrics, Logs, tiered alerts |
| Level 4 | Observability | Traces, Root Cause Analysis |
| Level 5 | Predictive | AIOps, anomaly prediction, auto-remediation |
Self-Assessment
How many of the following can your team check off?
- Have dashboards showing system status
- Have alerting mechanism, not relying on user reports
- Alerts are tiered, Critical no more than 5 times per week
- Can trace the complete path of a request
- Have DORA Metrics measurement
- Can locate root cause within 30 minutes
To learn more about how SRE applies these monitoring practices, refer to DevOps vs SRE Comparison. For the complete DevOps skill path, refer to DevOps Learning Roadmap.
Illustration: Monitoring Maturity Model diagram, a five-step staircase from Reactive Monitoring through Basic, Proactive, and Observability up to Predictive.
FAQ
Does the Monitoring System Itself Need Monitoring?
Yes. It's recommended to use a different service to monitor your monitoring system, or use external monitoring services (like UptimeRobot) to monitor whether internal services are alive.
Is Prometheus Suitable for Long-Term Storage?
Prometheus retains data for 15 days by default (adjustable with the `--storage.tsdb.retention.time` flag). For long-term storage, solutions like Thanos or Cortex are recommended.
How Long Should Logs Be Retained?
Depends on compliance requirements. General recommendations:
- Hot storage: 7-30 days
- Warm storage: 90 days
- Cold storage: 1-7 years (depending on regulations)
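Retention can be enforced in the log backend itself. In Loki, for example, a sketch (field names from Loki's compactor and limits configuration):

```yaml
limits_config:
  retention_period: 720h   # 30 days of queryable "hot" logs
compactor:
  retention_enabled: true  # compactor deletes chunks past the retention period
```

Longer-term archives are typically shipped to cheap object storage (warm/cold tiers) before deletion.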
Self-Hosted vs SaaS?
| Aspect | Self-Hosted | SaaS |
|---|---|---|
| Cost | Lower upfront, high maintenance cost | Subscription-based, predictable |
| Flexibility | Highly customizable | Fixed features |
| Maintenance | Requires dedicated staff | No maintenance needed |
| Recommendation | Large teams, compliance needs | Small teams, quick setup |
How to Handle Alert Storms?
- Set up alert aggregation
- Use inhibition rules to suppress related alerts
- Set silence periods to avoid repeated notifications
- Review afterward if alert design is reasonable
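Aggregation and inhibition both live in Alertmanager configuration. A sketch of an inhibition rule that suppresses warning-level alerts from an instance whose NodeDown alert is already firing (alert names are illustrative):

```yaml
inhibit_rules:
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - severity="warning"
    equal: ["instance"]   # only inhibit alerts from the same instance
```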
Conclusion
Monitoring is an indispensable part of DevOps practice. A good monitoring system allows you to:
- Detect problems early: Know about anomalies before users do
- Quickly locate root causes: drill down from Metrics to Logs to Traces
- Quantify improvement effectiveness: Use DORA Metrics to measure DevOps maturity
- Build confidence: Have a clear understanding of system status
Recommended Implementation Order:
1. Establish basic Metrics (Prometheus)
2. Add visualization (Grafana)
3. Design the alerting mechanism
4. Integrate log management
5. Implement distributed tracing
Monitoring is not a one-time project but a continuous improvement process. Start with the most painful problems and gradually improve monitoring capabilities.
Want to build a complete Observability system but don't know where to start? Book an architecture consultation and let us help you design a monitoring architecture, from Metrics all the way to Traces.