
DevOps Monitoring Guide: Observability and Monitoring Tools Implementation [2025]

13 min read
#Monitoring#Observability#Prometheus#Grafana#DevOps#DORA Metrics


The system goes down, and you only find out when users call to complain—this is the worst-case scenario. A good monitoring system should notify you before users discover problems, allowing you to start addressing issues proactively.

Monitoring is a critical component of DevOps. Without monitoring, no matter how fast your CI/CD is, it's meaningless because you don't know if the system is running properly after deployment. If you're not familiar with CI/CD yet, it's recommended to first read CI/CD Introduction Tutorial. This article will guide you from basic concepts to implementation, building a complete monitoring and observability system.


Monitoring vs Observability

These two terms are often used interchangeably, but there are subtle differences.

Traditional Monitoring

Traditional monitoring is about "known knowns"—you pre-define what to monitor, set alert conditions, and wait for triggers.

Characteristics:

  • Pre-defined monitoring metrics
  • Dashboard displays key data
  • Alerts trigger on anomalies

Limitations:

  • Can only monitor problems you've anticipated
  • Difficult to troubleshoot unknown issues

Observability

Observability is about "unknown unknowns"—the system produces enough data that you can analyze any problem after the fact.

Characteristics:

  • System produces complete Metrics, Logs, Traces
  • Can answer "why did this problem happen"
  • Supports exploratory troubleshooting

Analogy:

  • Monitoring is like a car dashboard, showing only preset information
  • Observability is like an OBD diagnostic system, allowing you to query any data

To understand where monitoring fits in the overall DevOps process, refer to our DevOps Complete Guide.


Three Pillars of Observability

Observability is built on three types of data: Metrics, Logs, and Traces.

Metrics

Metrics are numerical time-series data that represent system state.

Common Metrics Types:

Type | Description | Example
Counter | Only increases, never decreases | Total requests, error count
Gauge | Can increase or decrease | CPU usage, memory usage
Histogram | Value distribution | Request latency distribution
Summary | Percentiles | P99 latency

Metrics Example (Prometheus format):

# Total requests
http_requests_total{method="GET", path="/api/users"} 12345

# Request latency histogram
http_request_duration_seconds_bucket{le="0.1"} 1000
http_request_duration_seconds_bucket{le="0.5"} 1200
http_request_duration_seconds_bucket{le="1.0"} 1250
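
The bucket counts above are cumulative: each `le` bucket counts every observation at or below that boundary. That is what lets Prometheus estimate percentiles from them. A minimal sketch of the same linear-interpolation idea in plain JavaScript (the bucket values are the illustrative numbers from the example above, with the final bucket standing in for `+Inf`):

```javascript
// Estimate a quantile from cumulative histogram buckets, mirroring the
// linear interpolation Prometheus uses for percentile estimates.
// Each bucket: upper bound `le` -> cumulative count of observations <= le.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count; // last bucket holds everything
  const rank = q * total; // target observation rank
  let prevLe = 0;
  let prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      // Interpolate linearly within the bucket that contains the rank.
      return prevLe + (le - prevLe) * ((rank - prevCount) / (count - prevCount));
    }
    prevLe = le;
    prevCount = count;
  }
  return prevLe;
}

// Buckets from the example above.
const buckets = [
  { le: 0.1, count: 1000 },
  { le: 0.5, count: 1200 },
  { le: 1.0, count: 1250 },
];

console.log(estimateQuantile(0.99, buckets)); // P99 estimate in seconds
```

With these numbers, the P99 rank (1237.5 of 1250) falls in the 0.5–1.0 bucket, so the estimate interpolates to 0.875s. This is the same calculation the `histogram_quantile` queries later in this article perform on the server side.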

Logs

Logs are text-based event records that document what happened in the system.

Log Levels:

Level | Purpose | Example
DEBUG | Development debugging | Function parameter values
INFO | General information | User login successful
WARN | Warning, operation not affected | API response slow
ERROR | Error, system still running | Database connection failed (retry successful)
FATAL | Severe error, system stops | Unable to start service

Structured Log Example (JSON):

{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "user-service",
  "trace_id": "abc123",
  "message": "Database connection failed",
  "error": "connection timeout after 30s",
  "db_host": "db.example.com"
}
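
A structured logger that emits entries in this shape can be a few lines of plain Node.js. This is a hedged sketch for illustration; in production you would more likely reach for a library such as pino or winston. The field names mirror the example above:

```javascript
// Minimal structured JSON logger: one JSON object per line, so log
// collectors (Loki, Elasticsearch) can parse fields without regexes.
function createLogger(service) {
  const log = (level) => (message, fields = {}) => {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      service,
      message,
      ...fields, // e.g. trace_id, error, db_host
    };
    console.log(JSON.stringify(entry));
    return entry;
  };
  return {
    debug: log('DEBUG'),
    info: log('INFO'),
    warn: log('WARN'),
    error: log('ERROR'),
  };
}

const logger = createLogger('user-service');
logger.error('Database connection failed', {
  trace_id: 'abc123',
  error: 'connection timeout after 30s',
  db_host: 'db.example.com',
});
```

Including `trace_id` in every log line is what later lets you jump from a log entry to the corresponding distributed trace.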

Traces

Traces record the complete journey of a request, from entering the system to response, through every step.

Trace Structure:

Trace ID: abc123
│
├── Span: API Gateway (50ms)
│   └── Span: User Service (30ms)
│       ├── Span: DB Query (15ms)
│       └── Span: Cache Lookup (5ms)
│
└── Total Duration: 50ms

Value of Traces:

  • Identify where latency occurs across services
  • Understand call relationships between microservices
  • Locate problems in distributed systems
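
To make the parent/child span relationship concrete, here is a small sketch that models the trace above as a tree and finds the slowest leaf span. The span names and durations are the illustrative values from the diagram, not a real tracing API:

```javascript
// Model a trace as a tree of spans; durations are from the diagram above.
const trace = {
  name: 'API Gateway', durationMs: 50,
  children: [
    {
      name: 'User Service', durationMs: 30,
      children: [
        { name: 'DB Query', durationMs: 15, children: [] },
        { name: 'Cache Lookup', durationMs: 5, children: [] },
      ],
    },
  ],
};

// Walk the tree and return the leaf span with the longest duration --
// in a real incident, that is the first place to look for latency.
function slowestLeaf(span) {
  if (span.children.length === 0) return span;
  return span.children
    .map(slowestLeaf)
    .reduce((a, b) => (a.durationMs >= b.durationMs ? a : b));
}

console.log(slowestLeaf(trace).name); // DB Query
```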

[Illustration: Observability Three Pillars — Metrics (chart lines), Logs (text lines), and Traces (connected nodes), all feeding into "Observability", with each pillar's purpose labeled. Slug: observability-three-pillars-diagram]


What Should DevOps Monitor?

Four Golden Signals

The four most important monitoring signals, proposed in Google's SRE book:

Metric | Description | Health Standard
Latency | Request latency | P99 < expected value
Traffic | Traffic / QPS | Within normal range
Errors | Error rate | < 1%
Saturation | Resource saturation | < 80%

USE Method (Resource Monitoring)

A resource-monitoring method proposed by Brendan Gregg:

Aspect | What to Monitor
Utilization | Resource usage rate (CPU, memory, disk)
Saturation | Queued workload waiting for the resource
Errors | Number of error events

RED Method (Service Monitoring)

A method well suited to monitoring microservices:

Aspect | What to Monitor
Rate | Requests per second
Errors | Failed requests per second
Duration | Request latency distribution

DORA Metrics

Four key metrics defined by DORA (DevOps Research and Assessment):

Metric | Description | High-Performing Team Standard
Deployment Frequency | How often code is deployed to production | Multiple times per day
Lead Time for Changes | Time from commit to production deployment | Less than one day
Time to Restore Service | Time from failure to recovery | Less than one hour
Change Failure Rate | Percentage of deployments causing issues | 0-15%

These four metrics allow you to quantify the effectiveness of DevOps practices.
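
These definitions are simple enough to compute directly from deployment records. A hedged sketch (the record shape here is invented for illustration; real pipelines would pull this from CI/CD and incident data):

```javascript
// Compute three DORA metrics from a list of deployment records.
// Each record: { committedAt, deployedAt } as epoch ms, plus causedIncident.
function doraMetrics(deployments, periodDays) {
  const failed = deployments.filter((d) => d.causedIncident);
  const leadTimes = deployments.map(
    (d) => (d.deployedAt - d.committedAt) / 3_600_000 // ms -> hours
  );
  return {
    // Deployment Frequency: deploys per day over the period
    deploysPerDay: deployments.length / periodDays,
    // Lead Time for Changes: average commit-to-deploy time in hours
    avgLeadTimeHours: leadTimes.reduce((a, b) => a + b, 0) / leadTimes.length,
    // Change Failure Rate: share of deployments that caused an incident
    changeFailureRate: failed.length / deployments.length,
  };
}

const day = 86_400_000;
const deployments = [
  { committedAt: 0, deployedAt: 4 * 3_600_000, causedIncident: false },
  { committedAt: day, deployedAt: day + 8 * 3_600_000, causedIncident: true },
];
console.log(doraMetrics(deployments, 7));
// avgLeadTimeHours: 6, changeFailureRate: 0.5
```

Time to Restore Service needs incident start/end timestamps rather than deployment records, so it is omitted here.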

Not sure which metrics to monitor? Book an architecture consultation, let us help you design a complete monitoring strategy.


Monitoring Tool Introduction

Tool Landscape

Category | Tool | Features
Metrics | Prometheus | Open-source standard, pull-based
Metrics | InfluxDB | Time-series database
Visualization | Grafana | Best-in-class visualization
Logs | Elasticsearch | Full-text search
Logs | Loki | Lightweight, Grafana integration
Traces | Jaeger | Open-sourced by Uber, CNCF project
Traces | Zipkin | Open-sourced by Twitter
APM | Datadog | Comprehensive commercial solution
APM | New Relic | Comprehensive commercial solution

Prometheus + Grafana Implementation

This is the most popular open-source monitoring combination.

Architecture Overview

┌─────────────┐     ┌─────────────┐
│   App +     │────>│ Prometheus  │
│  Exporter   │     │  (Metrics)  │
└─────────────┘     └──────┬──────┘
                           │
                           ▼
                    ┌─────────────┐
                    │   Grafana   │
                    │ (Dashboard) │
                    └─────────────┘

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'app'
    static_configs:
      - targets: ['app:8080']

Adding Metrics to Applications

Using Node.js with the prom-client library as an example:

const express = require('express');
const promClient = require('prom-client');

const app = express();

// Create Metrics Registry
const register = new promClient.Registry();

// Default Metrics
promClient.collectDefaultMetrics({ register });

// Custom Metrics
const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [register]
});

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.1, 0.3, 0.5, 1, 3, 5, 10],
  registers: [register]
});

// Middleware to record Metrics
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer({
    method: req.method,
    path: req.path
  });

  res.on('finish', () => {
    httpRequestsTotal.inc({
      method: req.method,
      path: req.path,
      status: res.statusCode
    });
    end();
  });

  next();
});

// Metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(8080); // port matches the 'app' scrape target in prometheus.yml

PromQL Query Examples

# Requests per second
rate(http_requests_total[5m])

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# P99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# CPU usage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

[Illustration: Prometheus + Grafana Architecture — applications with Exporters feeding Prometheus Server, which feeds Alertmanager and Grafana, where users view dashboards. Slug: prometheus-grafana-architecture]


Log Management Tools

ELK Stack vs Loki

Aspect | ELK Stack | Loki
Full Name | Elasticsearch + Logstash + Kibana | Grafana Loki
Architecture | Full-text indexing | Indexes only labels
Resource Requirements | High | Low
Query Capability | Powerful, complex queries | Simpler
Learning Curve | Steep | Gentle
Suitable Scenario | Complex search needs | Grafana ecosystem

Loki Configuration Example

# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
  filesystem:
    directory: /loki/chunks

LogQL Query Examples

# Query error logs for specific service
{service="user-api"} |= "error"

# Using regular expressions
{service="user-api"} |~ "status=[45].."

# Count errors per minute
count_over_time({service="user-api"} |= "error" [1m])

Distributed Tracing

In microservices architectures, a single request may pass through multiple services. Distributed tracing helps you understand the complete path of a request.

OpenTelemetry

OpenTelemetry is a CNCF project that provides a unified standard for collecting Metrics, Logs, and Traces.

Node.js Integration Example:

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

Jaeger Architecture

┌──────────────┐     ┌──────────────┐
│    App 1     │────>│   Jaeger     │
└──────────────┘     │   Collector  │
                     └───────┬──────┘
┌──────────────┐             │
│    App 2     │────────────>│
└──────────────┘             │
                             ▼
                     ┌──────────────┐
                     │   Storage    │
                     └───────┬──────┘
                             │
                             ▼
                     ┌──────────────┐
                     │   Jaeger UI  │
                     └──────────────┘
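
For local experiments, the whole pipeline above fits in Jaeger's all-in-one image. A minimal Docker Compose sketch (ports are the Jaeger defaults; storage is in-memory, so traces are lost on restart — pin the image version in real use):

```yaml
# docker-compose.yml -- Jaeger all-in-one for local development.
# Collector, in-memory storage, and UI run in a single container.
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true   # accept OTLP from OpenTelemetry SDKs
    ports:
      - "16686:16686"   # Jaeger UI
      - "4318:4318"     # OTLP over HTTP (the SDK example above sends here)
```

Run `docker compose up -d`, point the OpenTelemetry exporter at port 4318, and browse traces at http://localhost:16686.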

For more tool selection and comparisons, refer to DevOps Tools Complete Guide.


Cloud Monitoring Services

Mainstream Cloud Monitoring

Cloud | Service Name | Features
AWS | CloudWatch | Deep AWS service integration
GCP | Cloud Monitoring | Formerly Stackdriver
Azure | Azure Monitor | Application Insights integration

Comprehensive APM Services

Service | Features | Suitable Team
Datadog | Most comprehensive features, widest integrations | Medium to large teams
New Relic | APM origins, AI alerting | Teams needing deep APM
Dynatrace | AI-driven, automation | Enterprise requirements

Alerting Design Best Practices

Poorly designed alerting leads to "alert fatigue"—too many meaningless alerts cause teams to become numb.

Alerting Design Principles

1. Only Alert on Actionable Events

Every alert should have clear handling steps. If nothing needs to be done after receiving an alert, it shouldn't be an alert.

2. Tiered Alerting

Level | Condition | Notification Method
Critical | Affects users, needs immediate action | Phone + Slack
Warning | May deteriorate, needs attention | Slack
Info | For record keeping only | Dashboard
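
This tier-to-channel mapping is typically expressed as Alertmanager routes. A hedged sketch — the receiver names and the PagerDuty/Slack credentials are placeholders, not values from this article:

```yaml
# alertmanager.yml -- route alerts by their severity label.
route:
  receiver: slack-default            # fallback for anything unmatched
  group_by: ['alertname']
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall     # page the on-call engineer
    - matchers: ['severity="warning"']
      receiver: slack-default        # Slack only, no paging

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: slack-default
    slack_configs:
      - api_url: <slack-webhook-url>
        channel: '#alerts'
```

The `severity` labels come from the alert rules themselves, as in the Prometheus rule examples below.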

3. Set Reasonable Thresholds

# Prometheus Alert Rules
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P99 latency is {{ $value | humanizeDuration }}"

4. Avoid Alert Fatigue

  • Aggregate similar alerts
  • Set silence periods
  • Regularly review and adjust alerts

Monitoring Maturity Model

Five Stages

Stage | Characteristics | Capability
Level 1 | Reactive monitoring | Learn about problems only from user reports
Level 2 | Basic monitoring | Dashboards, basic alerts
Level 3 | Proactive monitoring | Complete Metrics and Logs, tiered alerts
Level 4 | Observability | Traces, root cause analysis
Level 5 | Predictive | AIOps, anomaly prediction, auto-remediation

Self-Assessment

  • Have dashboards showing system status
  • Have alerting mechanism, not relying on user reports
  • Alerts are tiered, Critical no more than 5 times per week
  • Can trace the complete path of a request
  • Have DORA Metrics measurement
  • Can locate root cause within 30 minutes

To learn more about how SRE applies these monitoring practices, refer to DevOps vs SRE Comparison. For the complete DevOps skill path, refer to DevOps Learning Roadmap.

[Illustration: Monitoring Maturity Model — a five-step staircase from Reactive Monitoring through Basic, Proactive, and Observability to Predictive, with key capabilities labeled per stage. Slug: monitoring-maturity-model-levels]


FAQ

Does the Monitoring System Itself Need Monitoring?

Yes. It's recommended to use a different service to monitor your monitoring system, or use external monitoring services (like UptimeRobot) to monitor whether internal services are alive.

Is Prometheus Suitable for Long-Term Storage?

Prometheus defaults to retaining data for only 15 days. For long-term storage, solutions like Thanos or Cortex are recommended.

How Long Should Logs Be Retained?

It depends on your compliance requirements. General recommendations:

  • Hot storage: 7-30 days
  • Warm storage: 90 days
  • Cold storage: 1-7 years (depending on regulations)

Self-Hosted vs SaaS?

Aspect | Self-Hosted | SaaS
Cost | Lower upfront, high maintenance cost | Subscription-based, predictable
Flexibility | Highly customizable | Fixed feature set
Maintenance | Requires dedicated staff | No maintenance needed
Recommended For | Large teams, compliance needs | Small teams, quick setup

How to Handle Alert Storms?

  • Set up alert aggregation
  • Use inhibition rules to suppress related alerts
  • Set silence periods to avoid repeated notifications
  • Review afterward if alert design is reasonable
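
In Alertmanager terms, the first three points map to grouping, inhibition rules, and silences. A hedged configuration sketch (label names are illustrative; silences themselves are created at runtime via the UI or amtool rather than in this file):

```yaml
# alertmanager.yml (excerpt) -- settings that tame alert storms.
route:
  group_by: ['alertname', 'cluster']  # aggregate similar alerts into one notification
  group_wait: 30s        # wait briefly to batch the first alerts of a group
  group_interval: 5m     # minimum gap between batches for the same group
  repeat_interval: 4h    # don't re-send an unresolved alert more often than this

# While a critical alert fires for a cluster, mute its related warnings.
inhibit_rules:
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ['cluster']   # only when both alerts share the same cluster label
```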

Conclusion

Monitoring is an indispensable part of DevOps practice. A good monitoring system allows you to:

  1. Detect problems early: Know about anomalies before users do
  2. Quickly locate root causes: Complete tracing from Metrics to Logs to Traces
  3. Quantify improvement effectiveness: Use DORA Metrics to measure DevOps maturity
  4. Build confidence: Have a clear understanding of system status

Recommended Implementation Order:

  1. First establish basic Metrics (Prometheus)
  2. Add visualization (Grafana)
  3. Design alerting mechanism
  4. Integrate log management
  5. Implement distributed tracing

Monitoring is not a one-time project but a continuous improvement process. Start with the most painful problems and gradually improve monitoring capabilities.

Want to build a complete Observability system but don't know where to start? Book an architecture consultation, let us help you design a monitoring architecture, from Metrics to Traces all at once.
