DevOps Monitoring Guide: Observability and Monitoring Tools Implementation [2025]
![DevOps Monitoring Guide: Observability and Monitoring Tools Implementation [2025]](/images/blog/devops/devops-monitoring-guide-hero.webp)
The system goes down, and you only find out when users call to complain. That is the worst-case scenario. A good monitoring system notifies you before users notice anything, so you can start fixing the problem proactively.
Monitoring is a critical component of DevOps: no matter how fast your CI/CD pipeline is, it means little if you can't tell whether the system is healthy after a deployment. If you're not familiar with CI/CD yet, start with our CI/CD Introduction Tutorial. This article takes you from basic concepts to implementation, building a complete monitoring and observability system.
Monitoring vs Observability
These two terms are often used interchangeably, but there are subtle differences.
Traditional Monitoring
Traditional monitoring is about "known knowns"—you pre-define what to monitor, set alert conditions, and wait for triggers.
Characteristics:
- Pre-defined monitoring metrics
- Dashboard displays key data
- Alerts trigger on anomalies
Limitations:
- Can only monitor problems you've anticipated
- Difficult to troubleshoot unknown issues
Observability
Observability is about "unknown unknowns"—the system produces enough data that you can analyze any problem after the fact.
Characteristics:
- System produces complete Metrics, Logs, Traces
- Can answer "why did this problem happen"
- Supports exploratory troubleshooting
Analogy:
- Monitoring is like a car dashboard, showing only preset information
- Observability is like an OBD diagnostic system, allowing you to query any data
To understand where monitoring fits in the overall DevOps process, refer to our DevOps Complete Guide.
Three Pillars of Observability
Observability is built on three types of data: Metrics, Logs, and Traces.
Metrics
Metrics are numerical time-series data that represent system state.
Common Metrics Types:
| Type | Description | Example |
|---|---|---|
| Counter | Only increases, never decreases | Total requests, error count |
| Gauge | Can increase or decrease | CPU usage, memory usage |
| Histogram | Value distribution | Request latency distribution |
| Summary | Percentiles | P99 latency |
Metrics Example (Prometheus format):
```text
# Total requests
http_requests_total{method="GET", path="/api/users"} 12345

# Request latency histogram
http_request_duration_seconds_bucket{le="0.1"} 1000
http_request_duration_seconds_bucket{le="0.5"} 1200
http_request_duration_seconds_bucket{le="1.0"} 1250
```
Logs
Logs are text-based event records that document what happened in the system.
Log Levels:
| Level | Purpose | Example |
|---|---|---|
| DEBUG | Development debugging | Function parameter values |
| INFO | General information | User login successful |
| WARN | Warning but not affecting operation | API response slow |
| ERROR | Error but system still running | Database connection failed (retry successful) |
| FATAL | Severe error, system stops | Unable to start service |
Structured Log Example (JSON):
```json
{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "user-service",
  "trace_id": "abc123",
  "message": "Database connection failed",
  "error": "connection timeout after 30s",
  "db_host": "db.example.com"
}
```
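An entry like this takes only a few lines of plain Node.js to produce. A minimal sketch; the `formatLog` helper and its field set are illustrative, not a standard:

```javascript
// Build one JSON log line; extra fields (trace_id, error, ...) are merged in.
function formatLog(level, service, message, fields = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    service,
    ...fields,
    message,
  });
}

// One log line per event, written to stdout for a log shipper to collect
console.log(formatLog('ERROR', 'user-service', 'Database connection failed', {
  trace_id: 'abc123',
  error: 'connection timeout after 30s',
}));
```

In production, a dedicated library such as pino or winston adds levels, redaction, and performance on top of the same idea.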
Traces
Traces record the complete journey of a request through the system, step by step, from the moment it arrives until the response is returned.
Trace Structure:
```text
Trace ID: abc123
│
├── Span: API Gateway (50ms)
│     └── Span: User Service (30ms)
│           ├── Span: DB Query (15ms)
│           └── Span: Cache Lookup (5ms)
│
└── Total Duration: 50ms
```
Value of Traces:
- Identify where latency occurs across services
- Understand call relationships between microservices
- Locate problems in distributed systems
Illustration: Observability Three Pillars diagram, showing Metrics (chart lines), Logs (text lines), and Traces (connected nodes) supporting an Observability layer.
What Should DevOps Monitor?
Four Golden Signals
The four most important monitoring signals, proposed in Google's SRE book:
| Metric | Description | Health Standard |
|---|---|---|
| Latency | Request latency | P99 < expected value |
| Traffic | Traffic / QPS | Within normal range |
| Errors | Error rate | < 1% |
| Saturation | Resource saturation | < 80% |
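The four signals above can be sketched as PromQL queries. The HTTP metric names match the ones used later in this article; the memory metrics for Saturation assume node_exporter is running:

```promql
# Latency: P99 over the last 5 minutes
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: fraction of memory in use
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
```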
USE Method (Resource Monitoring)
A resource monitoring method proposed by Brendan Gregg:
| Aspect | What to Monitor |
|---|---|
| Utilization | Resource usage rate (CPU, Memory, Disk) |
| Saturation | Queued workload |
| Errors | Number of error events |
RED Method (Service Monitoring)
A method well suited to monitoring request-driven microservices:
| Aspect | What to Monitor |
|---|---|
| Rate | Requests per second |
| Errors | Errors per second |
| Duration | Request latency distribution |
DORA Metrics
Four key metrics defined by DORA (DevOps Research and Assessment):
| Metric | Description | High-Performing Team Standard |
|---|---|---|
| Deployment Frequency | How often to deploy to production | Multiple times per day |
| Lead Time for Changes | Time from commit to deployment | Less than one day |
| Time to Restore Service | Time from failure to recovery | Less than one hour |
| Change Failure Rate | Percentage of deployments causing issues | 0-15% |
These four metrics allow you to quantify the effectiveness of DevOps practices.
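Two of these metrics reduce to simple arithmetic over your deployment history. A toy sketch in Node.js; the record shape is an assumption for illustration, not a DORA-mandated format:

```javascript
// Compute Deployment Frequency and Change Failure Rate from deployment records.
function doraMetrics(deployments, periodDays) {
  const total = deployments.length;
  const failures = deployments.filter((d) => d.causedIncident).length;
  return {
    deploymentsPerDay: total / periodDays,
    changeFailureRate: total === 0 ? 0 : failures / total,
  };
}

const lastWeek = [
  { deployedAt: '2025-01-10', causedIncident: false },
  { deployedAt: '2025-01-11', causedIncident: true },
  { deployedAt: '2025-01-12', causedIncident: false },
  { deployedAt: '2025-01-13', causedIncident: false },
];
console.log(doraMetrics(lastWeek, 7)); // changeFailureRate: 0.25
```

Lead Time and Time to Restore are measured the same way, as durations between timestamps your CI/CD and incident tooling already record.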
Not sure which metrics to monitor? Book an architecture consultation and let us help you design a complete monitoring strategy.
Monitoring Tool Introduction
Tool Landscape
| Category | Tool | Features |
|---|---|---|
| Metrics | Prometheus | Open source standard, Pull-based |
| Metrics | InfluxDB | Time-series database |
| Visualization | Grafana | De facto standard for dashboards |
| Logs | Elasticsearch | Full-text search |
| Logs | Loki | Lightweight, Grafana integration |
| Traces | Jaeger | Uber open source, CNCF project |
| Traces | Zipkin | Twitter open source |
| APM | Datadog | Comprehensive commercial solution |
| APM | New Relic | Comprehensive commercial solution |
Prometheus + Grafana Implementation
This is the most popular open-source monitoring combination.
Architecture Overview
```text
┌─────────────┐      ┌─────────────┐
│    App +    │─────>│ Prometheus  │
│  Exporter   │      │  (Metrics)  │
└─────────────┘      └──────┬──────┘
                            │
                            ▼
                     ┌─────────────┐
                     │   Grafana   │
                     │ (Dashboard) │
                     └─────────────┘
```
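For local experiments, the stack above can be started with Docker Compose. A minimal sketch; image tags, ports, and the config path are assumptions to adapt to your environment:

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
```

Once running, Grafana is reachable on port 3000 and can add Prometheus (http://prometheus:9090) as a data source.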
Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'app'
    static_configs:
      - targets: ['app:8080']
```
Adding Metrics to Applications
Using Node.js as an example:
```javascript
const express = require('express');
const promClient = require('prom-client');

const app = express();

// Create a Metrics registry
const register = new promClient.Registry();

// Collect default process metrics (CPU, memory, event loop, ...)
promClient.collectDefaultMetrics({ register });

// Custom Metrics
const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [register],
});

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.1, 0.3, 0.5, 1, 3, 5, 10],
  registers: [register],
});

// Middleware to record metrics for every request
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer({
    method: req.method,
    path: req.path,
  });
  res.on('finish', () => {
    httpRequestsTotal.inc({
      method: req.method,
      path: req.path,
      status: res.statusCode,
    });
    end(); // observe the elapsed request duration
  });
  next();
});

// Metrics endpoint scraped by Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(8080);
```
PromQL Query Examples
```promql
# Requests per second
rate(http_requests_total[5m])

# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# P99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# CPU usage (%)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```
Illustration: Prometheus + Grafana architecture diagram, showing applications with Exporters feeding Prometheus, which connects to Alertmanager and to Grafana dashboards viewed by users.
Log Management Tools
ELK Stack vs Loki
| Aspect | ELK Stack | Loki |
|---|---|---|
| Full Name | Elasticsearch + Logstash + Kibana | Grafana Loki |
| Architecture | Full-text indexing | Index only Labels |
| Resource Requirements | High | Low |
| Query Capability | Powerful complex queries | Simpler |
| Learning Curve | Steep | Gentle |
| Suitable Scenario | Need complex search | Grafana ecosystem |
Loki Configuration Example
```yaml
# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
  filesystem:
    directory: /loki/chunks
```
LogQL Query Examples
```logql
# Query error logs for a specific service
{service="user-api"} |= "error"

# Using regular expressions
{service="user-api"} |~ "status=[45].."

# Count errors per minute
count_over_time({service="user-api"} |= "error" [1m])
```
Distributed Tracing
In microservices architectures, a single request may pass through multiple services. Distributed tracing helps you understand the complete path of a request.
OpenTelemetry
OpenTelemetry is a CNCF project that provides a unified standard for collecting Metrics, Logs, and Traces.
Node.js Integration Example:
```javascript
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger:4318/v1/traces', // Jaeger's OTLP/HTTP endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

// Start the SDK before loading application code so auto-instrumentation can patch modules
sdk.start();
```
Jaeger Architecture
```text
┌──────────┐
│  App 1   │──┐
└──────────┘  │   ┌──────────────┐     ┌─────────┐     ┌───────────┐
              ├──>│    Jaeger    │────>│ Storage │────>│ Jaeger UI │
┌──────────┐  │   │  Collector   │     └─────────┘     └───────────┘
│  App 2   │──┘   └──────────────┘
└──────────┘
```
For more tool selection and comparisons, refer to DevOps Tools Complete Guide.
Cloud Monitoring Services
Mainstream Cloud Monitoring
| Cloud | Service Name | Features |
|---|---|---|
| AWS | CloudWatch | Deep AWS service integration |
| GCP | Cloud Monitoring | Formerly Stackdriver |
| Azure | Azure Monitor | Application Insights integration |
Comprehensive APM Services
| Service | Features | Suitable Team |
|---|---|---|
| Datadog | Most comprehensive features, widest integration | Medium to large teams |
| New Relic | APM origins, AI alerting | Need deep APM |
| Dynatrace | AI-driven, automation | Enterprise requirements |
Alerting Design Best Practices
Poorly designed alerting leads to "alert fatigue"—too many meaningless alerts cause teams to become numb.
Alerting Design Principles
1. Only Alert on Actionable Events
Every alert should have clear handling steps. If nothing needs to be done after receiving an alert, it shouldn't be an alert.
2. Tiered Alerting
| Level | Condition | Notification Method |
|---|---|---|
| Critical | Affects users, needs immediate action | Phone + Slack |
| Warning | May deteriorate, needs attention | Slack |
| Info | For recording only | Dashboard |
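This tiering maps directly onto Alertmanager routing. A sketch; the receiver names and the Slack webhook are placeholders:

```yaml
# alertmanager.yml (sketch)
route:
  receiver: slack-default            # fallback for anything unmatched
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall               # pages the on-call engineer
    - matchers:
        - severity="warning"
      receiver: slack-default

receivers:
  - name: oncall
    # pagerduty_configs / phone escalation would go here
  - name: slack-default
    slack_configs:
      - channel: "#alerts"
        api_url: "https://hooks.slack.com/services/placeholder"
```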
3. Set Reasonable Thresholds
```yaml
# Prometheus Alert Rules
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P99 latency is {{ $value | humanizeDuration }}"
```
4. Avoid Alert Fatigue
- Aggregate similar alerts
- Set silence periods
- Regularly review and adjust alerts
Monitoring Maturity Model
Five Stages
| Stage | Characteristics | Capability |
|---|---|---|
| Level 1 | Reactive monitoring | Only know about problems from user reports |
| Level 2 | Basic monitoring | Have dashboards, basic alerts |
| Level 3 | Proactive monitoring | Complete Metrics, Logs, tiered alerts |
| Level 4 | Observability | Traces, Root Cause Analysis |
| Level 5 | Predictive | AIOps, anomaly prediction, auto-remediation |
Self-Assessment
How many of the following can your team check off?
- Have dashboards showing system status
- Have alerting mechanism, not relying on user reports
- Alerts are tiered, Critical no more than 5 times per week
- Can trace the complete path of a request
- Have DORA Metrics measurement
- Can locate root cause within 30 minutes
To learn more about how SRE applies these monitoring practices, refer to DevOps vs SRE Comparison. For the complete DevOps skill path, refer to DevOps Learning Roadmap.
Illustration: Monitoring Maturity Model diagram, a five-step staircase from Reactive Monitoring through Basic, Proactive, and Observability up to Predictive.
FAQ
Does the Monitoring System Itself Need Monitoring?
Yes. It's recommended to use a different service to monitor your monitoring system, or use external monitoring services (like UptimeRobot) to monitor whether internal services are alive.
Is Prometheus Suitable for Long-Term Storage?
Prometheus retains data for 15 days by default (adjustable with the `--storage.tsdb.retention.time` flag). For long-term storage, solutions like Thanos or Cortex are recommended.
How Long Should Logs Be Retained?
Depends on compliance requirements. General recommendations:
- Hot storage: 7-30 days
- Warm storage: 90 days
- Cold storage: 1-7 years (depending on regulations)
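Retention can be enforced in the log backend itself. In Loki, for example, a sketch (field names from Loki's compactor and limits configuration):

```yaml
limits_config:
  retention_period: 720h   # 30 days of queryable "hot" logs
compactor:
  retention_enabled: true  # compactor deletes chunks past the retention period
```

Longer-term archives are typically shipped to cheap object storage (warm/cold tiers) before deletion.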
Self-Hosted vs SaaS?
| Aspect | Self-Hosted | SaaS |
|---|---|---|
| Cost | Lower upfront, high maintenance cost | Subscription-based, predictable |
| Flexibility | Highly customizable | Fixed features |
| Maintenance | Requires dedicated staff | No maintenance needed |
| Recommendation | Large teams, compliance needs | Small teams, quick setup |
How to Handle Alert Storms?
- Set up alert aggregation
- Use inhibition rules to suppress related alerts
- Set silence periods to avoid repeated notifications
- Review afterward if alert design is reasonable
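Aggregation and inhibition both live in Alertmanager configuration. A sketch of an inhibition rule that suppresses warning-level alerts from an instance whose NodeDown alert is already firing (alert names are illustrative):

```yaml
inhibit_rules:
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - severity="warning"
    equal: ["instance"]   # only inhibit alerts from the same instance
```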
Conclusion
Monitoring is an indispensable part of DevOps practice. A good monitoring system allows you to:
- Detect problems early: Know about anomalies before users do
- Quickly locate root causes: drill down from Metrics to Logs to Traces
- Quantify improvement effectiveness: Use DORA Metrics to measure DevOps maturity
- Build confidence: Have a clear understanding of system status
Recommended Implementation Order:
1. Establish basic Metrics (Prometheus)
2. Add visualization (Grafana)
3. Design the alerting mechanism
4. Integrate log management
5. Implement distributed tracing
Monitoring is not a one-time project but a continuous improvement process. Start with the most painful problems and gradually improve monitoring capabilities.
Want to build a complete Observability system but don't know where to start? Book an architecture consultation and let us help you design a monitoring architecture, from Metrics all the way to Traces.