---
title: Monitoring and Observability
description: Core concepts behind monitoring, alerting, and observability for self-hosted systems
tags:
  - monitoring
  - observability
  - operations
category: infrastructure
created: 2026-03-14
updated: 2026-03-14
---

# Monitoring and Observability

## Summary

Monitoring and observability provide visibility into system health, failure modes, and operational behavior. For self-hosted systems, they turn infrastructure from a black box into an environment that operators can understand and maintain deliberately.

## Why it matters

Without visibility, teams discover failures only after users notice them. Observability reduces diagnosis time, helps verify changes safely, and supports day-two operations such as capacity planning and backup validation.

## Core concepts

- Metrics: numerical measurements over time
- Logs: event records produced by systems and applications
- Traces: records of a single request's path and timing across components
- Alerting: notifications triggered by actionable failure conditions
- Service-level thinking: monitoring what users experience, not only host resource usage
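
To make service-level thinking concrete, the sketch below computes availability from a series of synthetic probes against a user-facing endpoint, rather than from host resource usage. The `ProbeResult` shape, the function names, and the 0.99 target are assumptions chosen for illustration, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProbeResult:
    """One synthetic check against a user-facing endpoint (hypothetical shape)."""
    timestamp: float   # seconds since the epoch
    ok: bool           # did the request succeed from the user's point of view?
    latency_ms: float  # how long the request took


def availability(results: List[ProbeResult]) -> float:
    """Return the fraction of probes that succeeded (1.0 means all succeeded)."""
    if not results:
        raise ValueError("no probe results to evaluate")
    return sum(1 for r in results if r.ok) / len(results)


def meets_objective(results: List[ProbeResult], target: float = 0.99) -> bool:
    """Compare measured availability with a target (0.99 is an illustrative value)."""
    return availability(results) >= target
```

A host can report healthy CPU and memory while this ratio drops, which is the gap that service-level thinking is meant to close.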

## Practical usage

A practical starting point often includes:

- Host metrics from exporters
- Availability checks for critical endpoints
- Dashboards for infrastructure and core services
- Alerts for outages, storage pressure, certificate expiry, and failed backups
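
As a rough illustration of the alert conditions listed above, the following sketch checks storage pressure, certificate expiry, and backup freshness using only the Python standard library. The function names and all thresholds (90% disk usage, 14 days, 26 hours) are assumptions for the example; real deployments would typically express these as rules in their monitoring system.

```python
import shutil
import ssl
import time
from typing import List, Optional


def disk_used_fraction(path: str = "/") -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate expires.

    not_after uses the format found in a certificate's notAfter field,
    for example "Jan  5 09:34:43 2030 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def backup_is_stale(last_success: float, now: float, max_age_hours: float = 26.0) -> bool:
    """True if the last successful backup is older than the allowed age."""
    return (now - last_success) > max_age_hours * 3600


def evaluate(disk_fraction: float, cert_days: float, backup_stale: bool,
             disk_limit: float = 0.9, cert_min_days: float = 14.0) -> List[str]:
    """Return the names of the conditions that currently need attention."""
    alerts = []
    if disk_fraction >= disk_limit:
        alerts.append("storage_pressure")
    if cert_days < cert_min_days:
        alerts.append("certificate_expiry")
    if backup_stale:
        alerts.append("failed_backup")
    return alerts
```

Keeping the measurement functions separate from `evaluate` means the thresholds can be tested and tuned without touching a real disk, certificate, or backup job.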

## Best practices

- Monitor both infrastructure health and service reachability
- Alert only on conditions that require action
- Keep dashboards focused on questions operators actually ask
- Use monitoring data to validate upgrades and incident recovery

## Pitfalls

- Treating dashboards as a substitute for alerts
- Collecting far more data than anyone reviews
- Monitoring only CPU and RAM while ignoring ingress, DNS, and backups
- Sending noisy alerts that train operators to ignore them
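
One common way to reduce noisy alerts is to notify only when a failing condition persists across several consecutive evaluations, similar in spirit to the `for` clause in Prometheus alerting rules. The class below is a minimal sketch of that idea; the name `SustainedCondition` and the default of three evaluations are assumptions for illustration.

```python
class SustainedCondition:
    """Report an alert only after a condition fails N evaluations in a row."""

    def __init__(self, required_consecutive: int = 3):
        if required_consecutive < 1:
            raise ValueError("required_consecutive must be at least 1")
        self.required = required_consecutive
        self.streak = 0  # number of consecutive failing evaluations so far

    def observe(self, failing: bool) -> bool:
        """Record one evaluation and return True if the alert should be active."""
        # A single healthy evaluation resets the streak, so brief blips never fire.
        self.streak = self.streak + 1 if failing else 0
        return self.streak >= self.required
```

For example, with the default of three evaluations, a check that fails twice, recovers once, and then fails again does not notify until three failures occur in a row.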

## References

- [Prometheus overview](https://prometheus.io/docs/introduction/overview/)
- [Prometheus Alertmanager overview](https://prometheus.io/docs/alerting/latest/overview/)
- [Grafana documentation](https://grafana.com/docs/grafana/latest/)