---
title: Monitoring and Observability
description: Core concepts behind monitoring, alerting, and observability for self-hosted systems
tags:
  - monitoring
  - observability
  - operations
category: infrastructure
created: 2026-03-14
updated: 2026-03-14
---

# Monitoring and Observability

## Summary

Monitoring and observability provide visibility into system health, failure modes, and operational behavior. For self-hosted systems, they turn infrastructure from a black box into an environment that can be maintained intentionally.

## Why it matters

Without visibility, teams discover failures only after users notice them. Observability reduces diagnosis time, helps verify changes safely, and supports day-two operations such as capacity planning and backup validation.

## Core concepts

- Metrics: numerical measurements over time
- Logs: event records produced by systems and applications
- Traces: request-path visibility across components
- Alerting: notifications triggered by actionable failure conditions
- Service-level thinking: monitoring what users experience, not only host resource usage

## Practical usage

A practical starting point often includes:

- Host metrics from exporters
- Availability checks for critical endpoints
- Dashboards for infrastructure and core services
- Alerts for outages, storage pressure, certificate expiry, and failed backups

## Best practices

- Monitor both infrastructure health and service reachability
- Alert on conditions that require action
- Keep dashboards focused on questions operators actually ask
- Use monitoring data to validate upgrades and incident recovery

## Pitfalls

- Treating dashboards as a substitute for alerts
- Collecting far more data than anyone reviews
- Monitoring only CPU and RAM while ignoring ingress, DNS, and backups
- Sending noisy alerts that train operators to ignore them

## References

- [Prometheus overview](https://prometheus.io/docs/introduction/overview/)
- [Prometheus Alertmanager overview](https://prometheus.io/docs/alerting/latest/overview/)
- [Grafana documentation](https://grafana.com/docs/grafana/latest/)
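
## Example: alert conditions as code

The alert categories listed under Practical usage (outages, storage pressure, certificate expiry, and failed backups) can be sketched as a small rule evaluator. This is only an illustration of alerting on actionable conditions, not how Prometheus or Alertmanager evaluate rules. The `Snapshot` type, the `evaluate_alerts` function, the alert names, and every threshold are hypothetical and would need tuning for a real environment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Snapshot:
    """Hypothetical point-in-time readings for one service."""
    endpoint_up: bool                       # result of an availability check
    disk_used_fraction: float               # 0.0 to 1.0
    cert_expires_at: datetime               # timezone-aware expiry time
    last_backup_ok_at: Optional[datetime]   # None if no backup has succeeded


def evaluate_alerts(
    snap: Snapshot,
    now: datetime,
    disk_threshold: float = 0.90,                      # assumed threshold
    cert_warning: timedelta = timedelta(days=14),      # assumed warning window
    backup_max_age: timedelta = timedelta(hours=26),   # assumed backup cadence
) -> List[str]:
    """Return the names of conditions that require operator action."""
    alerts: List[str] = []
    if not snap.endpoint_up:
        alerts.append("outage")
    if snap.disk_used_fraction >= disk_threshold:
        alerts.append("storage_pressure")
    if snap.cert_expires_at - now <= cert_warning:
        alerts.append("certificate_expiry")
    if snap.last_backup_ok_at is None or now - snap.last_backup_ok_at > backup_max_age:
        alerts.append("failed_backup")
    return alerts


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    snap = Snapshot(
        endpoint_up=True,
        disk_used_fraction=0.93,
        cert_expires_at=now + timedelta(days=40),
        last_backup_ok_at=now - timedelta(hours=2),
    )
    print(evaluate_alerts(snap, now))
```

Each rule maps to a condition an operator can act on, and a healthy snapshot yields an empty list, which keeps the signal quiet unless something needs attention.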