first version of the knowledge base :)
@@ -0,0 +1,58 @@
---
title: Monitoring and Observability
description: Core concepts behind monitoring, alerting, and observability for self-hosted systems
tags:
- monitoring
- observability
- operations
category: infrastructure
created: 2026-03-14
updated: 2026-03-14
---

# Monitoring and Observability

## Summary

Monitoring and observability provide visibility into system health, failure modes, and operational behavior. For self-hosted systems, they turn infrastructure from a black box into an environment that can be maintained intentionally.

## Why it matters

Without visibility, teams discover failures only after users notice them. Observability reduces diagnosis time, helps verify changes safely, and supports day-two operations such as capacity planning and backup validation.

## Core concepts

- Metrics: numerical measurements over time
- Logs: event records produced by systems and applications
- Traces: request-path visibility across components
- Alerting: notifications triggered by actionable failure conditions
- Service-level thinking: monitoring what users experience, not only host resource usage
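
To make service-level thinking concrete, the following is a minimal sketch of two user-facing indicators computed from request outcomes rather than from host resources. The `RequestSample` shape, the function names, and the choice to count only 5xx responses as failures are assumptions made for illustration, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSample:
    """One observed request outcome (hypothetical shape for illustration)."""
    status_code: int
    duration_seconds: float


def availability(samples: list[RequestSample]) -> float:
    """Fraction of requests that did not end in a server error (5xx)."""
    if not samples:
        raise ValueError("no samples to evaluate")
    good = sum(1 for s in samples if s.status_code < 500)
    return good / len(samples)


def fraction_within(samples: list[RequestSample], threshold_seconds: float) -> float:
    """Fraction of requests answered within the latency threshold."""
    if not samples:
        raise ValueError("no samples to evaluate")
    fast = sum(1 for s in samples if s.duration_seconds <= threshold_seconds)
    return fast / len(samples)
```

A host can report healthy CPU and memory while these two numbers drop, which is why they are worth tracking alongside resource metrics.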

## Practical usage

A practical starting point often includes:

- Host metrics from exporters
- Availability checks for critical endpoints
- Dashboards for infrastructure and core services
- Alerts for outages, storage pressure, certificate expiry, and failed backups
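
Two of the items above, availability checks and certificate expiry, can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions: the function names, the 5-second timeout, and the 14-day warning window are hypothetical choices, and a real deployment would more likely rely on a dedicated prober than on a script like this.

```python
import socket
import ssl
import urllib.request
from datetime import datetime, timezone


def endpoint_is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except OSError:
        # URLError, HTTPError (4xx/5xx), and timeouts are all OSError subclasses.
        return False


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read the 'notAfter' field of the certificate presented by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and a 'notAfter' string such as
    'Mar 24 12:00:00 2026 GMT'."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def expiry_needs_attention(days_left: float, warn_days: float = 14.0) -> bool:
    """True once the certificate is inside the (assumed) warning window."""
    return days_left <= warn_days
```

Keeping the date arithmetic separate from the network calls means the expiry logic can be tested without reaching a live endpoint.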

## Best practices

- Monitor both infrastructure health and service reachability
- Alert on conditions that require action
- Keep dashboards focused on questions operators actually ask
- Use monitoring data to validate upgrades and incident recovery

## Pitfalls

- Treating dashboards as a substitute for alerts
- Collecting far more data than anyone reviews
- Monitoring only CPU and RAM while ignoring ingress, DNS, and backups
- Sending noisy alerts that train operators to ignore them
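
One common way to reduce alert noise is to fire only when a condition has held for several consecutive samples, similar in spirit to the `for` clause in Prometheus alerting rules. The sketch below assumes evenly spaced samples; the function name and parameters are hypothetical and chosen only for illustration.

```python
def should_alert(samples: list[float], threshold: float, min_consecutive: int) -> bool:
    """Return True only if the most recent `min_consecutive` samples all
    exceed `threshold`, so a single short spike does not page anyone."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        # Not enough history yet to call the condition sustained.
        return False
    return all(value > threshold for value in samples[-min_consecutive:])
```

With disk usage percentages as input, `should_alert([70, 95, 72, 71], 90, 3)` stays quiet on the single spike, while `should_alert([88, 91, 93, 96], 90, 3)` fires because the last three samples are all above the threshold.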

## References

- [Prometheus overview](https://prometheus.io/docs/introduction/overview/)
- [Prometheus Alertmanager overview](https://prometheus.io/docs/alerting/latest/overview/)
- [Grafana documentation](https://grafana.com/docs/grafana/latest/)