first version of the knowledge base :)

2026-03-14 11:41:54 +01:00
commit 27965301ad
47 changed files with 4356 additions and 0 deletions
--- a/Knowledge/infrastructure/monitoring-and-observability.md
+++ b/Knowledge/infrastructure/monitoring-and-observability.md
@@ -0,0 +1,58 @@
+---
+title: Monitoring and Observability
+description: Core concepts behind monitoring, alerting, and observability for self-hosted systems
+tags:
+  - monitoring
+  - observability
+  - operations
+category: infrastructure
+created: 2026-03-14
+updated: 2026-03-14
+---
+
+# Monitoring and Observability
+
+## Summary
+
+Monitoring and observability provide visibility into system health, failure modes, and operational behavior. For self-hosted systems, they turn infrastructure from a black box into an environment that can be operated and maintained deliberately.
+
+## Why it matters
+
+Without visibility, teams discover failures only after users notice them. Observability reduces diagnosis time, helps verify changes safely, and supports day-two operations such as capacity planning and backup validation.
+
+## Core concepts
+
+- Metrics: numeric measurements sampled over time, such as CPU usage or request latency
+- Logs: timestamped event records produced by systems and applications
+- Traces: records of a single request's path across components
+- Alerting: notifications triggered by actionable failure conditions
+- Service-level thinking: monitoring what users experience, not only host resource usage
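
Service-level thinking can be made concrete by measuring what fraction of user requests succeed, rather than how busy a host is. The following is a minimal sketch under stated assumptions: `RequestWindow`, `availability`, `meets_target`, and the 99.9% default target are all hypothetical names and values chosen for illustration, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts observed over one time window (hypothetical structure)."""
    total: int
    failed: int


def availability(window: RequestWindow) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if window.total == 0:
        return 1.0
    return (window.total - window.failed) / window.total


def meets_target(window: RequestWindow, target: float = 0.999) -> bool:
    """True when measured availability is at or above the assumed target."""
    return availability(window) >= target
```

A host can show low CPU usage while `meets_target` returns `False`, which is the gap that service-level monitoring is meant to expose.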
+
+## Practical usage
+
+A practical starting point often includes:
+
+- Host metrics (CPU, memory, disk, network) from exporters such as the Prometheus Node Exporter
+- Availability checks for critical endpoints
+- Dashboards for infrastructure and core services
+- Alerts for outages, storage pressure, certificate expiry, and failed backups
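
Two of the alert conditions above, storage pressure and certificate expiry, can be checked with the Python standard library alone. This is a rough sketch rather than a replacement for a real exporter: the function names and default values are assumptions made for illustration, and `fetch_not_after` needs network access to the host being checked.

```python
import shutil
import socket
import ssl
import time
from typing import Optional


def disk_used_fraction(path: str = "/") -> float:
    """Return the used share of the filesystem containing `path` (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect over TLS and return the certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds) and the certificate expiry time."""
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400
```

A scheduled job could, for example, raise an alert when `disk_used_fraction` exceeds a chosen threshold or when `days_until_expiry` drops below the time needed to renew a certificate. Both thresholds are operational decisions that this sketch leaves open.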
+
+## Best practices
+
+- Monitor both infrastructure health and service reachability
+- Alert on conditions that require action
+- Keep dashboards focused on questions operators actually ask
+- Use monitoring data to validate upgrades and incident recovery
+
+## Pitfalls
+
+- Treating dashboards as a substitute for alerts
+- Collecting far more data than anyone reviews
+- Monitoring only CPU and RAM while ignoring ingress, DNS, and backups
+- Sending noisy alerts that train operators to ignore them
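
One common way to reduce noisy alerts is to notify only after a failure has persisted across several consecutive checks. Prometheus alerting rules offer a `for` clause with a similar purpose. The class below is a plain-Python analogue, not Prometheus's implementation, and `SustainedAlert` and its parameters are hypothetical names.

```python
class SustainedAlert:
    """Notify once when a check has failed several times in a row."""

    def __init__(self, required_failures: int = 3):
        self.required_failures = required_failures
        self.consecutive_failures = 0
        self.firing = False

    def observe(self, healthy: bool) -> bool:
        """Record one check result; return True only when a notification is due."""
        if healthy:
            # Recovery resets the counter and clears the firing state.
            self.consecutive_failures = 0
            self.firing = False
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.required_failures and not self.firing:
            # Notify once when the threshold is reached, then stay quiet.
            self.firing = True
            return True
        return False
```

A single failed check followed by a success produces no notification, and a sustained outage produces exactly one until the check recovers.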
+
+## References
+
+- [Prometheus overview](https://prometheus.io/docs/introduction/overview/)
+- [Prometheus Alertmanager overview](https://prometheus.io/docs/alerting/latest/overview/)
+- [Grafana documentation](https://grafana.com/docs/grafana/latest/)