first version of the knowledge base :)

2026-03-14 11:41:54 +01:00
commit 27965301ad
47 changed files with 4356 additions and 0 deletions
--- a/Systems/observability/monitoring-stack-architecture.md
+++ b/Systems/observability/monitoring-stack-architecture.md
@@ -0,0 +1,67 @@
+---
+title: Monitoring Stack Architecture
+description: Reference architecture for a monitoring stack in a self-hosted or homelab environment
+tags:
+  - monitoring
+  - observability
+  - architecture
+category: systems
+created: 2026-03-14
+updated: 2026-03-14
+---
+
+# Monitoring Stack Architecture
+
+## Summary
+
+A monitoring stack architecture defines how metrics, probes, dashboards, and alerts fit together. In self-hosted environments, the stack should stay small enough to operate but broad enough to cover infrastructure, ingress, and critical services.
+
+## Why it matters
+
+Monitoring that is bolted on late often misses the services operators actually depend on. A planned stack architecture makes it easier to understand where signals come from and how alerts reach the right people.
+
+## Core concepts
+
+- Collection: exporters and scrape targets
+- Storage and evaluation: Prometheus
+- Visualization: Grafana
+- Alert routing: Alertmanager
+- External validation: `blackbox_exporter` or equivalent endpoint checks
+
+## Practical usage
+
+Typical architecture:
+
+```text
+Hosts and services -> Exporters / probes -> Prometheus
+Prometheus -> Grafana dashboards
+Prometheus -> Alertmanager -> notification channel
+```
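+
+As a rough illustration of this wiring, a minimal `prometheus.yml` might look like the sketch below. Host names, ports, and the probed URL are placeholders assumed for the example, not values from a real deployment. The relabeling block follows the usual `blackbox_exporter` pattern of passing the probed target as a URL parameter.
+
+```yaml
+# Sketch only: every host name and URL below is a placeholder.
+global:
+  scrape_interval: 30s
+  evaluation_interval: 30s
+
+# Alerting rules are loaded from separate files.
+rule_files:
+  - rules/*.yml
+
+# Prometheus -> Alertmanager
+alerting:
+  alertmanagers:
+    - static_configs:
+        - targets: ["alertmanager.example.internal:9093"]
+
+scrape_configs:
+  # Hosts -> node_exporter -> Prometheus
+  - job_name: node
+    static_configs:
+      - targets: ["host1.example.internal:9100"]
+
+  # Services -> blackbox probe -> Prometheus
+  - job_name: blackbox_http
+    metrics_path: /probe
+    params:
+      module: [http_2xx]
+    static_configs:
+      - targets: ["https://service.example.com"]
+    relabel_configs:
+      - source_labels: [__address__]
+        target_label: __param_target
+      - source_labels: [__param_target]
+        target_label: instance
+      - target_label: __address__
+        replacement: blackbox.example.internal:9115
+```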
+
+Recommended coverage:
+
+- Host metrics for compute and storage systems
+- Endpoint checks for user-facing services
+- Backup freshness and certificate expiry
+- Platform services such as DNS, reverse proxy, and identity provider
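+
+Some of these coverage items could be expressed as Prometheus alerting rules along the following lines. This is a sketch under assumptions: the thresholds are illustrative, `probe_success` and `probe_ssl_earliest_cert_expiry` come from `blackbox_exporter`, and `backup_last_success_timestamp_seconds` is a hypothetical metric that a backup job would have to publish itself (for example through the `node_exporter` textfile collector).
+
+```yaml
+# Sketch only: thresholds and the backup metric name are assumptions.
+groups:
+  - name: coverage
+    rules:
+      - alert: EndpointDown
+        expr: probe_success == 0
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Probe failing for {{ $labels.instance }}"
+
+      - alert: CertificateExpiresSoon
+        # Fires when the earliest certificate in the chain expires within 14 days.
+        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
+        for: 1h
+        labels:
+          severity: warning
+
+      - alert: BackupStale
+        # Hypothetical metric: Unix timestamp of the last successful backup.
+        expr: time() - backup_last_success_timestamp_seconds > 2 * 86400
+        labels:
+          severity: critical
+```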
+
+## Best practices
+
+- Monitor the path users depend on, not only the host underneath it
+- Keep the monitoring stack itself backed up and access-controlled
+- Alert on actionable failures rather than every threshold crossing
+- Document ownership for critical alerts and dashboards
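+
+One way to make ownership and actionability visible in configuration is to route on labels in Alertmanager. The fragment below is a sketch: the `owner` label, the receiver names, and the webhook URL are hypothetical, and it assumes alerting rules attach `severity` and `owner` labels.
+
+```yaml
+# Sketch only: receiver names, the owner label, and the URL are placeholders.
+route:
+  receiver: default
+  group_by: [alertname, instance]
+  routes:
+    # Critical alerts owned by the platform team go to the on-call channel.
+    - matchers:
+        - severity="critical"
+        - owner="platform"
+      receiver: platform-oncall
+
+receivers:
+  # Catch-all receiver with no notification integration configured.
+  - name: default
+  - name: platform-oncall
+    webhook_configs:
+      - url: "https://notify.example.internal/hook"
+```
+
+Alerts that match no specific route fall through to `default`, which keeps low-priority threshold crossings out of the on-call channel.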
+
+## Pitfalls
+
+- Monitoring only CPU and memory while ignoring ingress and backups
+- Running a complex stack with no retention policy or alert review process
+- Depending on dashboards alone for outage detection
+- Forgetting to monitor the monitoring components themselves
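+
+To guard against the last pitfall, a common pattern is an always-firing "watchdog" alert combined with `up` checks on the monitoring jobs. The rules below are a sketch: the job names are assumptions about how the scrape jobs are named, and the watchdog only helps if an external receiver notices when the alert stops arriving.
+
+```yaml
+# Sketch only: job names are assumptions about the local scrape configuration.
+groups:
+  - name: meta-monitoring
+    rules:
+      - alert: Watchdog
+        # Always fires; its absence downstream means the alert pipeline is broken.
+        expr: vector(1)
+        labels:
+          severity: none
+
+      - alert: MonitoringTargetDown
+        expr: up{job=~"prometheus|alertmanager|grafana"} == 0
+        for: 5m
+        labels:
+          severity: critical
+```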
+
+## References
+
+- [Prometheus overview](https://prometheus.io/docs/introduction/overview/)
+- [Prometheus Alertmanager overview](https://prometheus.io/docs/alerting/latest/overview/)
+- [Prometheus `node_exporter`](https://github.com/prometheus/node_exporter)
+- [Grafana documentation](https://grafana.com/docs/grafana/latest/)