first version of the knowledge base :)
30 - Systems/observability/monitoring-stack-architecture.md
@@ -0,0 +1,67 @@
---
title: Monitoring Stack Architecture
description: Reference architecture for a monitoring stack in a self-hosted or homelab environment
tags:
- monitoring
- observability
- architecture
category: systems
created: 2026-03-14
updated: 2026-03-14
---

# Monitoring Stack Architecture

## Summary

A monitoring stack architecture defines how metrics, probes, dashboards, and alerts fit together. In self-hosted environments, the stack should stay small enough to operate but broad enough to cover infrastructure, ingress, and critical services.

## Why it matters

Monitoring that is bolted on late often misses the services operators actually depend on. A planned stack architecture makes it easier to understand where signals come from and how alerts reach the right people.

## Core concepts

- Collection: exporters and scrape targets
- Storage and evaluation: Prometheus
- Visualization: Grafana
- Alert routing: Alertmanager
- External validation: `blackbox_exporter` or equivalent endpoint checks

## Practical usage

Typical architecture:

```text
Hosts and services -> Exporters / probes -> Prometheus
Prometheus -> Grafana dashboards
Prometheus -> Alertmanager -> notification channel
```
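The last arrow in the diagram, from Alertmanager to a notification channel, is a routing decision based on alert labels. The sketch below is a deliberately simplified model of that idea (first matching route wins, otherwise a default receiver). It is not Alertmanager's actual implementation, which uses a configurable routing tree; `Route`, `pick_receiver`, and the receiver names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    """A hypothetical route: send to `receiver` if all `match` labels agree."""
    receiver: str
    match: dict = field(default_factory=dict)


def pick_receiver(alert_labels, routes, default_receiver):
    """Return the receiver of the first route whose labels all match."""
    for route in routes:
        if all(alert_labels.get(k) == v for k, v in route.match.items()):
            return route.receiver
    return default_receiver


if __name__ == "__main__":
    routes = [
        Route("pager", {"severity": "critical"}),
        Route("email", {"team": "storage"}),
    ]
    alert = {"alertname": "BackupStale", "severity": "warning", "team": "storage"}
    print(pick_receiver(alert, routes, "chat"))  # email
```

Keeping the route list short and ordered from most to least urgent makes it easier to reason about where a given alert will land.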

Recommended coverage:

- Host metrics for compute and storage systems
- Endpoint checks for user-facing services
- Backup freshness and certificate expiry
- Platform services such as DNS, reverse proxy, and identity provider
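Backup freshness and certificate expiry both reduce to comparing a timestamp against the current time. The following sketch shows one possible way to compute each using only the Python standard library. The helper names (`days_until_expiry`, `backup_age_seconds`, `is_stale`) are assumptions for illustration, and treating a file's modification time as the time of the last backup is itself an assumption that may not hold for every backup tool.

```python
import os
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now):
    """Days from `now` (Unix seconds) until a certificate's notAfter time.

    `not_after` uses the string format returned in the "notAfter" field of
    ssl.SSLSocket.getpeercert(), e.g. "Jan  5 09:34:43 2018 GMT".
    """
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def backup_age_seconds(path, now):
    """Seconds since the backup artifact at `path` was last modified."""
    return now - os.path.getmtime(path)


def is_stale(age_seconds, max_age_seconds):
    """True when the artifact is older than the allowed maximum age."""
    return age_seconds > max_age_seconds


if __name__ == "__main__":
    remaining = days_until_expiry("Jan  5 09:34:43 2018 GMT", time.time())
    print(f"certificate days remaining: {remaining:.1f}")
```

Either value can be exported as a gauge so that the alert threshold lives in the alerting rules rather than in the check itself.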

## Best practices

- Monitor the path users depend on, not only the host underneath it
- Keep the monitoring stack itself backed up and access controlled
- Alert on actionable failures rather than every threshold crossing
- Document ownership for critical alerts and dashboards
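One common way to avoid alerting on every threshold crossing is to require that a condition holds continuously for some time before it fires, similar in spirit to the `for` clause in Prometheus alerting rules. The sketch below illustrates that idea in plain Python; `should_fire` and its parameters are hypothetical and do not reproduce how Prometheus evaluates rules internally.

```python
def should_fire(samples, threshold, hold_seconds):
    """Decide whether an alert should fire as of the last sample.

    `samples` is a list of (timestamp_seconds, value) pairs in ascending
    time order. The alert fires only if the value has stayed above
    `threshold` without interruption for at least `hold_seconds`.
    """
    breach_start = None
    last_ts = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None  # condition cleared, reset the timer
        last_ts = ts
    return breach_start is not None and last_ts - breach_start >= hold_seconds


if __name__ == "__main__":
    spike = [(0, 50), (60, 99), (120, 50)]
    sustained = [(t, 95) for t in range(0, 360, 60)]
    print(should_fire(spike, 90, 300))      # False
    print(should_fire(sustained, 90, 300))  # True
```

A short spike never fires under this scheme, while a sustained breach does, which keeps notifications tied to conditions someone can act on.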

## Pitfalls

- Monitoring only CPU and memory while ignoring ingress and backups
- Running a complex stack with no retention or alert review policy
- Depending on dashboards alone for outage detection
- Forgetting to monitor the monitoring components themselves

## References

- [Prometheus overview](https://prometheus.io/docs/introduction/overview/)
- [Prometheus Alertmanager overview](https://prometheus.io/docs/alerting/latest/overview/)
- [Prometheus `node_exporter`](https://github.com/prometheus/node_exporter)
- [Grafana documentation](https://grafana.com/docs/grafana/latest/)