---
title: Service Monitoring
description: Guide to building a basic monitoring stack for self-hosted services and infrastructure
tags:
  - monitoring
  - self-hosting
  - observability
category: self-hosting
created: 2026-03-14
updated: 2026-03-14
---

# Service Monitoring

## Introduction

Monitoring turns a self-hosted environment from a collection of services into an operable system. At minimum, that means collecting metrics, checking service availability, and alerting on failures that need human action.

## Purpose

This guide focuses on:

- Host and service metrics
- Uptime checks
- Dashboards and alerting
- Monitoring coverage for common homelab services

## Architecture Overview

A small monitoring stack often includes:

- Prometheus for scraping metrics
- Exporters such as `node_exporter` for host metrics
- Blackbox probing for endpoint availability
- Grafana for dashboards
- Alertmanager for notifications

Typical flow:

```text
Exporter or target -> Prometheus -> Grafana dashboards
Prometheus alerts -> Alertmanager -> notification channel
```

## Step-by-Step Guide

### 1. Start with host metrics

Install `node_exporter` on important Linux hosts, or run it in a controlled containerized setup.

### 2. Scrape targets from Prometheus

Example scrape config:

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - "server-01.internal.example:9100"
          - "server-02.internal.example:9100"
```

### 3. Add endpoint checks

Use a blackbox probe or equivalent to test HTTPS and TCP reachability for user-facing services.
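The endpoint checks in step 3 can be sketched with the Prometheus `blackbox_exporter`. A minimal scrape job, assuming the exporter runs at `blackbox.internal.example:9115` and an `http_2xx` module is defined in its `blackbox.yml` (the hostnames are placeholders):

```yaml
scrape_configs:
  - job_name: blackbox-https
    metrics_path: /probe
    params:
      module: [http_2xx]   # probe module defined in blackbox.yml
    static_configs:
      - targets:
          - "https://service.internal.example/"
    relabel_configs:
      # Pass the probed URL as the ?target= query parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label on resulting metrics
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the blackbox exporter itself
      - target_label: __address__
        replacement: "blackbox.internal.example:9115"
```

Each probe then exposes a `probe_success` metric per target, which can drive a `probe_success == 0` alert alongside host-level `up == 0` alerts.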
### 4. Add dashboards and alerts

Alert only on conditions that require action, such as:

- Host down
- Disk nearly full
- Backup job missing
- TLS certificate near expiry

## Configuration Example

Example alert concept:

```yaml
groups:
  - name: infrastructure
    rules:
      - alert: HostDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} has been unreachable for 5 minutes"
```

## Troubleshooting Tips

### Metrics are missing for one host

- Check exporter health on that host
- Confirm firewall rules allow scraping
- Verify the target name and port in the Prometheus config

### Alerts are noisy

- Add `for` durations to avoid alerting on short blips
- Remove alerts that never trigger action
- Tune thresholds per service class rather than globally

### Dashboards look healthy while the service is down

- Add blackbox checks in addition to internal metrics
- Monitor the reverse proxy or external entry point, not only the app process
- Track backups and certificate expiry separately from CPU and RAM

## Best Practices

- Monitor the services users depend on, not only the hosts they run on
- Keep alert volume low enough that alerts remain meaningful
- Document the owner and response path for each critical alert
- Treat backup freshness and certificate expiry as first-class signals
- Start simple, then add coverage where operational pain justifies it

## References

- [Prometheus overview](https://prometheus.io/docs/introduction/overview/)
- [Prometheus Alertmanager overview](https://prometheus.io/docs/alerting/latest/overview/)
- [Prometheus `node_exporter`](https://github.com/prometheus/node_exporter)
- [Grafana documentation](https://grafana.com/docs/grafana/latest/)