| title | description | tags | category | created | updated |
|---|---|---|---|---|---|
| Service Monitoring | Guide to building a basic monitoring stack for self-hosted services and infrastructure | | self-hosting | 2026-03-14 | 2026-03-14 |
# Service Monitoring

## Introduction
Monitoring turns a self-hosted environment from a collection of services into an operable system. At minimum, that means collecting metrics, checking service availability, and alerting on failures that need human action.
## Purpose
This guide focuses on:
- Host and service metrics
- Uptime checks
- Dashboards and alerting
- Monitoring coverage for common homelab services
## Architecture Overview
A small monitoring stack often includes:
- Prometheus for scraping metrics
- Exporters such as `node_exporter` for host metrics
- Blackbox probing for endpoint availability
- Grafana for dashboards
- Alertmanager for notifications
Typical flow:
Exporter or target -> Prometheus -> Grafana dashboards
Prometheus alerts -> Alertmanager -> notification channel
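The last hop of that flow, Alertmanager to a notification channel, needs a routing configuration. A minimal sketch of an `alertmanager.yml` is shown below; the webhook URL and receiver name are placeholder assumptions, not part of this guide's stack:

```yaml
# alertmanager.yml (sketch) -- route every alert to one webhook receiver.
route:
  receiver: default
  group_by: [alertname]

receivers:
  - name: default
    webhook_configs:
      # Hypothetical local endpoint; replace with your real channel
      # (email, a chat integration, etc.).
      - url: "http://127.0.0.1:5001/notify"
```

A single catch-all route like this is enough to start; per-severity routing can be layered on once alerts carry meaningful labels.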
## Step-by-Step Guide

### 1. Start with host metrics
Install node_exporter on important Linux hosts or run it in a controlled containerized setup.
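For bare-metal hosts, one common approach is a systemd unit so the exporter survives reboots. This is a sketch: the binary path and dedicated service user are assumptions you should adapt to your install method:

```ini
# /etc/systemd/system/node_exporter.service (sketch)
[Unit]
Description=Prometheus node_exporter
After=network-online.target

[Service]
# Assumes a dedicated unprivileged user and a manually installed binary.
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now node_exporter`, then confirm metrics are served on port 9100.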
### 2. Scrape targets from Prometheus

Example scrape config:

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - "server-01.internal.example:9100"
          - "server-02.internal.example:9100"
```
### 3. Add endpoint checks
Use a blackbox probe or equivalent to test HTTPS and TCP reachability for user-facing services.
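With the Prometheus blackbox_exporter, probing works by having Prometheus scrape the exporter's `/probe` endpoint with the real target passed as a parameter. The sketch below assumes the exporter runs at `blackbox.internal.example:9115` with the standard `http_2xx` module; the probed URL is a placeholder:

```yaml
scrape_configs:
  - job_name: blackbox-https
    metrics_path: /probe
    params:
      module: [http_2xx]   # expects an HTTP 2xx response
    static_configs:
      - targets:
          - "https://app.internal.example"   # hypothetical service URL
    relabel_configs:
      # Move the target URL into the ?target= probe parameter...
      - source_labels: [__address__]
        target_label: __param_target
      # ...keep it visible as the instance label...
      - source_labels: [__param_target]
        target_label: instance
      # ...and point the actual scrape at the blackbox exporter.
      - target_label: __address__
        replacement: "blackbox.internal.example:9115"
```

The resulting `probe_success` metric is what uptime alerts and dashboards should key on.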
### 4. Add dashboards and alerts
Alert only on conditions that require action, such as:
- Host down
- Disk nearly full
- Backup job missing
- TLS certificate near expiry
## Configuration Example

Example alert concept:

```yaml
groups:
  - name: infrastructure
    rules:
      - alert: HostDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
```
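The "disk nearly full" condition from the list above can be expressed the same way. A sketch of an additional rule, using the standard node_exporter filesystem metrics (the 10% threshold and fstype filter are assumptions to tune per environment):

```yaml
      - alert: DiskAlmostFull
        # Fires when less than 10% of a filesystem is free,
        # ignoring ephemeral filesystems.
        expr: >
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes < 0.10
        for: 15m
        labels:
          severity: warning
```

Keeping such rules in the same `infrastructure` group keeps evaluation and review simple while the stack is small.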
## Troubleshooting Tips

### Metrics are missing for one host
- Check exporter health on that host
- Confirm firewall rules allow scraping
- Verify the target name and port in the Prometheus config
### Alerts are noisy

- Add `for` durations to avoid alerting on short blips
- Remove alerts that never trigger action
- Tune thresholds per service class rather than globally
### Dashboards look healthy while the service is down
- Add blackbox checks in addition to internal metrics
- Monitor the reverse proxy or external entry point, not only the app process
- Track backups and certificate expiry separately from CPU and RAM
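Certificate expiry is a good example of a signal that internal metrics miss entirely. If blackbox probes are in place, the exporter's `probe_ssl_earliest_cert_expiry` metric allows a rule like this sketch (the 14-day window is an assumed threshold):

```yaml
      - alert: TLSCertExpiringSoon
        # Seconds until the earliest certificate in the chain expires,
        # compared against a 14-day window.
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
```

This catches renewal failures well before users see browser errors, independent of whether the host itself looks healthy.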
## Best Practices
- Monitor the services users depend on, not only the hosts they run on
- Keep alert volume low enough that alerts remain meaningful
- Document the owner and response path for each critical alert
- Treat backup freshness and certificate expiry as first-class signals
- Start simple, then add coverage where operational pain justifies it