---
title: Service Monitoring
description: Guide to building a basic monitoring stack for self-hosted services and infrastructure
tags:
- monitoring
- self-hosting
- observability
category: self-hosting
created: 2026-03-14
updated: 2026-03-14
---
# Service Monitoring
## Introduction
Monitoring turns a self-hosted environment from a collection of services into an operable system. At minimum, that means collecting metrics, checking service availability, and alerting on failures that need human action.
## Purpose
This guide focuses on:
- Host and service metrics
- Uptime checks
- Dashboards and alerting
- Monitoring coverage for common homelab services
## Architecture Overview
A small monitoring stack often includes:
- Prometheus for scraping metrics
- Exporters such as `node_exporter` for host metrics
- Blackbox probing for endpoint availability
- Grafana for dashboards
- Alertmanager for notifications
Typical flow:
```text
Exporter or target -> Prometheus -> Grafana dashboards
Prometheus alerts -> Alertmanager -> notification channel
```
## Step-by-Step Guide
### 1. Start with host metrics
Install `node_exporter` on each Linux host you want metrics from, either as a native service or as a container that is given read access to the host's filesystem and namespaces.
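For the containerized option, a minimal Compose sketch looks like the following. The host mounts and flags follow the `node_exporter` project's documented container setup; the service name and image tag pinning are choices you should adapt.

```yaml
services:
  node-exporter:
    image: prom/node-exporter:latest   # pin a specific version in practice
    network_mode: host                 # expose :9100 directly on the host
    pid: host                          # required for accurate process metrics
    volumes:
      - /:/host:ro,rslave              # read-only view of the host filesystem
    command:
      - --path.rootfs=/host            # tell the exporter where the host root is mounted
    restart: unless-stopped
```

Running with `network_mode: host` keeps the scrape target address identical to the host's own, which simplifies the Prometheus configuration.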
### 2. Scrape targets from Prometheus
Example scrape config:
```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - "server-01.internal.example:9100"
          - "server-02.internal.example:9100"
```
### 3. Add endpoint checks
Use a blackbox probe or equivalent to test HTTPS and TCP reachability for user-facing services.
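With the Prometheus `blackbox_exporter`, the scrape config uses relabeling to turn each URL into a probe target. A sketch, assuming the exporter is reachable at `blackbox-exporter:9115` and the probed hostnames are placeholders for your own services:

```yaml
scrape_configs:
  - job_name: blackbox-https
    metrics_path: /probe
    params:
      module: [http_2xx]        # module defined in the blackbox exporter config
    static_configs:
      - targets:
          - https://wiki.internal.example
          - https://git.internal.example
    relabel_configs:
      # Pass the original URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label for dashboards and alerts
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself, not the target URL
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

The resulting `probe_success` metric is what you alert on when a user-facing endpoint stops responding.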
### 4. Add dashboards and alerts
Alert only on conditions that require action, such as:
- Host down
- Disk nearly full
- Backup job missing
- TLS certificate near expiry
## Configuration Example
Example alert rule:
```yaml
groups:
  - name: infrastructure
    rules:
      - alert: HostDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
```
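On the Alertmanager side, a small routing tree decides where each alert goes. A minimal sketch, where the receiver names and notification endpoints are placeholders to replace with your own channels:

```yaml
route:
  receiver: default              # fallback for anything not matched below
  group_by: [alertname]
  routes:
    - matchers:
        - severity = "critical"
      receiver: oncall           # critical alerts get the noisy channel

receivers:
  - name: default
    webhook_configs:
      - url: http://notify.internal.example/hook   # placeholder webhook
  - name: oncall
    email_configs:
      - to: admin@example.com                      # placeholder address
```

Even in a homelab, separating critical paging from informational notifications keeps the critical channel trustworthy.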
## Troubleshooting Tips
### Metrics are missing for one host
- Check exporter health on that host
- Confirm firewall rules allow scraping
- Verify the target name and port in the Prometheus config
### Alerts are noisy
- Add `for` durations to avoid alerting on short blips
- Remove alerts that never trigger action
- Tune thresholds per service class rather than globally
### Dashboards look healthy while the service is down
- Add blackbox checks in addition to internal metrics
- Monitor the reverse proxy or external entry point, not only the app process
- Track backups and certificate expiry separately from CPU and RAM
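Backup freshness and certificate expiry can both be expressed as alert rules. A sketch: `probe_ssl_earliest_cert_expiry` is a real metric exposed by the blackbox exporter's TLS probes, while `backup_last_success_timestamp_seconds` is a hypothetical metric your backup job would need to push (for example via the Pushgateway or a textfile collector):

```yaml
groups:
  - name: freshness
    rules:
      - alert: CertificateExpiringSoon
        # Fires when any probed certificate expires within 14 days
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
        for: 1h
        labels:
          severity: warning
      - alert: BackupTooOld
        # Assumes each backup job records a completion timestamp metric
        expr: time() - backup_last_success_timestamp_seconds > 2 * 86400
        for: 1h
        labels:
          severity: warning
```

Both rules alert on staleness rather than failure events, so they also catch jobs that silently stopped running.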
## Best Practices
- Monitor the services users depend on, not only the hosts they run on
- Keep alert volume low enough that alerts remain meaningful
- Document the owner and response path for each critical alert
- Treat backup freshness and certificate expiry as first-class signals
- Start simple, then add coverage where operational pain justifies it
## References
- [Prometheus overview](https://prometheus.io/docs/introduction/overview/)
- [Prometheus Alertmanager overview](https://prometheus.io/docs/alerting/latest/overview/)
- [Prometheus `node_exporter`](https://github.com/prometheus/node_exporter)
- [Grafana documentation](https://grafana.com/docs/grafana/latest/)