---
title: Service Monitoring
description: Guide to building a basic monitoring stack for self-hosted services and infrastructure
tags:
  - monitoring
  - self-hosting
  - observability
category: self-hosting
created: 2026-03-14
updated: 2026-03-14
---

# Service Monitoring

## Introduction

Monitoring turns a self-hosted environment from a collection of services into an operable system. At a minimum, that means collecting metrics, checking service availability, and alerting on failures that need human action.

## Purpose

This guide focuses on:

- Host and service metrics
- Uptime checks
- Dashboards and alerting
- Monitoring coverage for common homelab services

## Architecture Overview

A small monitoring stack often includes:

- Prometheus for scraping metrics
- Exporters such as `node_exporter` for host metrics
- Blackbox probing for endpoint availability
- Grafana for dashboards
- Alertmanager for notifications

Typical flow:

```text
Exporter or target -> Prometheus -> Grafana dashboards
Prometheus alerts -> Alertmanager -> notification channel
```

## Step-by-Step Guide

### 1. Start with host metrics

Install `node_exporter` on the Linux hosts you care about, or run it in a container that is given read-only access to the host's root filesystem and process namespace so it can report genuine host-level metrics.
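
The containerized option can be sketched as a Compose service. The image tag and mount paths follow common conventions from the `node_exporter` README, but treat them as assumptions to adapt to your environment:

```yaml
# Sketch: node_exporter in a container with host visibility.
# Image tag and mounts are assumptions; adjust as needed.
services:
  node_exporter:
    image: quay.io/prometheus/node-exporter:v1.8.1
    network_mode: host   # exposes :9100 directly on the host
    pid: host            # allows reading host process information
    restart: unless-stopped
    volumes:
      - /:/host:ro,rslave  # read-only view of the host root filesystem
    command:
      - --path.rootfs=/host
```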

### 2. Scrape targets from Prometheus

Example scrape config:

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - "server-01.internal.example:9100"
          - "server-02.internal.example:9100"
```

### 3. Add endpoint checks

Use a blackbox probe or equivalent to test HTTPS and TCP reachability for user-facing services.
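
With the Prometheus `blackbox_exporter`, an endpoint check is a scrape job that relabels each target into a probe request. A sketch, with the hostnames below as placeholders:

```yaml
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]   # probe module defined in blackbox_exporter's config
    static_configs:
      - targets:
          - "https://service.internal.example"   # placeholder service URL
    relabel_configs:
      # Move the target URL into the ?target= probe parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label for dashboards and alerts
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself, not the target
      - target_label: __address__
        replacement: "blackbox-exporter.internal.example:9115"
```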

### 4. Add dashboards and alerts

Alert only on conditions that require action, such as:

- Host down
- Disk nearly full
- Backup job missing
- TLS certificate near expiry
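
For the dashboard side, Grafana can be pointed at Prometheus through file-based provisioning. A minimal sketch, assuming Prometheus is reachable at the hostname shown:

```yaml
# Placed under Grafana's provisioning directory, e.g.
# provisioning/datasources/prometheus.yaml (URL below is an assumption)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.internal.example:9090
    isDefault: true
```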

## Configuration Example

Example alert rule:

```yaml
groups:
  - name: infrastructure
    rules:
      - alert: HostDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
```
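
The other alert conditions listed earlier can be written in the same style. A sketch assuming `node_exporter` filesystem metrics and blackbox TLS probes, with thresholds chosen for illustration rather than as recommendations:

```yaml
groups:
  - name: capacity-and-certs
    rules:
      - alert: DiskAlmostFull
        # Less than 10% free space on any reported filesystem
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 30m
        labels:
          severity: warning
      - alert: CertificateExpiringSoon
        # blackbox_exporter exposes the earliest cert expiry as a Unix timestamp
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
```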
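
Alertmanager then decides where each alert goes. A routing sketch, with both receiver endpoints as placeholders:

```yaml
route:
  receiver: chat                     # default destination for all alerts
  group_by: [alertname, instance]
  routes:
    - matchers:
        - severity="critical"
      receiver: pager                # critical alerts take the paging path
receivers:
  - name: chat
    webhook_configs:
      - url: "https://notify.internal.example/chat"   # placeholder endpoint
  - name: pager
    webhook_configs:
      - url: "https://notify.internal.example/pager"  # placeholder endpoint
```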

## Troubleshooting Tips

### Metrics are missing for one host

- Check exporter health on that host
- Confirm firewall rules allow scraping
- Verify the target name and port in the Prometheus config

### Alerts are noisy

- Add `for` durations to avoid alerting on short blips
- Remove alerts that never trigger action
- Tune thresholds per service class rather than globally

### Dashboards look healthy while the service is down

- Add blackbox checks in addition to internal metrics
- Monitor the reverse proxy or external entry point, not only the app process
- Track backups and certificate expiry separately from CPU and RAM

## Best Practices

- Monitor the services users depend on, not only the hosts they run on
- Keep alert volume low enough that alerts remain meaningful
- Document the owner and response path for each critical alert
- Treat backup freshness and certificate expiry as first-class signals
- Start simple, then add coverage where operational pain justifies it

## References

- [Prometheus overview](https://prometheus.io/docs/introduction/overview/)
- [Prometheus Alertmanager overview](https://prometheus.io/docs/alerting/latest/overview/)
- [Prometheus `node_exporter`](https://github.com/prometheus/node_exporter)
- [Grafana documentation](https://grafana.com/docs/grafana/latest/)