---
title: Service Monitoring
description: Guide to building a basic monitoring stack for self-hosted services and infrastructure
tags:
- monitoring
- self-hosting
- observability
category: self-hosting
created: 2026-03-14
updated: 2026-03-14
---
# Service Monitoring
## Introduction
Monitoring turns a self-hosted environment from a collection of services into an operable system. At minimum, that means collecting metrics, checking service availability, and alerting on failures that need human action.
## Purpose
This guide focuses on:
- Host and service metrics
- Uptime checks
- Dashboards and alerting
- Monitoring coverage for common homelab services
## Architecture Overview
A small monitoring stack often includes:
- Prometheus for scraping metrics
- Exporters such as `node_exporter` for host metrics
- Blackbox probing for endpoint availability
- Grafana for dashboards
- Alertmanager for notifications
Typical flow:
```text
Exporter or target -> Prometheus -> Grafana dashboards
Prometheus alerts -> Alertmanager -> notification channel
```
## Step-by-Step Guide
### 1. Start with host metrics
Install `node_exporter` on each Linux host you want metrics from, either as a native service or as a container that is given read access to the host's filesystem and namespaces.
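For the containerized option, a minimal Compose sketch looks like the following. The host mounts and flags follow the `node_exporter` project's documented container setup; the service name and image tag pinning are choices you should adapt.

```yaml
services:
  node-exporter:
    image: prom/node-exporter:latest   # pin a specific version in practice
    network_mode: host                 # expose :9100 directly on the host
    pid: host                          # required for accurate process metrics
    volumes:
      - /:/host:ro,rslave              # read-only view of the host filesystem
    command:
      - --path.rootfs=/host            # tell the exporter where the host root is mounted
    restart: unless-stopped
```

Running with `network_mode: host` keeps the scrape target address identical to the host's own, which simplifies the Prometheus configuration.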
### 2. Scrape targets from Prometheus
Example scrape config:
```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - "server-01.internal.example:9100"
          - "server-02.internal.example:9100"
```
### 3. Add endpoint checks
Use a blackbox probe or equivalent to test HTTPS and TCP reachability for user-facing services.
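With the Prometheus `blackbox_exporter`, the scrape config uses relabeling to turn each URL into a probe target. A sketch, assuming the exporter is reachable at `blackbox-exporter:9115` and the probed hostnames are placeholders for your own services:

```yaml
scrape_configs:
  - job_name: blackbox-https
    metrics_path: /probe
    params:
      module: [http_2xx]        # module defined in the blackbox exporter config
    static_configs:
      - targets:
          - https://wiki.internal.example
          - https://git.internal.example
    relabel_configs:
      # Pass the original URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label for dashboards and alerts
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself, not the target URL
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

The resulting `probe_success` metric is what you alert on when a user-facing endpoint stops responding.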
### 4. Add dashboards and alerts
Alert only on conditions that require action, such as:
- Host down
- Disk nearly full
- Backup job missing
- TLS certificate near expiry
## Configuration Example
Example alert rule:
```yaml
groups:
  - name: infrastructure
    rules:
      - alert: HostDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
```
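On the Alertmanager side, a small routing tree decides where each alert goes. A minimal sketch, where the receiver names and notification endpoints are placeholders to replace with your own channels:

```yaml
route:
  receiver: default              # fallback for anything not matched below
  group_by: [alertname]
  routes:
    - matchers:
        - severity = "critical"
      receiver: oncall           # critical alerts get the noisy channel

receivers:
  - name: default
    webhook_configs:
      - url: http://notify.internal.example/hook   # placeholder webhook
  - name: oncall
    email_configs:
      - to: admin@example.com                      # placeholder address
```

Even in a homelab, separating critical paging from informational notifications keeps the critical channel trustworthy.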
## Troubleshooting Tips
### Metrics are missing for one host
- Check exporter health on that host
- Confirm firewall rules allow scraping
- Verify the target name and port in the Prometheus config
### Alerts are noisy
- Add `for` durations to avoid alerting on short blips
- Remove alerts that never trigger action
- Tune thresholds per service class rather than globally
### Dashboards look healthy while the service is down
- Add blackbox checks in addition to internal metrics
- Monitor the reverse proxy or external entry point, not only the app process
- Track backups and certificate expiry separately from CPU and RAM
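Backup freshness and certificate expiry can both be expressed as alert rules. A sketch: `probe_ssl_earliest_cert_expiry` is a real metric exposed by the blackbox exporter's TLS probes, while `backup_last_success_timestamp_seconds` is a hypothetical metric your backup job would need to push (for example via the Pushgateway or a textfile collector):

```yaml
groups:
  - name: freshness
    rules:
      - alert: CertificateExpiringSoon
        # Fires when any probed certificate expires within 14 days
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
        for: 1h
        labels:
          severity: warning
      - alert: BackupTooOld
        # Assumes each backup job records a completion timestamp metric
        expr: time() - backup_last_success_timestamp_seconds > 2 * 86400
        for: 1h
        labels:
          severity: warning
```

Both rules alert on staleness rather than failure events, so they also catch jobs that silently stopped running.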
## Best Practices
- Monitor the services users depend on, not only the hosts they run on
- Keep alert volume low enough that alerts remain meaningful
- Document the owner and response path for each critical alert
- Treat backup freshness and certificate expiry as first-class signals
- Start simple, then add coverage where operational pain justifies it
## References
- [Prometheus overview](https://prometheus.io/docs/introduction/overview/)
- [Prometheus Alertmanager overview](https://prometheus.io/docs/alerting/latest/overview/)
- [Prometheus `node_exporter`](https://github.com/prometheus/node_exporter)
- [Grafana documentation](https://grafana.com/docs/grafana/latest/)