| title | description | tags | category | created | updated |
|---|---|---|---|---|---|
| Service Monitoring | Guide to building a basic monitoring stack for self-hosted services and infrastructure | | self-hosting | 2026-03-14 | 2026-03-14 |
# Service Monitoring

## Introduction
Monitoring turns a self-hosted environment from a collection of services into an operable system. At minimum, that means collecting metrics, checking service availability, and alerting on failures that need human action.
## Purpose
This guide focuses on:
- Host and service metrics
- Uptime checks
- Dashboards and alerting
- Monitoring coverage for common homelab services
## Architecture Overview
A small monitoring stack often includes:
- Prometheus for scraping metrics
- Exporters such as `node_exporter` for host metrics
- Blackbox probing for endpoint availability
- Grafana for dashboards
- Alertmanager for notifications
Typical flow:
Exporter or target -> Prometheus -> Grafana dashboards
Prometheus alerts -> Alertmanager -> notification channel
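The last hop of that flow, Alertmanager to a notification channel, needs a routing configuration. A minimal sketch of an `alertmanager.yml` is shown below; the webhook URL and receiver name are placeholder assumptions, not part of this guide's stack:

```yaml
# alertmanager.yml (sketch) -- route every alert to one webhook receiver.
route:
  receiver: default
  group_by: [alertname]

receivers:
  - name: default
    webhook_configs:
      # Hypothetical local endpoint; replace with your real channel
      # (email, a chat integration, etc.).
      - url: "http://127.0.0.1:5001/notify"
```

A single catch-all route like this is enough to start; per-severity routing can be layered on once alerts carry meaningful labels.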
## Step-by-Step Guide

### 1. Start with host metrics
Install node_exporter on important Linux hosts or run it in a controlled containerized setup.
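For bare-metal hosts, one common approach is a systemd unit so the exporter survives reboots. This is a sketch: the binary path and dedicated service user are assumptions you should adapt to your install method:

```ini
# /etc/systemd/system/node_exporter.service (sketch)
[Unit]
Description=Prometheus node_exporter
After=network-online.target

[Service]
# Assumes a dedicated unprivileged user and a manually installed binary.
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now node_exporter`, then confirm metrics are served on port 9100.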
### 2. Scrape targets from Prometheus

Example scrape config:

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - "server-01.internal.example:9100"
          - "server-02.internal.example:9100"
```
### 3. Add endpoint checks
Use a blackbox probe or equivalent to test HTTPS and TCP reachability for user-facing services.
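With the Prometheus blackbox_exporter, probing works by having Prometheus scrape the exporter's `/probe` endpoint with the real target passed as a parameter. The sketch below assumes the exporter runs at `blackbox.internal.example:9115` with the standard `http_2xx` module; the probed URL is a placeholder:

```yaml
scrape_configs:
  - job_name: blackbox-https
    metrics_path: /probe
    params:
      module: [http_2xx]   # expects an HTTP 2xx response
    static_configs:
      - targets:
          - "https://app.internal.example"   # hypothetical service URL
    relabel_configs:
      # Move the target URL into the ?target= probe parameter...
      - source_labels: [__address__]
        target_label: __param_target
      # ...keep it visible as the instance label...
      - source_labels: [__param_target]
        target_label: instance
      # ...and point the actual scrape at the blackbox exporter.
      - target_label: __address__
        replacement: "blackbox.internal.example:9115"
```

The resulting `probe_success` metric is what uptime alerts and dashboards should key on.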
### 4. Add dashboards and alerts
Alert only on conditions that require action, such as:
- Host down
- Disk nearly full
- Backup job missing
- TLS certificate near expiry
## Configuration Example

Example alert concept:

```yaml
groups:
  - name: infrastructure
    rules:
      - alert: HostDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
```
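The "disk nearly full" condition from the list above can be expressed the same way. A sketch of an additional rule, using the standard node_exporter filesystem metrics (the 10% threshold and fstype filter are assumptions to tune per environment):

```yaml
      - alert: DiskAlmostFull
        # Fires when less than 10% of a filesystem is free,
        # ignoring ephemeral filesystems.
        expr: >
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes < 0.10
        for: 15m
        labels:
          severity: warning
```

Keeping such rules in the same `infrastructure` group keeps evaluation and review simple while the stack is small.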
## Troubleshooting Tips

### Metrics are missing for one host
- Check exporter health on that host
- Confirm firewall rules allow scraping
- Verify the target name and port in the Prometheus config
### Alerts are noisy

- Add `for` durations to avoid alerting on short blips
- Remove alerts that never trigger action
- Tune thresholds per service class rather than globally
### Dashboards look healthy while the service is down
- Add blackbox checks in addition to internal metrics
- Monitor the reverse proxy or external entry point, not only the app process
- Track backups and certificate expiry separately from CPU and RAM
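Certificate expiry is a good example of a signal that internal metrics miss entirely. If blackbox probes are in place, the exporter's `probe_ssl_earliest_cert_expiry` metric allows a rule like this sketch (the 14-day window is an assumed threshold):

```yaml
      - alert: TLSCertExpiringSoon
        # Seconds until the earliest certificate in the chain expires,
        # compared against a 14-day window.
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
```

This catches renewal failures well before users see browser errors, independent of whether the host itself looks healthy.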
## Best Practices
- Monitor the services users depend on, not only the hosts they run on
- Keep alert volume low enough that alerts remain meaningful
- Document the owner and response path for each critical alert
- Treat backup freshness and certificate expiry as first-class signals
- Start simple, then add coverage where operational pain justifies it