first version of the knowledge base :)

2026-03-14 11:41:54 +01:00
commit 27965301ad
47 changed files with 4356 additions and 0 deletions

---
title: Backup Strategies
description: Practical backup strategy guidance for self-hosted services, containers, and virtualized homelabs
tags:
- backup
- self-hosting
- operations
category: self-hosting
created: 2026-03-14
updated: 2026-03-14
---
# Backup Strategies
## Introduction
Backups protect against deletion, corruption, hardware failure, ransomware, and operational mistakes. In self-hosted environments, a backup strategy should cover both data and the information needed to restore services correctly.
## Purpose
This guide covers:
- What to back up
- How often to back it up
- Where to store copies
- How to validate restore readiness
## Architecture Overview
A good strategy includes:
- Primary data backups
- Configuration and infrastructure backups
- Off-site or offline copies
- Restore testing
The 3-2-1 rule is a strong baseline:
- 3 copies of data
- 2 different media or storage systems
- 1 copy off-site
For higher assurance, extend this to 3-2-1-1-0: keep one additional copy immutable or offline, and verify backups with zero errors before counting them as good.
## Step-by-Step Guide
### 1. Inventory what matters
Back up:
- Databases
- Application data directories
- Compose files and infrastructure code
- DNS, reverse proxy, and secrets configuration
- Hypervisor or VM backup metadata
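The inventory step can be scripted so drift is caught early. A minimal sketch; the candidate paths are examples, replace them with your own inventory:

```shell
#!/bin/sh
# Report which backup candidates actually exist on this host, so a renamed
# or moved data directory does not silently drop out of the backup set.
CANDIDATES="/srv/app-data /srv/compose /etc/nginx"

PRESENT=""
MISSING=""
for path in $CANDIDATES; do
    if [ -e "$path" ]; then
        PRESENT="$PRESENT $path"
    else
        MISSING="$MISSING $path"
    fi
done

echo "present:$PRESENT"
echo "missing:$MISSING"
```

Running this before each backup job turns a quiet coverage gap into a visible "missing" line.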
### 2. Choose backup tools by workload
- File-level backups: restic, Borg, rsync-based workflows
- VM backups: hypervisor-integrated backup jobs
- Database-aware backups: logical dumps or physical backup tools where needed
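For database-aware backups, a logical dump taken before the file-level run keeps the database copy consistent instead of mid-write. A sketch, assuming a local PostgreSQL database and role both named `app` (placeholders); the script leaves a marker when `pg_dump` is unavailable:

```shell
#!/bin/sh
# Dump the database to a dated file so the file-level backup picks up a
# consistent copy. Database name and user "app" are placeholders.
DUMP_DIR=${DUMP_DIR:-$(mktemp -d)}
DUMP_FILE="$DUMP_DIR/app-$(date +%F).sql"

if command -v pg_dump >/dev/null 2>&1; then
    pg_dump -U app app > "$DUMP_FILE"
else
    # No pg_dump on this host; record the gap instead of failing silently.
    echo "-- pg_dump unavailable, dump skipped" > "$DUMP_FILE"
fi

echo "dump written to $DUMP_FILE"
```

Point the file-level backup tool at `$DUMP_DIR` so the dump travels with the rest of the data.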
### 3. Schedule and retain intelligently
Use a retention policy that matches recovery needs. Short retention for frequent snapshots and longer retention for off-site backups is common.
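A retention policy along those lines can be expressed with restic's `forget` command; the keep values below are illustrative, not a recommendation:

```shell
#!/bin/sh
# Apply retention after each backup run: frequent recent snapshots,
# sparser older ones. Tune the keep values to your recovery needs.
KEEP="--keep-daily 7 --keep-weekly 4 --keep-monthly 6"

if command -v restic >/dev/null 2>&1; then
    # forget drops snapshots outside the policy; --prune reclaims the space.
    restic forget $KEEP --prune || echo "retention run failed, check repository settings"
else
    echo "restic not installed here, retention skipped"
fi
```

Run retention from the same job as the backup so the two cannot drift apart in scheduling.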
### 4. Test restores
Backups are incomplete until you can restore and start the service successfully.
## Configuration Example
Restic backup example:
```bash
export RESTIC_REPOSITORY=/backup/restic
export RESTIC_PASSWORD_FILE=/run/secrets/restic_password
restic backup /srv/app-data /srv/compose
restic snapshots
```
Example restore check:
```bash
restic restore latest --target /tmp/restore-check
```
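Restore tests can be complemented with repository verification: `restic check` validates repository structure, and `--read-data-subset` spot-checks a fraction of the stored data without reading everything. A guarded sketch:

```shell
#!/bin/sh
# Verify repository structure plus a 10% sample of the stored data blobs.
if command -v restic >/dev/null 2>&1; then
    restic check --read-data-subset=10% && MSG="repository verified" || MSG="verification failed"
else
    MSG="restic not installed here, verification skipped"
fi
echo "$MSG"
```

A periodic sampled check catches silent repository corruption long before a full restore would.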
## Troubleshooting Tips
### Backups exist but restores are incomplete
- Confirm databases were backed up consistently, not mid-write without support
- Verify application config and secret material were included
- Check permissions and ownership in the restored data
### Repository size grows too quickly
- Review retention rules and pruning behavior
- Exclude caches, transient files, and rebuildable artifacts
- Split hot data from archival data if retention needs differ
### Backups run but nobody notices failures
- Alert on backup freshness and last successful run
- Record the restore procedure for each critical service
- Test restores on a schedule, not only after incidents
## Best Practices
- Back up both data and the configuration needed to use it
- Keep at least one copy outside the main failure domain
- Prefer encrypted backup repositories for off-site storage
- Automate backup jobs and monitor their success
- Practice restores for your most important services first
## References
- [restic documentation](https://restic.readthedocs.io/en/latest/)
- [BorgBackup documentation](https://borgbackup.readthedocs.io/en/stable/)
- [Proxmox VE Backup and Restore](https://pve.proxmox.com/pve-docs/chapter-vzdump.html)

---
title: Service Monitoring
description: Guide to building a basic monitoring stack for self-hosted services and infrastructure
tags:
- monitoring
- self-hosting
- observability
category: self-hosting
created: 2026-03-14
updated: 2026-03-14
---
# Service Monitoring
## Introduction
Monitoring turns a self-hosted environment from a collection of services into an operable system. At minimum, that means collecting metrics, checking service availability, and alerting on failures that need human action.
## Purpose
This guide focuses on:
- Host and service metrics
- Uptime checks
- Dashboards and alerting
- Monitoring coverage for common homelab services
## Architecture Overview
A small monitoring stack often includes:
- Prometheus for scraping metrics
- Exporters such as `node_exporter` for host metrics
- Blackbox probing for endpoint availability
- Grafana for dashboards
- Alertmanager for notifications
Typical flow:
```text
Exporter or target -> Prometheus -> Grafana dashboards
Prometheus alerts -> Alertmanager -> notification channel
```
## Step-by-Step Guide
### 1. Start with host metrics
Install `node_exporter` on important Linux hosts or run it in a controlled containerized setup.
### 2. Scrape targets from Prometheus
Example scrape config:
```yaml
scrape_configs:
- job_name: node
static_configs:
- targets:
- "server-01.internal.example:9100"
- "server-02.internal.example:9100"
```
### 3. Add endpoint checks
Use a blackbox probe or equivalent to test HTTPS and TCP reachability for user-facing services.
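With the blackbox exporter, Prometheus passes each probe target to the exporter as a URL parameter via relabeling. A sketch, assuming the exporter is reachable at `blackbox:9115` and an `http_2xx` module is defined in its configuration (both assumptions):

```yaml
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - "https://app.internal.example"
    relabel_configs:
      # Move the probe target into the ?target= parameter...
      - source_labels: [__address__]
        target_label: __param_target
      # ...keep it as the instance label for dashboards and alerts...
      - source_labels: [__param_target]
        target_label: instance
      # ...and scrape the exporter itself, not the target.
      - target_label: __address__
        replacement: "blackbox:9115"
```

The resulting `probe_success` metric is what availability alerts should key on.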
### 4. Add dashboards and alerts
Alert only on conditions that require action, such as:
- Host down
- Disk nearly full
- Backup job missing
- TLS certificate near expiry
## Configuration Example
Example alert concept:
```yaml
groups:
- name: infrastructure
rules:
- alert: HostDown
expr: up == 0
for: 5m
labels:
severity: critical
```
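The same pattern extends to the other conditions listed above. The thresholds here are illustrative, and the certificate rule assumes blackbox probing is in place (`probe_ssl_earliest_cert_expiry` comes from the blackbox exporter):

```yaml
groups:
  - name: capacity-and-expiry
    rules:
      - alert: DiskNearlyFull
        # Less than 10% free on real filesystems, sustained for 30 minutes.
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes < 0.10
        for: 30m
        labels:
          severity: warning
      - alert: CertificateExpiringSoon
        # Certificate expires within 14 days.
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
```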
## Troubleshooting Tips
### Metrics are missing for one host
- Check exporter health on that host
- Confirm firewall rules allow scraping
- Verify the target name and port in the Prometheus config
### Alerts are noisy
- Add `for` durations to avoid alerting on short blips
- Remove alerts that never trigger action
- Tune thresholds per service class rather than globally
### Dashboards look healthy while the service is down
- Add blackbox checks in addition to internal metrics
- Monitor the reverse proxy or external entry point, not only the app process
- Track backups and certificate expiry separately from CPU and RAM
## Best Practices
- Monitor the services users depend on, not only the hosts they run on
- Keep alert volume low enough that alerts remain meaningful
- Document the owner and response path for each critical alert
- Treat backup freshness and certificate expiry as first-class signals
- Start simple, then add coverage where operational pain justifies it
## References
- [Prometheus overview](https://prometheus.io/docs/introduction/overview/)
- [Prometheus Alertmanager overview](https://prometheus.io/docs/alerting/latest/overview/)
- [Prometheus `node_exporter`](https://github.com/prometheus/node_exporter)
- [Grafana documentation](https://grafana.com/docs/grafana/latest/)

---
title: Update Management
description: Practical update management for Linux hosts, containers, and self-hosted services
tags:
- updates
- patching
- self-hosting
category: self-hosting
created: 2026-03-14
updated: 2026-03-14
---
# Update Management
## Introduction
Update management keeps systems secure and supportable without turning every patch cycle into an outage. In self-hosted environments, the challenge is balancing security, uptime, and limited operator time.
## Purpose
This guide focuses on:
- Operating system updates
- Container and dependency updates
- Scheduling, staging, and rollback planning
## Architecture Overview
A practical update process has four layers:
- Inventory: know what you run
- Detection: know when updates are available
- Deployment: apply updates in a controlled order
- Validation: confirm services still work
## Step-by-Step Guide
### 1. Separate systems by risk
Create update rings such as:
- Ring 1: non-critical test systems
- Ring 2: internal services
- Ring 3: critical stateful services and edge entry points
### 2. Automate security updates where safe
For Linux hosts, automated security updates can reduce patch delay for low-risk packages. Review distribution guidance and keep reboots controlled.
### 3. Automate update discovery
Use tools that open reviewable pull requests or dashboards for:
- Container image updates
- Dependency updates
- Operating system patch reporting
### 4. Validate after rollout
Confirm:
- Service health
- Reverse proxy reachability
- Backup jobs
- Monitoring and alerting
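That validation pass can be scripted so it runs the same way after every patch window. A sketch; the health endpoint and backup path are placeholders:

```shell
#!/bin/sh
# Post-update validation: run a fixed set of checks and count failures.
FAILURES=0

check() {
    desc=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "ok:   $desc"
    else
        echo "FAIL: $desc"
        FAILURES=$((FAILURES + 1))
    fi
}

check "shell sanity"                true
check "app reachable through proxy" curl -fsS --max-time 5 https://app.internal.example/health
check "backup repository present"   test -d /backup/restic

echo "failures: $FAILURES"
```

A nonzero failure count is the trigger to pause the rollout before the next update ring.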
## Configuration Example
Ubuntu unattended upgrades example (`/etc/apt/apt.conf.d/20auto-upgrades`):
```text
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
```
To keep reboots controlled, also check `Unattended-Upgrade::Automatic-Reboot` in `/etc/apt/apt.conf.d/50unattended-upgrades` and leave it at `"false"` unless you schedule reboot windows.
Dependency update automation example:
```json
{
"extends": ["config:recommended"],
"schedule": ["before 5am on monday"],
"packageRules": [
{
"matchUpdateTypes": ["major"],
"automerge": false
}
]
}
```
## Troubleshooting Tips
### Updates are applied but regressions go unnoticed
- Add post-update health checks
- Review dashboards and key alerts after patch windows
- Keep rollback or restore steps documented for stateful services
### Too many update notifications create fatigue
- Group low-risk updates into maintenance windows
- Separate critical security issues from routine version bumps
- Use labels or dashboards to prioritize by service importance
### Containers stay outdated even though automation exists
- Verify image digests and registry visibility
- Confirm the deployment process actually recreates containers after image updates
- Prefer reviewed rebuild and redeploy workflows over blind runtime mutation for important services
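For Compose-managed services, the recreate step is explicit: pull, then `up -d`, which only recreates containers whose image or configuration changed. A guarded sketch with a placeholder project path:

```shell
#!/bin/sh
# Recreate containers after pulling newer images so updates actually take
# effect. /srv/compose/app is a placeholder project directory.
PROJECT_DIR=${PROJECT_DIR:-/srv/compose/app}

if command -v docker >/dev/null 2>&1 && [ -d "$PROJECT_DIR" ]; then
    cd "$PROJECT_DIR" || exit 1
    docker compose pull
    docker compose up -d
    RESULT="redeployed"
else
    RESULT="docker or project directory unavailable, skipped"
fi
echo "$RESULT"
```

Running this from a reviewed pipeline, rather than mutating containers in place, keeps the deployed state reproducible.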
## Best Practices
- Patch internet-exposed and admin-facing services first
- Stage risky or major updates through lower-risk environments
- Prefer reviewable dependency automation over silent uncontrolled updates
- Keep maintenance windows small and predictable
- Document rollback expectations before making large version jumps
## References
- [Ubuntu Community Help Wiki: Automatic Security Updates](https://help.ubuntu.com/community/AutomaticSecurityUpdates)
- [Debian Wiki: UnattendedUpgrades](https://wiki.debian.org/UnattendedUpgrades)
- [Renovate documentation](https://docs.renovatebot.com/)
- [GitHub Docs: Configuring Dependabot version updates](https://docs.github.com/code-security/dependabot/dependabot-version-updates/configuring-dependabot-version-updates)