first version of the knowledge base :)

2026-03-14 11:41:54 +01:00
commit 27965301ad
47 changed files with 4356 additions and 0 deletions
--- a/Guides/self-hosting/backup-strategies.md
+++ b/Guides/self-hosting/backup-strategies.md
@@ -0,0 +1,121 @@
+---
+title: Backup Strategies
+description: Practical backup strategy guidance for self-hosted services, containers, and virtualized homelabs
+tags:
+  - backup
+  - self-hosting
+  - operations
+category: self-hosting
+created: 2026-03-14
+updated: 2026-03-14
+---
+
+# Backup Strategies
+
+## Introduction
+
+Backups protect against deletion, corruption, hardware failure, ransomware, and operational mistakes. In self-hosted environments, a backup strategy should cover both data and the information needed to restore services correctly.
+
+## Purpose
+
+This guide covers:
+
+- What to back up
+- How often to back it up
+- Where to store copies
+- How to validate restore readiness
+
+## Architecture Overview
+
+A good strategy includes:
+
+- Primary data backups
+- Configuration and infrastructure backups
+- Off-site or offline copies
+- Restore testing
+
+The 3-2-1 rule is a strong baseline:
+
+- 3 copies of data
+- 2 different media or storage systems
+- 1 copy off-site
+
+For higher assurance, extend this to 3-2-1-1-0: add one immutable or offline copy, and verify that backups restore with zero errors.
+
+## Step-by-Step Guide
+
+### 1. Inventory what matters
+
+Back up:
+
+- Databases
+- Application data directories
+- Compose files and infrastructure code
+- DNS, reverse proxy, and secrets configuration
+- Hypervisor or VM backup metadata
+
+### 2. Choose backup tools by workload
+
+- File-level backups: restic, Borg, rsync-based workflows
+- VM backups: hypervisor-integrated backup jobs
+- Database-aware backups: logical dumps or physical backup tools where needed
+
+### 3. Schedule and retain intelligently
+
+Use a retention policy that matches recovery needs. A common pattern is short retention for frequent local snapshots and longer retention for off-site backups.
+
+### 4. Test restores
+
+A backup is not proven until you have restored it and started the service successfully.
+
+## Configuration Example
+
+Restic backup example:
+
+```bash
+export RESTIC_REPOSITORY=/backup/restic
+export RESTIC_PASSWORD_FILE=/run/secrets/restic_password
+
+restic backup /srv/app-data /srv/compose
+restic snapshots
+```
+
+Example restore check:
+
+```bash
+restic restore latest --target /tmp/restore-check
+```
+
+## Troubleshooting Tips
+
+### Backups exist but restores are incomplete
+
+- Confirm databases were backed up consistently (via a dump or a snapshot), not copied mid-write from live data files
+- Verify application config and secret material were included
+- Check permissions and ownership in the restored data
+
+### Repository size grows too quickly
+
+- Review retention rules and pruning behavior
+- Exclude caches, transient files, and rebuildable artifacts
+- Split hot data from archival data if retention needs differ
+
+### Backups run but nobody notices failures
+
+- Alert on backup freshness and last successful run
+- Record the restore procedure for each critical service
+- Test restores on a schedule, not only after incidents
+
+## Best Practices
+
+- Back up both data and the configuration needed to use it
+- Keep at least one copy outside the main failure domain
+- Prefer encrypted backup repositories for off-site storage
+- Automate backup jobs and monitor their success
+- Practice restores for your most important services first
+
+## References
+
+- [restic documentation](https://restic.readthedocs.io/en/latest/)
+- [BorgBackup documentation](https://borgbackup.readthedocs.io/en/stable/)
+- [Proxmox VE Backup and Restore](https://pve.proxmox.com/pve-docs/chapter-vzdump.html)
--- a/Guides/self-hosting/service-monitoring.md
+++ b/Guides/self-hosting/service-monitoring.md
@@ -0,0 +1,125 @@
+---
+title: Service Monitoring
+description: Guide to building a basic monitoring stack for self-hosted services and infrastructure
+tags:
+  - monitoring
+  - self-hosting
+  - observability
+category: self-hosting
+created: 2026-03-14
+updated: 2026-03-14
+---
+
+# Service Monitoring
+
+## Introduction
+
+Monitoring turns a self-hosted environment from a collection of services into an operable system. At minimum, that means collecting metrics, checking service availability, and alerting on failures that need human action.
+
+## Purpose
+
+This guide focuses on:
+
+- Host and service metrics
+- Uptime checks
+- Dashboards and alerting
+- Monitoring coverage for common homelab services
+
+## Architecture Overview
+
+A small monitoring stack often includes:
+
+- Prometheus for scraping metrics
+- Exporters such as `node_exporter` for host metrics
+- Blackbox probing for endpoint availability
+- Grafana for dashboards
+- Alertmanager for notifications
+
+Typical flow:
+
+```text
+Exporter or target -> Prometheus -> Grafana dashboards
+Prometheus alerts -> Alertmanager -> notification channel
+```
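+
+A minimal sketch of how the second line of that flow might be wired in `prometheus.yml`. It assumes alert rules live under a hypothetical `rules/` directory and that Alertmanager listens on its default port 9093 at a placeholder hostname:
+
+```yaml
+# Load alerting rules and forward firing alerts to Alertmanager.
+# The rules path and the hostname are illustrative assumptions.
+rule_files:
+  - "rules/*.yml"
+
+alerting:
+  alertmanagers:
+    - static_configs:
+        - targets:
+            - "alertmanager.internal.example:9093"
+```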
+
+## Step-by-Step Guide
+
+### 1. Start with host metrics
+
+Install `node_exporter` on important Linux hosts or run it in a controlled containerized setup.
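+
+If the containerized route is chosen, a Compose service along these lines is one option. It is a sketch modeled on the pattern shown in the `node_exporter` README; the service name and the `latest` tag are assumptions, and pinning a specific version is usually safer:
+
+```yaml
+services:
+  node_exporter:
+    image: quay.io/prometheus/node-exporter:latest
+    # Point the exporter at the host filesystem mounted below.
+    command:
+      - "--path.rootfs=/host"
+    # Host network and PID namespace so metrics describe the host, not the container.
+    network_mode: host
+    pid: host
+    restart: unless-stopped
+    volumes:
+      - "/:/host:ro,rslave"
+```
+
+With host networking the exporter listens on port 9100 of the host, which matches the scrape targets used in this guide.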
+
+### 2. Scrape targets from Prometheus
+
+Example scrape config:
+
+```yaml
+scrape_configs:
+  - job_name: node
+    static_configs:
+      - targets:
+          - "server-01.internal.example:9100"
+          - "server-02.internal.example:9100"
+```
+
+### 3. Add endpoint checks
+
+Use blackbox probing (for example, the Prometheus `blackbox_exporter`) to test HTTP(S) and TCP reachability of user-facing services.
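+
+One way to sketch such a check is a Prometheus scrape job that routes each target through `blackbox_exporter`. This assumes the exporter runs at the placeholder address `blackbox.internal.example:9115` with its stock `http_2xx` module; the probed URL is also a placeholder:
+
+```yaml
+scrape_configs:
+  # In practice, append this job to the existing scrape_configs list.
+  - job_name: blackbox-http
+    metrics_path: /probe
+    params:
+      module: [http_2xx]
+    static_configs:
+      - targets:
+          - "https://app.internal.example"
+    relabel_configs:
+      # Pass the listed URL to the exporter as the ?target= parameter.
+      - source_labels: [__address__]
+        target_label: __param_target
+      # Keep the probed URL as the instance label.
+      - source_labels: [__param_target]
+        target_label: instance
+      # Send the actual scrape to the blackbox exporter.
+      - target_label: __address__
+        replacement: "blackbox.internal.example:9115"
+```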
+
+### 4. Add dashboards and alerts
+
+Alert only on conditions that require action, such as:
+
+- Host down
+- Disk nearly full
+- Backup job missing
+- TLS certificate near expiry
+
+## Configuration Example
+
+Example alert concept:
+
+```yaml
+groups:
+  - name: infrastructure
+    rules:
+      - alert: HostDown
+        expr: up == 0
+        for: 5m
+        labels:
+          severity: critical
+```
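+
+The other conditions listed in step 4 can be sketched in the same style. The thresholds and durations below are assumptions to tune, and the backup rule relies on a hypothetical metric that a backup job would have to publish itself:
+
+```yaml
+groups:
+  - name: capacity-and-expiry
+    rules:
+      - alert: DiskNearlyFull
+        # node_exporter filesystem metrics; the 10% threshold is an assumption.
+        expr: |
+          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
+            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
+        for: 15m
+        labels:
+          severity: warning
+      - alert: TLSCertificateExpiringSoon
+        # Needs blackbox probes; the 14-day window is an assumption.
+        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
+        for: 1h
+        labels:
+          severity: warning
+      - alert: BackupMissing
+        # HYPOTHETICAL metric name, not provided by any exporter out of the box.
+        expr: time() - backup_last_success_timestamp_seconds > 26 * 3600
+        labels:
+          severity: critical
+```
+
+Note that `BackupMissing` stays silent if the metric has never been published; a separate rule using `absent()` can cover that case.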
+
+## Troubleshooting Tips
+
+### Metrics are missing for one host
+
+- Check exporter health on that host
+- Confirm firewall rules allow scraping
+- Verify the target name and port in the Prometheus config
+
+### Alerts are noisy
+
+- Add `for` durations to avoid alerting on short blips
+- Remove alerts that never trigger action
+- Tune thresholds per service class rather than globally
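+
+Noise can also be shaped on the Alertmanager side by grouping related alerts and limiting repeat notifications. A sketch of an `alertmanager.yml` fragment, where the receiver name, webhook URL, and timings are all assumptions:
+
+```yaml
+route:
+  receiver: default
+  # Bundle alerts that share these labels into one notification.
+  group_by: ["alertname", "instance"]
+  group_wait: 30s
+  group_interval: 5m
+  # Re-notify for still-firing alerts at most every 4 hours.
+  repeat_interval: 4h
+
+receivers:
+  - name: default
+    webhook_configs:
+      - url: "http://notify.internal.example:8080/alerts"
+```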
+
+### Dashboards look healthy while the service is down
+
+- Add blackbox checks in addition to internal metrics
+- Monitor the reverse proxy or external entry point, not only the app process
+- Track backups and certificate expiry separately from CPU and RAM
+
+## Best Practices
+
+- Monitor the services users depend on, not only the hosts they run on
+- Keep alert volume low enough that alerts remain meaningful
+- Document the owner and response path for each critical alert
+- Treat backup freshness and certificate expiry as first-class signals
+- Start simple, then add coverage where operational pain justifies it
+
+## References
+
+- [Prometheus overview](https://prometheus.io/docs/introduction/overview/)
+- [Prometheus Alertmanager overview](https://prometheus.io/docs/alerting/latest/overview/)
+- [Prometheus `node_exporter`](https://github.com/prometheus/node_exporter)
+- [Grafana documentation](https://grafana.com/docs/grafana/latest/)
--- a/Guides/self-hosting/update-management.md
+++ b/Guides/self-hosting/update-management.md
@@ -0,0 +1,124 @@
+---
+title: Update Management
+description: Practical update management for Linux hosts, containers, and self-hosted services
+tags:
+  - updates
+  - patching
+  - self-hosting
+category: self-hosting
+created: 2026-03-14
+updated: 2026-03-14
+---
+
+# Update Management
+
+## Introduction
+
+Update management keeps systems secure and supportable without turning every patch cycle into an outage. In self-hosted environments, the challenge is balancing security, uptime, and limited operator time.
+
+## Purpose
+
+This guide focuses on:
+
+- Operating system updates
+- Container and dependency updates
+- Scheduling, staging, and rollback planning
+
+## Architecture Overview
+
+A practical update process has four layers:
+
+- Inventory: know what you run
+- Detection: know when updates are available
+- Deployment: apply updates in a controlled order
+- Validation: confirm services still work
+
+## Step-by-Step Guide
+
+### 1. Separate systems by risk
+
+Create update rings such as:
+
+- Ring 1: non-critical test systems
+- Ring 2: internal services
+- Ring 3: critical stateful services and edge entry points
+
+### 2. Automate security updates where safe
+
+For Linux hosts, automated security updates can reduce patch delay for low-risk packages. Review distribution guidance and keep reboots controlled.
+
+### 3. Automate update discovery
+
+Use tools that open reviewable pull requests or dashboards for:
+
+- Container image updates
+- Dependency updates
+- Operating system patch reporting
+
+### 4. Validate after rollout
+
+Confirm:
+
+- Service health
+- Reverse proxy reachability
+- Backup jobs
+- Monitoring and alerting
+
+## Configuration Example
+
+Ubuntu unattended upgrades example (typically in `/etc/apt/apt.conf.d/20auto-upgrades`):
+
+```text
+APT::Periodic::Update-Package-Lists "1";
+APT::Periodic::Unattended-Upgrade "1";
+```
+
+Dependency update automation example (Renovate, `renovate.json`):
+
+```json
+{
+  "extends": ["config:recommended"],
+  "schedule": ["before 5am on monday"],
+  "packageRules": [
+    {
+      "matchUpdateTypes": ["major"],
+      "automerge": false
+    }
+  ]
+}
+```
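+
+A comparable sketch for Dependabot, the other tool listed in the references, would live in `.github/dependabot.yml`. The chosen ecosystems, the root directory, and the weekly interval are assumptions:
+
+```yaml
+version: 2
+updates:
+  # Base images referenced in Dockerfiles at the repository root.
+  - package-ecosystem: "docker"
+    directory: "/"
+    schedule:
+      interval: "weekly"
+  # Actions used in CI workflows.
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      interval: "weekly"
+```
+
+Dependabot opens pull requests rather than merging on its own by default, which fits the reviewable workflow this guide recommends.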
+
+## Troubleshooting Tips
+
+### Updates are applied but regressions go unnoticed
+
+- Add post-update health checks
+- Review dashboards and key alerts after patch windows
+- Keep rollback or restore steps documented for stateful services
+
+### Too many update notifications create fatigue
+
+- Group low-risk updates into maintenance windows
+- Separate critical security issues from routine version bumps
+- Use labels or dashboards to prioritize by service importance
+
+### Containers stay outdated even though automation exists
+
+- Verify that the automation can reach the registry and that it compares image digests, not only tags
+- Confirm the deployment process actually recreates containers after image updates
+- Prefer reviewed rebuild and redeploy workflows over blind runtime mutation for important services
+
+## Best Practices
+
+- Patch internet-exposed and admin-facing services first
+- Stage risky or major updates through lower-risk environments
+- Prefer reviewable dependency automation over silent uncontrolled updates
+- Keep maintenance windows small and predictable
+- Document rollback expectations before making large version jumps
+
+## References
+
+- [Ubuntu Community Help Wiki: Automatic Security Updates](https://help.ubuntu.com/community/AutomaticSecurityUpdates)
+- [Debian Wiki: UnattendedUpgrades](https://wiki.debian.org/UnattendedUpgrades)
+- [Renovate documentation](https://docs.renovatebot.com/)
+- [GitHub Docs: Configuring Dependabot version updates](https://docs.github.com/code-security/dependabot/dependabot-version-updates/configuring-dependabot-version-updates)