first version of the knowledge base :)
121
40 - Guides/self-hosting/backup-strategies.md
Normal file
@@ -0,0 +1,121 @@
---
title: Backup Strategies
description: Practical backup strategy guidance for self-hosted services, containers, and virtualized homelabs
tags:
  - backup
  - self-hosting
  - operations
category: self-hosting
created: 2026-03-14
updated: 2026-03-14
---

# Backup Strategies

## Introduction

Backups protect against deletion, corruption, hardware failure, ransomware, and operational mistakes. In self-hosted environments, a backup strategy should cover both data and the information needed to restore services correctly.

## Purpose

This guide covers:

- What to back up
- How often to back it up
- Where to store copies
- How to validate restore readiness

## Architecture Overview

A good strategy includes:

- Primary data backups
- Configuration and infrastructure backups
- Off-site or offline copies
- Restore testing

The 3-2-1 rule is a strong baseline:

- 3 copies of data
- 2 different media or storage systems
- 1 copy off-site

For higher assurance, also keep one immutable or offline copy and verify that backups restore with zero errors (sometimes called the 3-2-1-1-0 variant).
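
The 3-2-1 baseline can be checked mechanically against an inventory of copies. The following is a minimal sketch, assuming a hypothetical `BackupCopy` record and medium labels that are not part of this guide; the live data counts as one of the three copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str      # hypothetical label, e.g. "nas-restic"
    medium: str    # storage system or media type, e.g. "local-disk"
    offsite: bool  # True if the copy lives outside the primary site


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Return True when the inventory meets the 3-2-1 baseline."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite
```

A check like this is only as good as the inventory behind it, so it complements restore testing rather than replacing it.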

## Step-by-Step Guide

### 1. Inventory what matters

Back up:

- Databases
- Application data directories
- Compose files and infrastructure code
- DNS, reverse proxy, and secrets configuration
- Hypervisor or VM backup metadata

### 2. Choose backup tools by workload

- File-level backups: restic, Borg, rsync-based workflows
- VM backups: hypervisor-integrated backup jobs
- Database-aware backups: logical dumps or physical backup tools where needed

### 3. Schedule and retain intelligently

Use a retention policy that matches recovery needs. Short retention for frequent snapshots and longer retention for off-site backups is common.
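
To see how a retention policy thins out snapshots, the sketch below simulates a simplified keep-daily plus keep-weekly rule, loosely modeled on the style of policy that tools such as restic and Borg offer. The function name and parameters are illustrative assumptions, not any tool's API, and real tools apply additional rules.

```python
from datetime import datetime


def select_kept(snapshots: list[datetime], keep_daily: int, keep_weekly: int) -> set[datetime]:
    """Keep the newest snapshot for each of the most recent days and ISO weeks."""
    kept: set[datetime] = set()
    seen_days: set = set()
    seen_weeks: set = set()
    for snap in sorted(snapshots, reverse=True):  # newest first
        day = snap.date()
        week = snap.isocalendar()[:2]  # (ISO year, ISO week number)
        if day not in seen_days and len(seen_days) < keep_daily:
            seen_days.add(day)
            kept.add(snap)
        if week not in seen_weeks and len(seen_weeks) < keep_weekly:
            seen_weeks.add(week)
            kept.add(snap)
    return kept
```

Running a simulation like this against a planned schedule shows how many snapshots a policy would retain before it is applied to a real repository.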

### 4. Test restores

Backups are incomplete until you can restore and start the service successfully.

## Configuration Example

Restic backup example:

```bash
# Repository location and password file (example paths).
# Assumes the repository was already created with `restic init`.
export RESTIC_REPOSITORY=/backup/restic
export RESTIC_PASSWORD_FILE=/run/secrets/restic_password

# Back up application data and Compose files, then list snapshots
restic backup /srv/app-data /srv/compose
restic snapshots
```

Example restore check:

```bash
# Restore the latest snapshot into a scratch directory for inspection
restic restore latest --target /tmp/restore-check
```
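
A restored tree can then be compared against the source by checksum. This is a minimal sketch with hypothetical helper names. It assumes both trees fit comfortably in memory file by file (large files would be better hashed in chunks). Note that restic typically recreates the original path under the target, so the restored root would be something like `/tmp/restore-check/srv/app-data`.

```python
import hashlib
from pathlib import Path


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_trees(source: Path, restored: Path) -> dict[str, list[str]]:
    """Report files missing from the restore and files whose content differs."""
    src, dst = tree_digests(source), tree_digests(restored)
    return {
        "missing": sorted(set(src) - set(dst)),
        "changed": sorted(p for p in src.keys() & dst.keys() if src[p] != dst[p]),
    }
```

Differences are expected for data that changed after the snapshot was taken, so this works best against a quiesced service or a static test dataset.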

## Troubleshooting Tips

### Backups exist but restores are incomplete

- Confirm databases were backed up consistently (from a dump or snapshot), not copied mid-write
- Verify application config and secret material were included
- Check permissions and ownership in the restored data

### Repository size grows too quickly

- Review retention rules and pruning behavior
- Exclude caches, transient files, and rebuildable artifacts
- Split hot data from archival data if retention needs differ

### Backups run but nobody notices failures

- Alert on backup freshness and last successful run
- Record the restore procedure for each critical service
- Test restores on a schedule, not only after incidents
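
A freshness check can be as small as comparing the last successful run against a maximum age. The sketch below assumes the backup job records a timezone-aware timestamp somewhere you can read it. The function name and the 26-hour threshold in the usage note are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_stale(last_success: datetime, max_age: timedelta,
                    now: Optional[datetime] = None) -> bool:
    """Return True when the last successful backup is older than max_age."""
    current = now if now is not None else datetime.now(timezone.utc)
    return current - last_success > max_age
```

For a nightly job, a threshold such as `timedelta(hours=26)` leaves some slack for slow runs while still catching a missed night.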

## Best Practices

- Back up both data and the configuration needed to use it
- Keep at least one copy outside the main failure domain
- Prefer encrypted backup repositories for off-site storage
- Automate backup jobs and monitor their success
- Practice restores for your most important services first

## References

- [restic documentation](https://restic.readthedocs.io/en/latest/)
- [BorgBackup documentation](https://borgbackup.readthedocs.io/en/stable/)
- [Proxmox VE Backup and Restore](https://pve.proxmox.com/pve-docs/chapter-vzdump.html)
125
40 - Guides/self-hosting/service-monitoring.md
Normal file
@@ -0,0 +1,125 @@
---
title: Service Monitoring
description: Guide to building a basic monitoring stack for self-hosted services and infrastructure
tags:
  - monitoring
  - self-hosting
  - observability
category: self-hosting
created: 2026-03-14
updated: 2026-03-14
---

# Service Monitoring

## Introduction

Monitoring turns a self-hosted environment from a collection of services into an operable system. At minimum, that means collecting metrics, checking service availability, and alerting on failures that need human action.

## Purpose

This guide focuses on:

- Host and service metrics
- Uptime checks
- Dashboards and alerting
- Monitoring coverage for common homelab services

## Architecture Overview

A small monitoring stack often includes:

- Prometheus for scraping metrics
- Exporters such as `node_exporter` for host metrics
- Blackbox probing for endpoint availability
- Grafana for dashboards
- Alertmanager for notifications

Typical flow:

```text
Exporter or target -> Prometheus -> Grafana dashboards
Prometheus alerts -> Alertmanager -> notification channel
```

## Step-by-Step Guide

### 1. Start with host metrics

Install `node_exporter` on important Linux hosts or run it in a controlled containerized setup.

### 2. Scrape targets from Prometheus

Example scrape config:

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - "server-01.internal.example:9100"
          - "server-02.internal.example:9100"
```

### 3. Add endpoint checks

Use a blackbox probe or equivalent to test HTTPS and TCP reachability for user-facing services.
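
To illustrate what a TCP reachability probe does at its simplest, here is a minimal sketch using only the Python standard library. It is not a substitute for a blackbox exporter: it confirms that a port accepts connections but does not validate TLS or HTTP status. The function name is hypothetical.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures
        return False
```

Probing from outside the host, ideally along the same path users take, gives a more honest signal than a check run on the server itself.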

### 4. Add dashboards and alerts

Alert only on conditions that require action, such as:

- Host down
- Disk nearly full
- Backup job missing
- TLS certificate near expiry
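
As one example of turning such a condition into a number, the sketch below computes the days remaining on a certificate from its `notAfter` string, in the format returned by Python's `ssl.SSLSocket.getpeercert()`. The function names and the 14-day threshold are illustrative assumptions.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between now and a certificate's notAfter time (negative if expired)."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def near_expiry(not_after: str, now: datetime, threshold_days: float = 14.0) -> bool:
    """Return True when the certificate expires within threshold_days."""
    return days_until_expiry(not_after, now) < threshold_days
```

The threshold should leave enough time for a human to act, including over a weekend, if automated renewal fails.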

## Configuration Example

Example alert concept:

```yaml
groups:
  - name: infrastructure
    rules:
      - alert: HostDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
```

## Troubleshooting Tips

### Metrics are missing for one host

- Check exporter health on that host
- Confirm firewall rules allow scraping
- Verify the target name and port in the Prometheus config

### Alerts are noisy

- Add `for` durations to avoid alerting on short blips
- Remove alerts that never trigger action
- Tune thresholds per service class rather than globally
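
The effect of a `for` duration can be sketched as a small simulation: an alert fires only once the condition has held continuously for the configured time. This is a simplified model for intuition, not Prometheus's actual implementation, and the function name is hypothetical.

```python
def firing_times(samples: list[tuple[int, bool]], for_seconds: int) -> list[int]:
    """Return the evaluation timestamps at which the alert would be firing.

    samples is a list of (timestamp_seconds, condition_is_true) pairs in time order.
    """
    firing = []
    pending_since = None
    for ts, failing in samples:
        if not failing:
            pending_since = None  # any recovery resets the pending timer
            continue
        if pending_since is None:
            pending_since = ts
        if ts - pending_since >= for_seconds:
            firing.append(ts)
    return firing
```

With a 300-second duration, a single failed evaluation followed by recovery never fires, while a sustained outage fires five minutes after it starts.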

### Dashboards look healthy while the service is down

- Add blackbox checks in addition to internal metrics
- Monitor the reverse proxy or external entry point, not only the app process
- Track backups and certificate expiry separately from CPU and RAM

## Best Practices

- Monitor the services users depend on, not only the hosts they run on
- Keep alert volume low enough that alerts remain meaningful
- Document the owner and response path for each critical alert
- Treat backup freshness and certificate expiry as first-class signals
- Start simple, then add coverage where operational pain justifies it

## References

- [Prometheus overview](https://prometheus.io/docs/introduction/overview/)
- [Prometheus Alertmanager overview](https://prometheus.io/docs/alerting/latest/overview/)
- [Prometheus `node_exporter`](https://github.com/prometheus/node_exporter)
- [Grafana documentation](https://grafana.com/docs/grafana/latest/)
124
40 - Guides/self-hosting/update-management.md
Normal file
@@ -0,0 +1,124 @@
---
title: Update Management
description: Practical update management for Linux hosts, containers, and self-hosted services
tags:
  - updates
  - patching
  - self-hosting
category: self-hosting
created: 2026-03-14
updated: 2026-03-14
---

# Update Management

## Introduction

Update management keeps systems secure and supportable without turning every patch cycle into an outage. In self-hosted environments, the challenge is balancing security, uptime, and limited operator time.

## Purpose

This guide focuses on:

- Operating system updates
- Container and dependency updates
- Scheduling, staging, and rollback planning

## Architecture Overview

A practical update process has four layers:

- Inventory: know what you run
- Detection: know when updates are available
- Deployment: apply updates in a controlled order
- Validation: confirm services still work

## Step-by-Step Guide

### 1. Separate systems by risk

Create update rings such as:

- Ring 1: non-critical test systems
- Ring 2: internal services
- Ring 3: critical stateful services and edge entry points
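
Ring membership can be kept as plain data and turned into a rollout order. This is a minimal sketch in which the host names and the `rollout_order` helper are hypothetical.

```python
def rollout_order(hosts: dict[str, int]) -> list[list[str]]:
    """Group hosts by ring number and return the groups lowest ring first."""
    rings: dict[int, list[str]] = {}
    for host, ring in hosts.items():
        rings.setdefault(ring, []).append(host)
    return [sorted(rings[ring]) for ring in sorted(rings)]
```

For example, `rollout_order({"db-01": 3, "test-01": 1, "wiki-01": 2})` yields the test system first and the database host last, which leaves time to catch regressions before they reach critical services.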

### 2. Automate security updates where safe

For Linux hosts, automated security updates can reduce patch delay for low-risk packages. Review distribution guidance and keep reboots controlled.

### 3. Automate update discovery

Use tools that open reviewable pull requests or dashboards for:

- Container image updates
- Dependency updates
- Operating system patch reporting

### 4. Validate after rollout

Confirm:

- Service health
- Reverse proxy reachability
- Backup jobs
- Monitoring and alerting
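
These confirmations can be collected into a small post-update runner. The sketch below assumes each check is a zero-argument callable returning a boolean. The check names and the `run_checks` helper are hypothetical, and the actual checks (HTTP probes, backup freshness, and so on) are left to the operator.

```python
from typing import Callable


def run_checks(checks: dict[str, Callable[[], bool]]) -> dict[str, bool]:
    """Run each named check and record whether it passed."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a check that crashes counts as a failure
    return results
```

A rollout can then be gated on `all(run_checks(checks).values())` before moving on to the next ring.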

## Configuration Example

Ubuntu unattended upgrades example (typically placed in `/etc/apt/apt.conf.d/20auto-upgrades`):

```text
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
```

Dependency update automation example (Renovate):

```json
{
  "extends": ["config:recommended"],
  "schedule": ["before 5am on monday"],
  "packageRules": [
    {
      "matchUpdateTypes": ["major"],
      "automerge": false
    }
  ]
}
```

## Troubleshooting Tips

### Updates are applied but regressions go unnoticed

- Add post-update health checks
- Review dashboards and key alerts after patch windows
- Keep rollback or restore steps documented for stateful services

### Too many update notifications create fatigue

- Group low-risk updates into maintenance windows
- Separate critical security issues from routine version bumps
- Use labels or dashboards to prioritize by service importance

### Containers stay outdated even though automation exists

- Verify image digests and registry visibility
- Confirm the deployment process actually recreates containers after image updates
- Prefer reviewed rebuild and redeploy workflows over blind runtime mutation for important services

## Best Practices

- Patch internet-exposed and admin-facing services first
- Stage risky or major updates through lower-risk environments
- Prefer reviewable dependency automation over silent uncontrolled updates
- Keep maintenance windows small and predictable
- Document rollback expectations before making large version jumps

## References

- [Ubuntu Community Help Wiki: Automatic Security Updates](https://help.ubuntu.com/community/AutomaticSecurityUpdates)
- [Debian Wiki: UnattendedUpgrades](https://wiki.debian.org/UnattendedUpgrades)
- [Renovate documentation](https://docs.renovatebot.com/)
- [GitHub Docs: Configuring Dependabot version updates](https://docs.github.com/code-security/dependabot/dependabot-version-updates/configuring-dependabot-version-updates)