first version of the knowledge base :)

2026-03-14 11:41:54 +01:00
commit 27965301ad
47 changed files with 4356 additions and 0 deletions


@@ -0,0 +1,58 @@
---
title: Monitoring and Observability
description: Core concepts behind monitoring, alerting, and observability for self-hosted systems
tags:
- monitoring
- observability
- operations
category: infrastructure
created: 2026-03-14
updated: 2026-03-14
---
# Monitoring and Observability
## Summary
Monitoring and observability provide visibility into system health, failure modes, and operational behavior. For self-hosted systems, they turn infrastructure from a black box into an environment that can be maintained intentionally.
## Why it matters
Without visibility, teams discover failures only after users notice them. Observability reduces diagnosis time, helps verify changes safely, and supports day-two operations such as capacity planning and backup validation.
## Core concepts
- Metrics: numerical measurements over time
- Logs: event records produced by systems and applications
- Traces: request-path visibility across components
- Alerting: notifications triggered by actionable failure conditions
- Service-level thinking: monitoring what users experience, not only host resource usage
## Practical usage
A practical starting point often includes:
- Host metrics from exporters
- Availability checks for critical endpoints
- Dashboards for infrastructure and core services
- Alerts for outages, storage pressure, certificate expiry, and failed backups
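One of the conditions above can be sketched as a Prometheus alerting rule. This is a hedged example, not a drop-in config: it assumes the blackbox exporter is probing HTTPS endpoints so that the `probe_ssl_earliest_cert_expiry` metric exists, and the group and alert names are illustrative.

```yaml
# Hypothetical Prometheus rule file; assumes blackbox exporter HTTPS probes.
groups:
  - name: certificates
    rules:
      - alert: CertificateExpiresSoon
        # Fires when the earliest certificate in the chain expires
        # in under 14 days (expiry timestamp minus current time).
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate for {{ $labels.instance }} expires in under 14 days"
```

The `for: 1h` delay keeps momentary scrape gaps from paging anyone, which fits the principle of alerting only on actionable conditions.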
## Best practices
- Monitor both infrastructure health and service reachability
- Alert on conditions that require action
- Keep dashboards focused on questions operators actually ask
- Use monitoring data to validate upgrades and incident recovery
## Pitfalls
- Treating dashboards as a substitute for alerts
- Collecting far more data than anyone reviews
- Monitoring only CPU and RAM while ignoring ingress, DNS, and backups
- Sending noisy alerts that train operators to ignore them
## References
- [Prometheus overview](https://prometheus.io/docs/introduction/overview/)
- [Prometheus Alertmanager overview](https://prometheus.io/docs/alerting/latest/overview/)
- [Grafana documentation](https://grafana.com/docs/grafana/latest/)


@@ -0,0 +1,114 @@
---
title: Proxmox Cluster Basics
description: Overview of how Proxmox VE clusters work, including quorum, networking, and operational constraints
tags:
- proxmox
- virtualization
- clustering
category: infrastructure
created: 2026-03-14
updated: 2026-03-14
---
# Proxmox Cluster Basics
## Introduction
A Proxmox VE cluster groups multiple Proxmox nodes into a shared management domain. This allows centralized administration of virtual machines, containers, storage definitions, and optional high-availability workflows.
## Purpose
Use a Proxmox cluster when you want:
- Centralized management for multiple hypervisor nodes
- Shared visibility of guests, storage, and permissions
- Live migration or controlled workload movement between nodes
- A foundation for HA services backed by shared or replicated storage
## Architecture Overview
A Proxmox cluster relies on several core components:
- `pvecm`: the cluster management tool used to create and join clusters
- Corosync: provides the cluster communication layer
- `pmxcfs`: the Proxmox cluster file system used to distribute cluster configuration
- Quorum: majority voting used to protect cluster consistency
Important operational behavior:
- Each node normally has one vote
- A majority of votes must be online for state-changing operations
- Loss of quorum causes the cluster to become read-only for protected operations
## Cluster Design Notes
### Network requirements
Proxmox expects a reliable low-latency network for cluster traffic. Corosync is sensitive to packet loss, jitter, and unstable links. In homelabs, this generally means wired LAN links, stable switching, and avoiding Wi-Fi for cluster communication.
### Odd node counts
Three nodes is the common minimum for a healthy quorum-based design. Two-node designs can work, but they need extra planning such as a QDevice or acceptance of reduced fault tolerance.
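For the two-node case, Proxmox supports a QDevice as an external tiebreaker vote. A sketch of the setup, assuming a third machine at a placeholder address is already running the `corosync-qnetd` daemon:

```bash
# On both cluster nodes: install the QDevice client package
apt install corosync-qdevice

# From one cluster node: register the external vote
# (192.0.2.50 is a placeholder for the qnetd host)
pvecm qdevice setup 192.0.2.50
```

The QDevice host does not run guests and does not need to be a Proxmox node; it only supplies the third vote.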
### Storage considerations
Clustering does not automatically provide shared storage. Features such as live migration and HA depend on storage design:
- Shared storage: NFS, iSCSI, Ceph, or other shared backends
- Replicated local storage: possible for some workflows, but requires careful planning
- Backup storage: separate from guest runtime storage
## Configuration Example
Create a new cluster on the first node:
```bash
pvecm create lab-cluster
```
Check cluster status:
```bash
pvecm status
```
Join a new node to the cluster. Run this on the joining node, pointing at the address of an existing cluster member:
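Note that `pvecm add` is run on the joining node, with the address of a node that is already a cluster member as its argument.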
```bash
pvecm add 192.0.2.10
```
Use placeholder management addresses in documentation and never expose real administrative IPs publicly.
## Troubleshooting Tips
### Cluster is read-only
- Check quorum status with `pvecm status`
- Look for network instability between nodes
- Verify time synchronization and general host health
### Node join fails
- Confirm name resolution and basic IP reachability
- Make sure cluster traffic is not filtered by a firewall
- Verify the node is not already part of another cluster
### Random cluster instability
- Review packet loss, duplex mismatches, and switch reliability
- Keep corosync on stable wired links with low latency
- Separate heavy storage replication traffic from cluster messaging when possible
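When chasing instability, corosync's own tooling shows link and quorum state directly; a quick inspection sketch (exact output format varies between versions):

```bash
# Show corosync ring/link status from a node
corosync-cfgtool -s

# Show quorum membership and vote counts
corosync-quorumtool -s
```

Comparing these across nodes during an instability window usually narrows the problem to a specific link or member.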
## Best Practices
- Use at least three voting members for a stable quorum model
- Keep cluster traffic on reliable wired networking
- Document node roles, storage backends, and migration dependencies
- Treat the Proxmox management network as a high-trust segment
- Test backup and restore separately from cluster failover assumptions
## References
- [Proxmox VE Administration Guide: Cluster Manager](https://pve.proxmox.com/pve-docs/chapter-pvecm.html)
- [Proxmox VE `pvecm` manual](https://pve.proxmox.com/pve-docs/pvecm.1.html)


@@ -0,0 +1,125 @@
---
title: Reverse Proxy Patterns
description: Common reverse proxy design patterns for self-hosted services and internal platforms
tags:
- reverse-proxy
- networking
- self-hosting
category: infrastructure
created: 2026-03-14
updated: 2026-03-14
---
# Reverse Proxy Patterns
## Introduction
A reverse proxy accepts client requests and forwards them to upstream services. It commonly handles TLS termination, host-based routing, request header forwarding, and policy enforcement in front of self-hosted applications.
## Purpose
Reverse proxies are used to:
- Publish multiple services behind one or a few public entry points
- Centralize TLS certificates
- Apply authentication, authorization, or rate-limiting controls
- Simplify backend service placement and migration
## Architecture Overview
Typical request flow:
```text
Client -> Reverse proxy -> Upstream application
```
Common proxy responsibilities:
- TLS termination and certificate management
- Routing by hostname, path, or protocol
- Forwarding of `Host`, client IP, and other headers
- Optional load balancing across multiple backends
## Common Patterns
### Edge proxy for many internal services
One proxy handles traffic for multiple hostnames:
- `grafana.example.com`
- `gitea.example.com`
- `vault.example.com`
This is a good default for small homelabs and internal platforms.
### Internal proxy behind a VPN
Administrative services are reachable only through a private network such as Tailscale, WireGuard, or a dedicated management VLAN. This reduces public attack surface.
### Path-based routing
Useful when hostnames are limited, but more fragile than host-based routing because some applications assume they live at `/`.
### Dynamic discovery proxy
Tools such as Traefik can watch container metadata and update routes automatically. This reduces manual config for dynamic container environments, but it also makes label hygiene and network policy more important.
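As a hedged illustration of label-driven routing, a Docker Compose fragment might look like the following. It assumes Traefik's Docker provider is enabled and that a certificate resolver named `le` is configured elsewhere; the service name and hostname are placeholders.

```yaml
# Hypothetical Compose fragment; Traefik discovers the route from labels.
services:
  gitea:
    image: gitea/gitea:latest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.gitea.rule=Host(`gitea.example.com`)"
      - "traefik.http.routers.gitea.tls.certresolver=le"
      - "traefik.http.services.gitea.loadbalancer.server.port=3000"
```

Because any container with the right labels can claim a hostname, this pattern is where label hygiene and network policy matter most.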
## Configuration Example
NGINX example:
```nginx
server {
listen 443 ssl http2;
server_name app.example.com;
location / {
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_pass http://127.0.0.1:8080;
}
}
```
Caddy example:
```caddyfile
app.example.com {
reverse_proxy 127.0.0.1:8080
}
```
## Troubleshooting Tips
### Application redirects to the wrong URL
- Check forwarded headers such as `Host` and `X-Forwarded-Proto`
- Verify the application's configured external base URL
- Confirm TLS termination behavior matches application expectations
### WebSocket or streaming traffic fails
- Check proxy support for upgraded connections
- Review buffering behavior if the application expects streaming responses
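For NGINX specifically, upgraded connections need the upgrade headers forwarded explicitly; a sketch for a location serving WebSocket traffic (the `/ws/` path is illustrative):

```nginx
location /ws/ {
    # Required for protocol upgrades (WebSocket)
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_pass http://127.0.0.1:8080;
}
```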
### Backend works locally but not through the proxy
- Verify the proxy can reach the upstream host and port
- Check the proxy network namespace if running in a container
- Confirm firewall rules permit the proxy-to-upstream path
## Best Practices
- Prefer host-based routing over deep path rewriting
- Publish only the services that need an edge entry point
- Keep proxy configuration under version control
- Use separate internal and public entry points when trust boundaries differ
- Standardize upstream headers and base URL settings across applications
## References
- [NGINX: Reverse Proxy](https://docs.nginx.com/nginx/admin-guide/web-server/reverse-proxy/)
- [Traefik: Routing overview](https://doc.traefik.io/traefik/routing/overview/)
- [Caddy: `reverse_proxy` directive](https://caddyserver.com/docs/caddyfile/directives/reverse_proxy)


@@ -0,0 +1,66 @@
---
title: Service Architecture Patterns
description: Common service architecture patterns for self-hosted platforms and small engineering environments
tags:
- architecture
- services
- infrastructure
category: infrastructure
created: 2026-03-14
updated: 2026-03-14
---
# Service Architecture Patterns
## Summary
Service architecture patterns describe how applications are packaged, connected, exposed, and operated. In self-hosted environments, the most useful patterns balance simplicity, isolation, and operability rather than chasing scale for its own sake.
## Why it matters
Architecture decisions affect deployment complexity, failure domains, recovery steps, and long-term maintenance. Small environments benefit from choosing patterns that remain understandable without full-time platform engineering overhead.
## Core concepts
- Single-service deployment: one service per VM or container stack
- Shared platform services: DNS, reverse proxy, monitoring, identity, backups
- Stateful versus stateless workloads
- Explicit ingress, persistence, and dependency boundaries
- Loose coupling through DNS, reverse proxies, and documented interfaces
## Practical usage
Useful patterns for self-hosted systems include:
- Reverse proxy plus multiple backend services
- Dedicated database service with application separation
- Utility VMs or containers for platform services
- Private admin interfaces with public application ingress kept separate
Example dependency view:
```text
Client -> Reverse proxy -> Application -> Database
                                       -> Identity provider
                                       -> Monitoring and logs
```
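The dependency view above can be sketched as a Compose file that makes ingress and persistence boundaries explicit. This is a minimal illustration, not a production config: the application image is a placeholder, and the network split assumes only the proxy should be reachable from outside.

```yaml
# Hypothetical Compose sketch: proxy on the ingress network,
# database reachable only on an internal backend network.
services:
  proxy:
    image: caddy:2
    ports: ["443:443"]
    networks: [ingress]
  app:
    image: example/app:latest   # placeholder application image
    networks: [ingress, backend]
  db:
    image: postgres:16
    volumes:
      - db-data:/var/lib/postgresql/data   # stateful: include in backups
    networks: [backend]
networks:
  ingress: {}
  backend:
    internal: true   # no external routing to the database
volumes:
  db-data: {}
```

The `internal: true` network is one concrete way to keep the database out of any ingress path.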
## Best practices
- Keep stateful services isolated and clearly backed up
- Make ingress paths and dependencies easy to trace
- Reuse shared platform services where they reduce duplication
- Prefer a small number of well-understood patterns across the environment
## Pitfalls
- Putting every service into one giant stack with unclear boundaries
- Mixing public ingress and administrative paths without review
- Scaling architecture complexity before operational need exists
- Depending on undocumented local assumptions between services
## References
- [Martin Fowler: MonolithFirst](https://martinfowler.com/bliki/MonolithFirst.html)
- [The Twelve-Factor App](https://12factor.net/)
- [NGINX: Reverse Proxy](https://docs.nginx.com/nginx/admin-guide/web-server/reverse-proxy/)


@@ -0,0 +1,122 @@
---
title: Service Discovery
description: Concepts and practical patterns for finding services in self-hosted and homelab environments
tags:
- networking
- service-discovery
- dns
category: infrastructure
created: 2026-03-14
updated: 2026-03-14
---
# Service Discovery
## Introduction
Service discovery is the process of locating services by identity instead of hard-coded IP addresses and ports. It becomes more important as workloads move between hosts, IPs change, or multiple service instances exist behind one logical name.
## Purpose
Good service discovery helps with:
- Decoupling applications from fixed network locations
- Supporting scaling and failover
- Simplifying service-to-service communication
- Reducing manual DNS and inventory drift
## Architecture Overview
There are several discovery models commonly used in self-hosted environments:
- Static DNS: manually managed A, AAAA, CNAME, or SRV records
- DNS-based service discovery: clients query DNS or DNS-SD metadata
- mDNS: local-link multicast discovery for small LANs
- Registry-based discovery: a central catalog such as Consul tracks service registration and health
## Discovery Patterns
### Static DNS
Best for stable infrastructure services such as hypervisors, reverse proxies, storage appliances, and monitoring endpoints.
Example:
```text
proxy.internal.example A 192.168.20.10
grafana.internal.example CNAME proxy.internal.example
```
### DNS-SD and mDNS
Useful for local networks where clients need to discover services such as printers or media endpoints. This works well for small trusted LAN segments, but it does not cross routed boundaries cleanly without extra relays or reflectors.
### Registry-based discovery
A service catalog stores registrations and health checks. Clients query the catalog or use DNS interfaces exposed by the registry.
This is useful when:
- Service instances are dynamic
- Health-aware routing matters
- Multiple nodes host the same service
## Configuration Example
Consul service registration example:
```json
{
"service": {
"name": "gitea",
"port": 3000,
"checks": [
{
"http": "http://127.0.0.1:3000/api/healthz",
"interval": "10s"
}
]
}
}
```
DNS-SD example concept:
```text
_https._tcp.internal.example SRV 0 0 443 proxy.internal.example
```
## Troubleshooting Tips
### Clients resolve a name but still fail to connect
- Check whether the resolved port is correct
- Verify firewall policy and reverse proxy routing
- Confirm the service is healthy, not just registered
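A direct TCP connect attempt separates resolution problems from reachability problems; a minimal Python sketch (the function name is illustrative):

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts the connect
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If the name resolves but this check fails, the problem lies with firewalls, routing, or the service itself rather than with discovery.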
### Discovery works on one VLAN but not another
- Review routed DNS access
- Check whether the workload depends on multicast discovery such as mDNS
- Avoid relying on broadcast or multicast across segmented networks unless intentionally supported
### Service records become stale
- Use health checks where possible
- Remove hand-managed DNS entries that no longer match current placements
- Prefer stable canonical names in front of dynamic backends
## Best Practices
- Use DNS as the default discovery mechanism for stable infrastructure
- Add service registries only when the environment is dynamic enough to justify them
- Pair discovery with health checks when multiple instances or failover paths exist
- Keep discovery names human-readable and environment-specific
- Avoid hard-coding IP addresses in application configuration unless there is no realistic alternative
## References
- [Consul: Discover services overview](https://developer.hashicorp.com/consul/docs/discover)
- [Consul: Service discovery explained](https://developer.hashicorp.com/consul/docs/use-case/service-discovery)
- [RFC 6762: Multicast DNS](https://www.rfc-editor.org/rfc/rfc6762)
- [RFC 6763: DNS-Based Service Discovery](https://www.rfc-editor.org/rfc/rfc6763)