security fixes

2025-12-28 19:55:05 +00:00
parent 4a3ddec68c
commit f94d21580c
15 changed files with 2549 additions and 46 deletions
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -0,0 +1,419 @@
+# Security Guidelines for OpenRabbit
+
+This document outlines security best practices and requirements for OpenRabbit development.
+
+## Table of Contents
+
+1. [Workflow Security](#workflow-security)
+2. [Webhook Data Handling](#webhook-data-handling)
+3. [Input Validation](#input-validation)
+4. [Secret Management](#secret-management)
+5. [Security Scanning](#security-scanning)
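+6. [Webhook Signature Validation](#webhook-signature-validation)
+7. [Reporting Vulnerabilities](#reporting-vulnerabilities)
+8. [Security Checklist for Contributors](#security-checklist-for-contributors)
+9. [Security Tools](#security-tools)
+10. [Best Practices Summary](#best-practices-summary)
+11. [Updates and Maintenance](#updates-and-maintenance)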
+
+---
+
+## Workflow Security
+
+### Principle: Minimize Data Exposure
+
+**Problem:** GitHub Actions/Gitea Actions can expose sensitive data through:
+- Environment variables visible in logs
+- Debug output
+- Error messages
+- Process listings
+
+**Solution:** Use minimal data in workflows and sanitize all inputs.
+
+### ❌ Bad: Exposing Full Webhook Data
+
+```yaml
+# NEVER DO THIS - exposes all user data, emails, tokens
+env:
+  EVENT_JSON: ${{ toJSON(github.event) }}
+run: |
+  python process.py "$EVENT_JSON"
+```
+
+**Why this is dangerous:**
+- Full webhook payloads can contain user emails, private repo URLs, installation tokens
+- Data appears in workflow logs if debug mode is enabled
+- Environment variables can be dumped by malicious code
+- Violates principle of least privilege
+
+### ✅ Good: Minimal Data Extraction
+
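+```yaml
+# SAFE: Only extract necessary fields. Untrusted text goes through env,
+# never directly into the script, so a quote in a comment cannot break out.
+env:
+  ISSUE_NUMBER: ${{ github.event.issue.number }}
+  COMMENT_BODY: ${{ github.event.comment.body }}
+run: |
+  # jq builds the JSON and escapes the body; --argjson keeps the number numeric
+  EVENT_DATA=$(jq -n \
+    --argjson number "$ISSUE_NUMBER" \
+    --arg body "$COMMENT_BODY" \
+    '{issue: {number: $number}, comment: {body: $body}}')
+  python utils/safe_dispatch.py issue_comment "$REPO" "$EVENT_DATA"
+```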
+
+**Why this is safe:**
+- Only includes necessary fields (number, body)
+- Agents fetch full data from API with proper auth
+- Reduces attack surface
+- Follows data minimization principle
+
+### Input Validation Requirements
+
+All workflow inputs MUST be validated before use:
+
+1. **Repository Format**
+   ```bash
+   # Validate owner/repo format
+   if ! echo "$REPO" | grep -qE '^[a-zA-Z0-9_-]+/[a-zA-Z0-9_-]+$'; then
+       echo "Error: Invalid repository format"
+       exit 1
+   fi
+   ```
+
+2. **Numeric Inputs**
+   ```bash
+   # Validate issue/PR numbers are numeric
+   if ! [[ "$ISSUE_NUMBER" =~ ^[0-9]+$ ]]; then
+       echo "Error: Invalid issue number"
+       exit 1
+   fi
+   ```
+
+3. **String Sanitization**
+   ```bash
+   # Use jq for JSON string escaping (printf avoids echo's option parsing and added newline)
+   BODY=$(printf '%s' "$RAW_BODY" | jq -Rs .)
+   ```
+
+### Boolean Comparison
+
+```bash
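+# ❌ WRONG: String comparison on a value not guaranteed to be the literal "true"
+if [ "$IS_PR" = "true" ]; then
+    echo "pull request"
+fi
+
+# ✅ CORRECT: Evaluate the condition in a workflow expression, which renders as "true" or "false"
+IS_PR="${{ gitea.event.issue.pull_request != null }}"
+if [ "$IS_PR" = "true" ]; then
+    echo "pull request"
+fi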
+```
+
+---
+
+## Webhook Data Handling
+
+### Using the Sanitization Utilities
+
+Always use `utils/webhook_sanitizer.py` when handling webhook data:
+
+```python
+from utils.webhook_sanitizer import (
+    sanitize_webhook_data,
+    validate_repository_format,
+    extract_minimal_context,
+)
+
+# Sanitize data before logging or storing
+sanitized = sanitize_webhook_data(raw_event_data)
+
+# Extract only necessary fields
+minimal = extract_minimal_context(event_type, sanitized)
+
+# Validate repository input
+owner, repo = validate_repository_format(repo_string)
+```
+
+### Sensitive Fields (Automatically Redacted)
+
+The sanitizer removes these fields:
+- `email`, `private_email`, `email_addresses`
+- `token`, `access_token`, `refresh_token`, `api_key`
+- `secret`, `password`, `private_key`, `ssh_key`
+- `phone`, `address`, `ssn`, `credit_card`
+- `installation_id`, `node_id`
+
+### Large Field Truncation
+
+These fields are truncated to prevent log flooding:
+- `body`: 500 characters
+- `description`: 500 characters
+- `message`: 500 characters
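+
+A minimal sketch of how redaction and truncation could be combined, assuming a recursive walk over the decoded payload. The names `redact_and_truncate`, `SENSITIVE_KEYS`, and `TRUNCATE_LIMITS` are hypothetical and the key set is abbreviated; the actual logic in `utils/webhook_sanitizer.py` may differ.
+
+```python
+from typing import Any
+
+# Hypothetical, abbreviated key set; the full list is given above.
+SENSITIVE_KEYS = {"email", "token", "access_token", "api_key", "secret", "password"}
+TRUNCATE_LIMITS = {"body": 500, "description": 500, "message": 500}
+
+
+def redact_and_truncate(value: Any) -> Any:
+    """Return a copy of value with sensitive keys dropped and long text cut."""
+    if isinstance(value, dict):
+        cleaned = {}
+        for key, item in value.items():
+            if key.lower() in SENSITIVE_KEYS:
+                continue  # drop the field entirely
+            if key in TRUNCATE_LIMITS and isinstance(item, str):
+                cleaned[key] = item[: TRUNCATE_LIMITS[key]]
+            else:
+                cleaned[key] = redact_and_truncate(item)
+        return cleaned
+    if isinstance(value, list):
+        return [redact_and_truncate(item) for item in value]
+    return value
+```
+
+The function builds a new structure instead of mutating its input, so the raw event stays available to code that legitimately needs it.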
+
+---
+
+## Input Validation
+
+### Repository Name Validation
+
+```python
+from utils.webhook_sanitizer import validate_repository_format
+
+try:
+    owner, repo = validate_repository_format(user_input)
+except ValueError as e:
+    logger.error(f"Invalid repository: {e}")
+    return
+```
+
+**Checks performed:**
+- Format is `owner/repo`
+- No path traversal (`..`)
+- No shell injection characters (`;`, `|`, `&`, `` ` ``, etc.)
+- Non-empty owner and repo name
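+
+The checks above can be sketched as an allow-list validator. This is an illustration only: `check_repository_format` is a hypothetical name, the character class is borrowed from the shell regex in the Workflow Security section, and the real `validate_repository_format` may apply different rules.
+
+```python
+import re
+
+# Assumed allow-list: letters, digits, underscore, and hyphen.
+_NAME = re.compile(r"[A-Za-z0-9_-]+")
+
+
+def check_repository_format(repo: str) -> tuple[str, str]:
+    """Split 'owner/repo' and reject anything outside the allow-list."""
+    parts = repo.split("/")
+    if len(parts) != 2:
+        raise ValueError("expected exactly one '/' separating owner and repo")
+    owner, name = parts
+    for part in (owner, name):
+        if not part:
+            raise ValueError("owner and repo name must be non-empty")
+        if ".." in part:
+            raise ValueError("path traversal sequence '..' is not allowed")
+        if not _NAME.fullmatch(part):
+            raise ValueError(f"invalid characters in {part!r}")
+    return owner, name
+```
+
+An allow-list rejects shell metacharacters by construction, which is easier to audit than enumerating every dangerous character.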
+
+### Event Data Size Limits
+
+```python
+# Maximum event size: 10MB
+MAX_EVENT_SIZE = 10 * 1024 * 1024
+
+# Measure bytes, not characters: len() on a str undercounts multi-byte text
+if len(event_json.encode("utf-8")) > MAX_EVENT_SIZE:
+    raise ValueError("Event data too large")
+```
+
+### JSON Validation
+
+```python
+import json
+
+try:
+    data = json.loads(event_json)
+except json.JSONDecodeError as e:
+    raise ValueError(f"Invalid JSON: {e}") from e
+
+if not isinstance(data, dict):
+    raise ValueError("Event data must be a JSON object")
+```
+
+---
+
+## Secret Management
+
+### Environment Variables
+
+Secrets and settings (set in CI/CD settings; entries marked optional may be omitted):
+- `AI_REVIEW_TOKEN` - Gitea/GitHub API token (read/write access)
+- `OPENAI_API_KEY` - OpenAI API key
+- `OPENROUTER_API_KEY` - OpenRouter API key (optional)
+- `OLLAMA_HOST` - Ollama server URL (optional)
+
+### ❌ Never Commit Secrets
+
+```python
+# NEVER DO THIS
+api_key = "sk-1234567890abcdef"  # ❌ Hardcoded secret
+
+# NEVER DO THIS
+config = {
+    "openai_key": "sk-1234567890abcdef"  # ❌ Secret in config
+}
+```
+
+### ✅ Always Use Environment Variables
+
+```python
+# CORRECT
+import os
+
+api_key = os.environ.get("OPENAI_API_KEY")
+if not api_key:
+    raise ValueError("OPENAI_API_KEY not set")
+```
+
+### Secret Scanning
+
+The security scanner checks for:
+- Hardcoded API keys (pattern: `sk-[a-zA-Z0-9]{32,}`)
+- AWS keys (`AKIA[0-9A-Z]{16}`)
+- Private keys (`-----BEGIN.*PRIVATE KEY-----`)
+- Passwords in code (`password\s*=\s*["'][^"']+["']`)
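+
+A rough sketch of line-by-line matching with the four patterns listed above. The function name `find_secrets` and the rule labels are hypothetical; the project's scanner may structure its rules differently.
+
+```python
+import re
+
+# Patterns copied from the list above; the dictionary keys are illustrative labels.
+SECRET_PATTERNS = {
+    "api_key": re.compile(r"sk-[a-zA-Z0-9]{32,}"),
+    "aws_key": re.compile(r"AKIA[0-9A-Z]{16}"),
+    "private_key": re.compile(r"-----BEGIN.*PRIVATE KEY-----"),
+    "password": re.compile(r"password\s*=\s*[\"'][^\"']+[\"']"),
+}
+
+
+def find_secrets(text: str) -> list[tuple[int, str]]:
+    """Return (line_number, rule_label) for every line matching a pattern."""
+    hits = []
+    for lineno, line in enumerate(text.splitlines(), start=1):
+        for label, pattern in SECRET_PATTERNS.items():
+            if pattern.search(line):
+                hits.append((lineno, label))
+    return hits
+```
+
+Note that the `sk-` pattern requires at least 32 characters after the prefix, so a short placeholder such as `sk-1234567890abcdef` would not be flagged by it.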
+
+---
+
+## Security Scanning
+
+### Automated Scanning
+
+All code is scanned for vulnerabilities:
+
+1. **PR Reviews** - Automatic security scan on every PR
+2. **Pre-commit Hooks** - Local scanning before commit
+3. **Pattern-based Detection** - 17 built-in security rules
+
+### Running Manual Scans
+
+```bash
+# Scan a specific file (assumes the working directory is tools/ai-review so the security package is importable)
+python -c "
+from security.security_scanner import SecurityScanner
+scanner = SecurityScanner()
+with open('myfile.py') as fh:
+    findings = scanner.scan_content(fh.read(), 'myfile.py')
+for finding in findings:
+    print(f'{finding.severity}: {finding.description}')
+"
+
+# Scan a git diff
+git diff | python tools/ai-review/security/scan_diff.py
+```
+
+### Security Rule Categories
+
+- **A01: Broken Access Control** - Missing auth, insecure file operations
+- **A02: Cryptographic Failures** - Weak crypto, hardcoded secrets
+- **A03: Injection** - SQL injection, command injection, XSS
+- **A06: Vulnerable Components** - Insecure imports
+- **A07: Authentication Failures** - Weak auth mechanisms
+- **A09: Logging Failures** - Security logging issues
+
+### Severity Levels
+
+- **HIGH**: Critical vulnerabilities requiring immediate fix
+  - SQL injection, command injection, hardcoded secrets
+  
+- **MEDIUM**: Important issues requiring attention
+  - Missing input validation, weak crypto, XSS
+  
+- **LOW**: Best practice violations
+  - TODO comments with security keywords, eval() usage
+
+### CI Failure Threshold
+
+Configure in `config.yml`:
+
+```yaml
+review:
+  fail_on_severity: HIGH  # Fail CI if HIGH severity found
+```
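+
+One way the threshold could be evaluated, assuming severities are ordered LOW < MEDIUM < HIGH and that CI fails on any finding at or above the configured level. The names `SEVERITY_ORDER` and `should_fail` are hypothetical, and the "at or above" semantics are an assumption rather than documented behavior.
+
+```python
+from typing import Iterable
+
+# Assumed ordering of the three severity levels described above.
+SEVERITY_ORDER = {"LOW": 1, "MEDIUM": 2, "HIGH": 3}
+
+
+def should_fail(severities: Iterable[str], fail_on_severity: str = "HIGH") -> bool:
+    """Return True if any finding meets or exceeds the configured threshold."""
+    threshold = SEVERITY_ORDER[fail_on_severity]
+    return any(SEVERITY_ORDER[s] >= threshold for s in severities)
+```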
+
+---
+
+## Webhook Signature Validation
+
+### Future GitHub Integration
+
+When accepting webhooks directly (not through Gitea Actions):
+
+```python
+import os
+
+from utils.webhook_sanitizer import validate_webhook_signature
+
+# Validate webhook is from GitHub
+signature = request.headers.get("X-Hub-Signature-256")
+payload = request.get_data(as_text=True)
+secret = os.environ["WEBHOOK_SECRET"]
+
+if not validate_webhook_signature(payload, signature, secret):
+    return "Unauthorized", 401
+```
+
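+**Important:** Always validate webhook signatures to prevent:
+- Forged webhook events
+- Unauthorized access
+
+A valid signature alone does not stop replay attacks, because a captured request replays with its original signature. Also reject delivery IDs that were already processed (GitHub sends one in the `X-GitHub-Delivery` header).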
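+
+For reference, a minimal sketch of an HMAC-SHA256 check for a header of the form `sha256=<hex digest>`, which is the format GitHub uses for `X-Hub-Signature-256`. The function name `verify_signature` is hypothetical and takes the raw request bytes; the signature and behavior of the project's `validate_webhook_signature` may differ.
+
+```python
+import hashlib
+import hmac
+from typing import Optional
+
+
+def verify_signature(payload: bytes, signature_header: Optional[str], secret: str) -> bool:
+    """Check a 'sha256=<hex>' signature header against the raw payload."""
+    if not signature_header or not signature_header.startswith("sha256="):
+        return False
+    expected = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
+    received = signature_header[len("sha256="):]
+    # compare_digest takes time independent of where the values differ
+    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
+```
+
+The digest is computed over the exact bytes received; re-serializing parsed JSON before hashing would change the bytes and break verification.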
+
+---
+
+## Reporting Vulnerabilities
+
+### Security Issues
+
+If you discover a security vulnerability:
+
+1. **DO NOT** create a public issue
+2. Email security contact: [maintainer email]
+3. Include:
+   - Description of the vulnerability
+   - Steps to reproduce
+   - Potential impact
+   - Suggested fix (if available)
+
+### Response Timeline
+
+- **Acknowledgment**: Within 48 hours
+- **Initial Assessment**: Within 1 week
+- **Fix Development**: Depends on severity
+  - HIGH: Within 1 week
+  - MEDIUM: Within 2 weeks
+  - LOW: Next release cycle
+
+---
+
+## Security Checklist for Contributors
+
+Before submitting a PR:
+
+- [ ] No secrets in code or config files
+- [ ] All user inputs are validated
+- [ ] No SQL injection vulnerabilities
+- [ ] No command injection vulnerabilities
+- [ ] No XSS vulnerabilities
+- [ ] Sensitive data is sanitized before logging
+- [ ] Environment variables are not exposed in workflows
+- [ ] Repository format validation is used
+- [ ] Error messages don't leak sensitive info
+- [ ] Security scanner passes (no HIGH severity)
+
+---
+
+## Security Tools
+
+### Webhook Sanitizer
+
+Location: `tools/ai-review/utils/webhook_sanitizer.py`
+
+Functions:
+- `sanitize_webhook_data(data)` - Remove sensitive fields
+- `extract_minimal_context(event_type, data)` - Minimal payload
+- `validate_repository_format(repo)` - Validate owner/repo
+- `validate_webhook_signature(payload, sig, secret)` - Verify webhook
+
+### Safe Dispatch Utility
+
+Location: `tools/ai-review/utils/safe_dispatch.py`
+
+Usage:
+```bash
+python utils/safe_dispatch.py issue_comment owner/repo '{"action": "created", ...}'
+```
+
+Features:
+- Input validation
+- Size limits (10MB max)
+- Automatic sanitization
+- Comprehensive error handling
+
+### Security Scanner
+
+Location: `tools/ai-review/security/security_scanner.py`
+
+Features:
+- 17 built-in security rules
+- OWASP Top 10 coverage
+- CWE references
+- Severity classification
+- Pattern-based detection
+
+---
+
+## Best Practices Summary
+
+1. **Minimize Data**: Only pass necessary data to workflows
+2. **Validate Inputs**: Always validate external inputs
+3. **Sanitize Outputs**: Remove sensitive data before logging
+4. **Use Utilities**: Leverage `webhook_sanitizer.py` and `safe_dispatch.py`
+5. **Scan Code**: Run security scanner before committing
+6. **Rotate Secrets**: Regularly rotate API keys and tokens
+7. **Review Changes**: Manual security review for sensitive changes
+8. **Test Security**: Add tests for security-critical code
+
+---
+
+## Updates and Maintenance
+
+This security policy is reviewed quarterly and updated as needed.
+
+**Last Updated**: 2025-12-28  
+**Next Review**: 2026-03-28