openrabbit/docs/SECURITY.md

# Security Guidelines for OpenRabbit

This document outlines security best practices and requirements for OpenRabbit development.

## Table of Contents

1. [Workflow Security](#workflow-security)
2. [Webhook Data Handling](#webhook-data-handling)
3. [Input Validation](#input-validation)
4. [Secret Management](#secret-management)
5. [Security Scanning](#security-scanning)
6. [Reporting Vulnerabilities](#reporting-vulnerabilities)

---

## Workflow Security

### Principle: Minimize Data Exposure

**Problem:** GitHub Actions/Gitea Actions can expose sensitive data through:
- Environment variables visible in logs
- Debug output
- Error messages
- Process listings

**Solution:** Use minimal data in workflows and sanitize all inputs.

### ❌ Bad: Exposing Full Webhook Data

```yaml
# NEVER DO THIS - exposes all user data, emails, tokens
env:
  EVENT_JSON: ${{ toJSON(github.event) }}
run: |
  python process.py "$EVENT_JSON"
```

**Why this is dangerous:**
- Full webhook payloads can contain user emails, private repo URLs, installation tokens
- Data appears in workflow logs if debug mode is enabled
- Environment variables can be dumped by malicious code
- Violates principle of least privilege

### ✅ Good: Minimal Data Extraction

```yaml
# SAFE: Only extract necessary fields
run: |
  EVENT_DATA=$(cat <<EOF
  {
    "issue": {
      "number": ${{ github.event.issue.number }}
    },
    "comment": {
      "body": $(echo '${{ github.event.comment.body }}' | jq -Rs .)
    }
  }
  EOF
  )
  python utils/safe_dispatch.py issue_comment "$REPO" "$EVENT_DATA"
```

**Why this is safe:**
- Only includes necessary fields (number, body)
- Agents fetch full data from API with proper auth
- Reduces attack surface
- Follows data minimization principle

### Input Validation Requirements

All workflow inputs MUST be validated before use:

1. **Repository Format**
   ```bash
   # Validate owner/repo format
   if ! echo "$REPO" | grep -qE '^[a-zA-Z0-9_-]+/[a-zA-Z0-9_-]+$'; then
       echo "Error: Invalid repository format"
       exit 1
   fi
   ```

2. **Numeric Inputs**
   ```bash
   # Validate issue/PR numbers are numeric
   if ! [[ "$ISSUE_NUMBER" =~ ^[0-9]+$ ]]; then
       echo "Error: Invalid issue number"
       exit 1
   fi
   ```

3. **String Sanitization**
   ```bash
   # Use jq for JSON string escaping
   BODY=$(echo "$RAW_BODY" | jq -Rs .)
   ```

### Boolean Comparison

```bash
# ❌ WRONG: String comparison on boolean
if [ "$IS_PR" = "true" ]; then

# ✅ CORRECT: Use workflow expression
IS_PR="${{ gitea.event.issue.pull_request != null }}"
if [ "$IS_PR" = "true" ]; then
```

---

## Webhook Data Handling

### Using the Sanitization Utilities

Always use `utils/webhook_sanitizer.py` when handling webhook data:

```python
from utils.webhook_sanitizer import (
    sanitize_webhook_data,
    validate_repository_format,
    extract_minimal_context,
)

# Sanitize data before logging or storing
sanitized = sanitize_webhook_data(raw_event_data)

# Extract only necessary fields
minimal = extract_minimal_context(event_type, sanitized)

# Validate repository input
owner, repo = validate_repository_format(repo_string)
```

### Sensitive Fields (Automatically Redacted)

The sanitizer removes these fields:
- `email`, `private_email`, `email_addresses`
- `token`, `access_token`, `refresh_token`, `api_key`
- `secret`, `password`, `private_key`, `ssh_key`
- `phone`, `address`, `ssn`, `credit_card`
- `installation_id`, `node_id`

### Large Field Truncation

These fields are truncated to prevent log flooding:
- `body`: 500 characters
- `description`: 500 characters
- `message`: 500 characters

---

## Input Validation

### Repository Name Validation

```python
from utils.webhook_sanitizer import validate_repository_format

try:
    owner, repo = validate_repository_format(user_input)
except ValueError as e:
    logger.error(f"Invalid repository: {e}")
    return
```

**Checks performed:**
- Format is `owner/repo`
- No path traversal (`..`)
- No shell injection characters (`;`, `|`, `&`, `` ` ``, etc.)
- Non-empty owner and repo name

### Event Data Size Limits

```python
# Maximum event size: 10MB
MAX_EVENT_SIZE = 10 * 1024 * 1024

if len(event_json) > MAX_EVENT_SIZE:
    raise ValueError("Event data too large")
```

### JSON Validation

```python
try:
    data = json.loads(event_json)
except json.JSONDecodeError as e:
    raise ValueError(f"Invalid JSON: {e}")

if not isinstance(data, dict):
    raise ValueError("Event data must be a JSON object")
```

---

## Secret Management

### Environment Variables

Required secrets (set in CI/CD settings):
- `AI_REVIEW_TOKEN` - Gitea/GitHub API token (read/write access)
- `OPENAI_API_KEY` - OpenAI API key
- `OPENROUTER_API_KEY` - OpenRouter API key (optional)
- `OLLAMA_HOST` - Ollama server URL (optional)

### ❌ Never Commit Secrets

```python
# NEVER DO THIS
api_key = "sk-1234567890abcdef"  # ❌ Hardcoded secret

# NEVER DO THIS
config = {
    "openai_key": "sk-1234567890abcdef"  # ❌ Secret in config
}
```

### ✅ Always Use Environment Variables

```python
# CORRECT
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY not set")
```

### Secret Scanning

The security scanner checks for:
- Hardcoded API keys (pattern: `sk-[a-zA-Z0-9]{32,}`)
- AWS keys (`AKIA[0-9A-Z]{16}`)
- Private keys (`-----BEGIN.*PRIVATE KEY-----`)
- Passwords in code (`password\s*=\s*["'][^"']+["']`)

---

## Security Scanning

### Automated Scanning

All code is scanned for vulnerabilities:

1. **PR Reviews** - Automatic security scan on every PR
2. **Pre-commit Hooks** - Local scanning before commit
3. **Pattern-based Detection** - 17 built-in security rules

### Running Manual Scans

```bash
# Scan a specific file
python -c "
from security.security_scanner import SecurityScanner
s = SecurityScanner()
with open('myfile.py') as f:
    findings = s.scan_content(f.read(), 'myfile.py')
    for f in findings:
        print(f'{f.severity}: {f.description}')
"

# Scan a git diff
git diff | python tools/ai-review/security/scan_diff.py
```

### Security Rule Categories

- **A01: Broken Access Control** - Missing auth, insecure file operations
- **A02: Cryptographic Failures** - Weak crypto, hardcoded secrets
- **A03: Injection** - SQL injection, command injection, XSS
- **A06: Vulnerable Components** - Insecure imports
- **A07: Authentication Failures** - Weak auth mechanisms
- **A09: Logging Failures** - Security logging issues

### Severity Levels

- **HIGH**: Critical vulnerabilities requiring immediate fix
  - SQL injection, command injection, hardcoded secrets

- **MEDIUM**: Important issues requiring attention
  - Missing input validation, weak crypto, XSS

- **LOW**: Best practice violations
  - TODO comments with security keywords, eval() usage

### CI Failure Threshold

Configure in `config.yml`:

```yaml
review:
  fail_on_severity: HIGH  # Fail CI if HIGH severity found
```

---

## Webhook Signature Validation

### Future GitHub Integration

When accepting webhooks directly (not through Gitea Actions):

```python
from utils.webhook_sanitizer import validate_webhook_signature

# Validate webhook is from GitHub
signature = request.headers.get("X-Hub-Signature-256")
payload = request.get_data(as_text=True)
secret = os.environ["WEBHOOK_SECRET"]

if not validate_webhook_signature(payload, signature, secret):
    return "Unauthorized", 401
```

**Important:** Always validate webhook signatures to prevent:
- Replay attacks
- Forged webhook events
- Unauthorized access

---

## Reporting Vulnerabilities

### Security Issues

If you discover a security vulnerability:

1. **DO NOT** create a public issue
2. Email security contact: [maintainer email]
3. Include:
   - Description of the vulnerability
   - Steps to reproduce
   - Potential impact
   - Suggested fix (if available)

### Response Timeline

- **Acknowledgment**: Within 48 hours
- **Initial Assessment**: Within 1 week
- **Fix Development**: Depends on severity
  - HIGH: Within 1 week
  - MEDIUM: Within 2 weeks
  - LOW: Next release cycle

---

## Security Checklist for Contributors

Before submitting a PR:

- [ ] No secrets in code or config files
- [ ] All user inputs are validated
- [ ] No SQL injection vulnerabilities
- [ ] No command injection vulnerabilities
- [ ] No XSS vulnerabilities
- [ ] Sensitive data is sanitized before logging
- [ ] Environment variables are not exposed in workflows
- [ ] Repository format validation is used
- [ ] Error messages don't leak sensitive info
- [ ] Security scanner passes (no HIGH severity)

---

## Security Tools

### Webhook Sanitizer

Location: `tools/ai-review/utils/webhook_sanitizer.py`

Functions:
- `sanitize_webhook_data(data)` - Remove sensitive fields
- `extract_minimal_context(event_type, data)` - Minimal payload
- `validate_repository_format(repo)` - Validate owner/repo
- `validate_webhook_signature(payload, sig, secret)` - Verify webhook

### Safe Dispatch Utility

Location: `tools/ai-review/utils/safe_dispatch.py`

Usage:
```bash
python utils/safe_dispatch.py issue_comment owner/repo '{"action": "created", ...}'
```

Features:
- Input validation
- Size limits (10MB max)
- Automatic sanitization
- Comprehensive error handling

### Security Scanner

Location: `tools/ai-review/security/security_scanner.py`

Features:
- 17 built-in security rules
- OWASP Top 10 coverage
- CWE references
- Severity classification
- Pattern-based detection

---

## Best Practices Summary

1. **Minimize Data**: Only pass necessary data to workflows
2. **Validate Inputs**: Always validate external inputs
3. **Sanitize Outputs**: Remove sensitive data before logging
4. **Use Utilities**: Leverage `webhook_sanitizer.py` and `safe_dispatch.py`
5. **Scan Code**: Run security scanner before committing
6. **Rotate Secrets**: Regularly rotate API keys and tokens
7. **Review Changes**: Manual security review for sensitive changes
8. **Test Security**: Add tests for security-critical code

---

## Updates and Maintenance

This security policy is reviewed quarterly and updated as needed.

**Last Updated**: 2025-12-28
**Next Review**: 2026-03-28