added technical documentation

2026-01-13 17:20:52 +00:00
parent ff394c9250
commit b29822efc7
12 changed files with 4953 additions and 1 deletions
--- a/docs/living-ai/fact-extraction.md
+++ b/docs/living-ai/fact-extraction.md
@@ -0,0 +1,441 @@
+# Fact Extraction System Deep Dive
+
+The fact extraction system autonomously learns facts about users from their conversations with the bot.
+
+## Overview
+
+```
+┌──────────────────────────────────────────────────────────────────────────────┐
+│                          Fact Extraction Pipeline                             │
+└──────────────────────────────────────────────────────────────────────────────┘
+                                    │
+                                    ▼
+                    ┌──────────────────────────────────────┐
+                    │         Rate Limiter (30%)           │
+                    │  Only process ~30% of messages       │
+                    └──────────────────────────────────────┘
+                                    │
+                                    ▼
+                    ┌──────────────────────────────────────┐
+                    │      Extractability Check            │
+                    │  - Min 20 chars                      │
+                    │  - Not a command                     │
+                    │  - Not just greetings                │
+                    │  - Has enough text content           │
+                    └──────────────────────────────────────┘
+                                    │
+                                    ▼
+                    ┌──────────────────────────────────────┐
+                    │      AI Fact Extraction              │
+                    │  Extracts structured facts           │
+                    └──────────────────────────────────────┘
+                                    │
+                                    ▼
+                    ┌──────────────────────────────────────┐
+                    │      Deduplication                   │
+                    │  - Exact match check                 │
+                    │  - Substring check                   │
+                    │  - Word overlap check (70%)          │
+                    └──────────────────────────────────────┘
+                                    │
+                                    ▼
+                    ┌──────────────────────────────────────┐
+                    │      Validation & Storage            │
+                    │  Save valid, unique facts            │
+                    └──────────────────────────────────────┘
+```
+
+---
+
+## Fact Types
+
+| Type | Description | Examples |
+|------|-------------|----------|
+| `hobby` | Activities, interests, pastimes | "loves hiking", "plays guitar" |
+| `work` | Job, career, professional life | "works as a software engineer at Google" |
+| `family` | Family members, relationships | "has two younger sisters" |
+| `preference` | Likes, dislikes, preferences | "prefers dark roast coffee" |
+| `location` | Places they live, visit, are from | "lives in Amsterdam" |
+| `event` | Important life events | "recently got married" |
+| `relationship` | Personal relationships | "has a girlfriend named Sarah" |
+| `general` | Other facts that don't fit | "speaks three languages" |
+
+---
+
+## Fact Attributes
+
+Each extracted fact has:
+
+| Attribute | Type | Description |
+|-----------|------|-------------|
+| `type` | string | One of the fact types above |
+| `content` | string | The fact itself (third person) |
+| `confidence` | float | How certain the extraction is (0-1) |
+| `importance` | float | How significant the fact is (0-1) |
+| `temporal` | string | Time relevance: `past`, `present`, `future`, or `timeless` |
+
+### Confidence Levels
+
+| Level | Value | When to Use |
+|-------|-------|-------------|
+| Implied | 0.6 | Fact is suggested but not stated |
+| Stated | 0.8 | Fact is clearly mentioned |
+| Explicit | 1.0 | User directly stated the fact |
+
+### Importance Levels
+
+| Level | Value | Description |
+|-------|-------|-------------|
+| Trivial | 0.3 | Minor detail |
+| Normal | 0.5 | Standard fact |
+| Significant | 0.8 | Important information |
+| Very Important | 1.0 | Major life fact |
+
+### Temporal Relevance
+
+| Value | Description | Example |
+|-------|-------------|---------|
+| `past` | Happened before | "used to live in Paris" |
+| `present` | Currently true | "works at Microsoft" |
+| `future` | Planned/expected | "getting married next month" |
+| `timeless` | Always true | "was born in Japan" |
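
The attributes above map naturally onto a small record type. The sketch below is illustrative only: `ExtractedFact`, `FACT_TYPES`, and `TEMPORAL_VALUES` are hypothetical names, and the validation rules (membership checks plus a 0-1 range) are assumptions drawn from the tables in this section, not confirmed project code.

```python
from dataclasses import dataclass

# Allowed values copied from the Fact Types and Temporal Relevance tables.
FACT_TYPES = frozenset({
    "hobby", "work", "family", "preference",
    "location", "event", "relationship", "general",
})
TEMPORAL_VALUES = frozenset({"past", "present", "future", "timeless"})


@dataclass(frozen=True)
class ExtractedFact:
    """Hypothetical in-memory record for one extracted fact."""

    type: str
    content: str
    confidence: float
    importance: float
    temporal: str

    def __post_init__(self) -> None:
        # Assumed validation: reject unknown categories and out-of-range scores.
        if self.type not in FACT_TYPES:
            raise ValueError(f"unknown fact type: {self.type!r}")
        if self.temporal not in TEMPORAL_VALUES:
            raise ValueError(f"unknown temporal value: {self.temporal!r}")
        for name in ("confidence", "importance"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")
```

Rejecting bad values at construction time is one possible design; the real service may instead clamp or silently drop such facts.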
+
+---
+
+## Rate Limiting
+
+To limit AI API calls, extraction is attempted on only a random sample of messages:
+
+```python
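+# Only attempt extraction on ~30% of messages.
+# random.random() returns a value in [0.0, 1.0), so `>=` makes a rate of
+# 0.0 skip every message and a rate of 1.0 process every message.
+if random.random() >= settings.fact_extraction_rate:
+    return []  # Skip this message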
+```
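
The sampling decision can be isolated in a small helper so it is easy to test. This is a minimal sketch rather than the project's code: `should_attempt_extraction` and its `rng` parameter are hypothetical, and only the 0.3 default comes from this document.

```python
from __future__ import annotations

import random


def should_attempt_extraction(rate: float = 0.3, rng: random.Random | None = None) -> bool:
    """Return True for roughly `rate` of calls (hypothetical helper)."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    rng = rng or random.Random()
    # random() is in [0.0, 1.0), so rate=0.0 never passes and rate=1.0 always does.
    return rng.random() < rate
```

Passing a seeded `random.Random` makes the behaviour reproducible in tests, while production code can rely on the default generator.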
+
+**Configuration:**
+- `FACT_EXTRACTION_RATE` = 0.3 (default)
+- Can be adjusted from 0.0 (disabled) to 1.0 (every message)
+
+**Why Rate Limit?**
+- Reduces AI API costs
+- Not every message contains facts
+- Prevents redundant extractions
+- Spreads learning over time
+
+---
+
+## Extractability Checks
+
+Before a message is sent to the AI, it must pass all of the following filters:
+
+### Minimum Length
+```python
+MIN_MESSAGE_LENGTH = 20
+if len(content) < MIN_MESSAGE_LENGTH:
+    return False
+```
+
+### Alpha Ratio
+```python
+# Must be at least 50% alphabetic characters
+alpha_ratio = sum(c.isalpha() for c in content) / len(content)
+if alpha_ratio < 0.5:
+    return False
+```
+
+### Command Detection
+```python
+# Skip command-like messages
+if content.startswith(("!", "/", "?", ".")):
+    return False
+```
+
+### Short Phrase Filter
+```python
+short_phrases = [
+    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
+    "thanks", "thank you", "ok", "okay", "yes", "no",
+    "yeah", "nah", "lol", "lmao", "haha", "hehe",
+    "nice", "cool", "wow"
+]
+if content.lower().strip() in short_phrases:
+    return False
+```
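
Taken together, the four filters could be combined into a single predicate along these lines. The function name `is_extractable`, the order of the checks, and the initial `strip()` are assumptions made for illustration; the thresholds and phrase list are the ones shown above.

```python
MIN_MESSAGE_LENGTH = 20
MIN_ALPHA_RATIO = 0.5
COMMAND_PREFIXES = ("!", "/", "?", ".")
SHORT_PHRASES = frozenset({
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow",
})


def is_extractable(content: str) -> bool:
    """Return True if the message passes every filter (assumed ordering)."""
    text = content.strip()
    if len(text) < MIN_MESSAGE_LENGTH:
        return False
    if text.startswith(COMMAND_PREFIXES):
        return False
    # Kept for parity with the filters above; every listed phrase is shorter
    # than MIN_MESSAGE_LENGTH, so the length check already rejects them.
    if text.lower() in SHORT_PHRASES:
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    return alpha_ratio >= MIN_ALPHA_RATIO
```

Running the length check first also guarantees that `len(text)` is non-zero before the alpha ratio is computed.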
+
+---
+
+## AI Extraction Prompt
+
+The system sends the AI a prompt that lists the user's already-known facts, the extraction rules, and the expected output format:
+
+```
+You are a fact extraction assistant. Extract factual information 
+about the user from their message.
+
+ALREADY KNOWN FACTS:
+- [hobby] loves hiking
+- [work] works as senior engineer at Google
+
+RULES:
+1. Only extract CONCRETE facts, not opinions or transient states
+2. Skip if the fact is already known (listed above)
+3. Skip greetings, questions, or meta-conversation
+4. Skip vague statements like "I like stuff" - be specific
+5. Focus on: hobbies, work, family, preferences, locations, events, relationships
+6. Keep fact content concise (under 100 characters)
+7. Maximum 3 facts per message
+
+OUTPUT FORMAT:
+Return a JSON array of facts, or empty array [] if no extractable facts.
+```
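
The `ALREADY KNOWN FACTS` block in the prompt is filled in per user. A minimal sketch of that formatting step follows; `format_known_facts` is a hypothetical helper, and the `(none)` placeholder for users without facts is an assumption, since the document does not show that case.

```python
from __future__ import annotations


def format_known_facts(facts: list[tuple[str, str]]) -> str:
    """Render (fact_type, content) pairs in the prompt's bullet format."""
    header = "ALREADY KNOWN FACTS:"
    if not facts:
        # Assumed placeholder; the real prompt may omit the section instead.
        return f"{header}\n(none)"
    lines = [f"- [{fact_type}] {content}" for fact_type, content in facts]
    return header + "\n" + "\n".join(lines)
```

Listing known facts in the prompt lets the model apply rule 2 itself, before the separate deduplication step runs.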
+
+### Example Input/Output
+
+**Input:** "I just got promoted to senior engineer at Google last week!"
+
+**Output:**
+```json
+[
+  {
+    "type": "work",
+    "content": "works as senior engineer at Google",
+    "confidence": 1.0,
+    "importance": 0.8,
+    "temporal": "present"
+  },
+  {
+    "type": "event",
+    "content": "recently got promoted",
+    "confidence": 1.0,
+    "importance": 0.7,
+    "temporal": "past"
+  }
+]
+```
+
+**Input:** "hey what's up"
+
+**Output:**
+```json
+[]
+```
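
Model output is not guaranteed to be well-formed, so the reply needs defensive parsing before facts are stored. The sketch below shows one way to do that, assuming the JSON shape from the examples above; `parse_fact_response` is a hypothetical name, and dropping (rather than truncating) over-long or incomplete entries is an assumed policy.

```python
from __future__ import annotations

import json

MAX_FACTS_PER_MESSAGE = 3
MAX_CONTENT_LENGTH = 100  # prompt rule 6: content under 100 characters
REQUIRED_KEYS = {"type", "content", "confidence", "importance", "temporal"}


def parse_fact_response(raw: str) -> list[dict]:
    """Decode the model's reply; return [] if it is not a JSON array."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    facts = []
    for item in data:
        # Drop entries that are not objects or lack a documented attribute.
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            continue
        content = str(item["content"]).strip()
        if not content or len(content) >= MAX_CONTENT_LENGTH:
            continue
        facts.append({**item, "content": content})
    # Prompt rule 7: at most three facts per message.
    return facts[:MAX_FACTS_PER_MESSAGE]
```

Returning an empty list on any malformed reply treats a bad response the same as a message with no extractable facts.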
+
+---
+
+## Deduplication
+
+Before saving, each new fact is compared against the user's existing facts using three checks:
+
+### 1. Exact Match
+```python
+# `existing` is the lowercased content of an already stored fact
+new_lower = new_content.lower()
+if new_lower == existing:
+    return True  # Is duplicate
+```
+
+### 2. Substring Check
+```python
+# If one contains the other (for facts > 10 chars)
+if len(new_lower) > 10 and len(existing) > 10:
+    if new_lower in existing or existing in new_lower:
+        return True
+```
+
+### 3. Word Overlap (70% threshold)
+```python
+new_words = set(new_lower.split())
+existing_words = set(existing.split())
+
+if len(new_words) > 2 and len(existing_words) > 2:
+    overlap = len(new_words & existing_words)
+    min_len = min(len(new_words), len(existing_words))
+    if overlap / min_len > 0.7:
+        return True
+```
+
+**Examples:**
+- "loves hiking" vs "loves hiking" → **Duplicate** (exact)
+- "works as engineer at Google" vs "engineer at Google" → **Duplicate** (substring)
+- "has two younger sisters" vs "has two younger brothers" → **Duplicate** (3 of 4 words shared = 75%, above the 70% threshold, even though the facts differ)
+- "loves hiking" vs "enjoys cooking" → **Not duplicate**
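
The three checks can be combined into one function and run against the examples above. This is a sketch under stated assumptions: `is_duplicate` and its list-of-strings signature are hypothetical, and lowercasing and stripping both sides before comparison is assumed.

```python
from __future__ import annotations

OVERLAP_THRESHOLD = 0.7


def is_duplicate(new_content: str, existing_contents: list[str]) -> bool:
    """Return True if the new fact matches any existing fact."""
    new_lower = new_content.lower().strip()
    new_words = set(new_lower.split())
    for raw in existing_contents:
        existing = raw.lower().strip()
        # 1. Exact match
        if new_lower == existing:
            return True
        # 2. Substring check, only for facts longer than 10 characters
        if len(new_lower) > 10 and len(existing) > 10:
            if new_lower in existing or existing in new_lower:
                return True
        # 3. Word overlap, only when both facts have more than two words
        existing_words = set(existing.split())
        if len(new_words) > 2 and len(existing_words) > 2:
            overlap = len(new_words & existing_words)
            min_len = min(len(new_words), len(existing_words))
            if overlap / min_len > OVERLAP_THRESHOLD:
                return True
    return False
```

Because the overlap is divided by the smaller word count, short facts that differ in a single word can exceed the threshold.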
+
+---
+
+## Database Schema
+
+### UserFact Table
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `id` | Integer | Primary key |
+| `user_id` | Integer | Foreign key to users |
+| `fact_type` | String | Category (hobby, work, etc.) |
+| `fact_content` | String | The fact content |
+| `confidence` | Float | Extraction confidence (0-1) |
+| `source` | String | "auto_extraction" or "manual" |
+| `is_active` | Boolean | Whether fact is still valid |
+| `learned_at` | DateTime | When fact was learned |
+| `category` | String | Same as fact_type |
+| `importance` | Float | Importance level (0-1) |
+| `temporal_relevance` | String | past/present/future/timeless |
+| `extracted_from_message_id` | BigInteger | Discord message ID |
+| `extraction_context` | String | First 200 chars of source message |
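
For experimenting with the schema outside the bot, the columns above can be approximated in an in-memory SQLite database. Everything beyond the column names is an assumption: the table name `user_facts`, the mapping to SQLite types, the `CHECK` constraints, and the `is_active` default are illustrative, and the foreign key to the users table is omitted because that table is not described here.

```python
import sqlite3

# Column names follow the UserFact table; types and constraints are assumed.
SCHEMA = """
CREATE TABLE user_facts (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    fact_type TEXT NOT NULL,
    fact_content TEXT NOT NULL,
    confidence REAL CHECK (confidence BETWEEN 0 AND 1),
    source TEXT CHECK (source IN ('auto_extraction', 'manual')),
    is_active INTEGER DEFAULT 1,
    learned_at TEXT,
    category TEXT,
    importance REAL CHECK (importance BETWEEN 0 AND 1),
    temporal_relevance TEXT,
    extracted_from_message_id INTEGER,
    extraction_context TEXT
)
"""


def create_demo_db() -> sqlite3.Connection:
    """Create an in-memory database containing the sketch table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn
```

The project itself may define this table through an ORM, in which case the model class is the authoritative definition.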
+
+---
+
+## API Reference
+
+### FactExtractionService
+
+```python
+class FactExtractionService:
+    MIN_MESSAGE_LENGTH = 20
+    MAX_FACTS_PER_MESSAGE = 3
+    
+    def __init__(
+        self,
+        session: AsyncSession,
+        ai_service=None
+    ) -> None: ...
+
+    async def maybe_extract_facts(
+        self,
+        user: User,
+        message_content: str,
+        discord_message_id: int | None = None,
+    ) -> list[UserFact]:
+        """Rate-limited extraction."""
+        ...
+
+    async def extract_facts(
+        self,
+        user: User,
+        message_content: str,
+        discord_message_id: int | None = None,
+    ) -> list[UserFact]:
+        """Direct extraction (no rate limiting)."""
+        ...
+```
+
+---
+
+## Configuration
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `FACT_EXTRACTION_ENABLED` | `true` | Enable/disable fact extraction |
+| `FACT_EXTRACTION_RATE` | `0.3` | Probability of extraction (0-1) |
+
+---
+
+## Example Usage
+
+```python
+from daemon_boyfriend.services.fact_extraction_service import FactExtractionService
+
+async with get_session() as session:
+    fact_service = FactExtractionService(session, ai_service)
+    
+    # Rate-limited extraction (recommended for normal use)
+    new_facts = await fact_service.maybe_extract_facts(
+        user=user,
+        message_content="I just started learning Japanese!",
+        discord_message_id=123456789
+    )
+    
+    for fact in new_facts:
+        print(f"Learned: [{fact.fact_type}] {fact.fact_content}")
+    
+    # Direct extraction (skips rate limiting)
+    facts = await fact_service.extract_facts(
+        user=user,
+        message_content="I work at Microsoft as a PM"
+    )
+    
+    await session.commit()
+```
+
+---
+
+## Manual Fact Addition
+
+Users can also add facts manually:
+
+### !remember Command
+```
+User: !remember I'm allergic to peanuts
+
+Bot: Got it! I'll remember that you're allergic to peanuts.
+```
+
+These facts have:
+- `source = "manual"` instead of `"auto_extraction"`
+- `confidence = 1.0` (user stated directly)
+- `importance = 0.8` (user wanted it remembered)
+
+### Admin Command
+```
+Admin: !teachbot @user Works night shifts
+
+Bot: Got it! I've noted that about @user.
+```
+
+---
+
+## Fact Retrieval
+
+Facts are used in AI prompts for context:
+
+```python
+# Build user context including facts
+async def build_user_context(user: User) -> str:
+    facts = await get_active_facts(user)
+    
+    context = f"User: {user.custom_name or user.discord_name}\n"
+    context += "Known facts:\n"
+    
+    for fact in facts:
+        context += f"- {fact.fact_content}\n"
+    
+    return context
+```
+
+### Example Context
+```
+User: Alex
+Known facts:
+- works as senior engineer at Google
+- loves hiking on weekends
+- has two cats named Luna and Stella
+- prefers dark roast coffee
+- speaks English and Japanese
+```
+
+---
+
+## Design Considerations
+
+### Why Third Person?
+
+Facts are stored in third person ("loves hiking" not "I love hiking"):
+- Easier to inject into prompts
+- Consistent format
+- Works in any context
+
+### Why Rate Limit?
+
+- Not every message contains facts
+- AI API calls are expensive
+- Quality over quantity
+- Natural learning pace
+
+### Why Deduplication?
+
+- Prevents redundant storage
+- Keeps fact list clean
+- Reduces noise in prompts
+- Keeps one entry per fact, so the list shown by `!whatdoyouknow` has no repeats
+
+### Privacy Considerations
+
+- Facts can be viewed with `!whatdoyouknow`
+- Facts can be deleted with `!forgetme`
+- Extraction context is stored (can be audited)
+- Source message ID is stored (for reference)