
# Fact Extraction System Deep Dive

The fact extraction system autonomously learns facts about users from their conversations with the bot.

## Overview

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                          Fact Extraction Pipeline                             │
└──────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
                    ┌──────────────────────────────────────┐
                    │         Rate Limiter (30%)           │
                    │  Only process ~30% of messages       │
                    └──────────────────────────────────────┘
                                    │
                                    ▼
                    ┌──────────────────────────────────────┐
                    │      Extractability Check            │
                    │  - Min 20 chars                      │
                    │  - Not a command                     │
                    │  - Not just greetings                │
                    │  - Has enough text content           │
                    └──────────────────────────────────────┘
                                    │
                                    ▼
                    ┌──────────────────────────────────────┐
                    │      AI Fact Extraction              │
                    │  Extracts structured facts           │
                    └──────────────────────────────────────┘
                                    │
                                    ▼
                    ┌──────────────────────────────────────┐
                    │      Deduplication                   │
                    │  - Exact match check                 │
                    │  - Substring check                   │
                    │  - Word overlap check (70%)          │
                    └──────────────────────────────────────┘
                                    │
                                    ▼
                    ┌──────────────────────────────────────┐
                    │      Validation & Storage            │
                    │  Save valid, unique facts            │
                    └──────────────────────────────────────┘
```

---

## Fact Types

| Type | Description | Examples |
|------|-------------|----------|
| `hobby` | Activities, interests, pastimes | "loves hiking", "plays guitar" |
| `work` | Job, career, professional life | "works as a software engineer at Google" |
| `family` | Family members, relationships | "has two younger sisters" |
| `preference` | Likes, dislikes, preferences | "prefers dark roast coffee" |
| `location` | Places they live, visit, are from | "lives in Amsterdam" |
| `event` | Important life events | "recently got married" |
| `relationship` | Personal relationships | "has a girlfriend named Sarah" |
| `general` | Other facts that don't fit | "speaks three languages" |

---

## Fact Attributes

Each extracted fact has:

| Attribute | Type | Description |
|-----------|------|-------------|
| `type` | string | One of the fact types above |
| `content` | string | The fact itself (third person) |
| `confidence` | float | How certain the extraction is |
| `importance` | float | How significant the fact is |
| `temporal` | string | Time relevance |

### Confidence Levels

| Level | Value | When to Use |
|-------|-------|-------------|
| Implied | 0.6 | Fact is suggested but not stated |
| Stated | 0.8 | Fact is clearly mentioned |
| Explicit | 1.0 | User directly stated the fact |

### Importance Levels

| Level | Value | Description |
|-------|-------|-------------|
| Trivial | 0.3 | Minor detail |
| Normal | 0.5 | Standard fact |
| Significant | 0.8 | Important information |
| Very Important | 1.0 | Major life fact |

### Temporal Relevance

| Value | Description | Example |
|-------|-------------|---------|
| `past` | Happened before | "used to live in Paris" |
| `present` | Currently true | "works at Microsoft" |
| `future` | Planned/expected | "getting married next month" |
| `timeless` | Always true | "was born in Japan" |
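
As an illustration of the attribute tables above, the following is a minimal sketch of a container that enforces the documented value sets. The `ExtractedFact` class and its validation rules are hypothetical and are not taken from the project's code; only the field names, fact types, and temporal values come from this page.

```python
from dataclasses import dataclass

# Value sets copied from the tables above.
FACT_TYPES = ("hobby", "work", "family", "preference",
              "location", "event", "relationship", "general")
TEMPORAL_VALUES = ("past", "present", "future", "timeless")


@dataclass(frozen=True)
class ExtractedFact:
    """Hypothetical container mirroring the documented fact attributes."""
    type: str
    content: str
    confidence: float
    importance: float
    temporal: str

    def __post_init__(self) -> None:
        if self.type not in FACT_TYPES:
            raise ValueError(f"unknown fact type: {self.type!r}")
        if self.temporal not in TEMPORAL_VALUES:
            raise ValueError(f"unknown temporal value: {self.temporal!r}")
        # Assumption: both scores are floats in the closed range [0, 1].
        for name in ("confidence", "importance"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")


fact = ExtractedFact("location", "lives in Amsterdam", 1.0, 0.8, "present")
```

Rejecting unknown types and out-of-range scores at construction time is one way to keep malformed AI output from reaching storage.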

---

## Rate Limiting

To limit AI API calls, extraction is attempted on only a random sample of messages:

```python
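# Attempt extraction on ~30% of messages. random.random() returns a value in
# [0.0, 1.0), so `>=` guarantees that a rate of 0.0 skips every message.
if random.random() >= settings.fact_extraction_rate:
    return []  # Skip this message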
```

**Configuration:**
- `FACT_EXTRACTION_RATE` = 0.3 (default)
- Can be adjusted from 0.0 (disabled) to 1.0 (every message)

**Why Rate Limit?**
- Reduces AI API costs
- Not every message contains facts
- Prevents redundant extractions
- Spreads learning over time
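
The sampling behaviour can be checked with a small simulation. This is a sketch only: `should_attempt_extraction` is a hypothetical helper, and a seeded `random.Random` is used so the run is repeatable.

```python
import random


def should_attempt_extraction(rate: float, rng: random.Random) -> bool:
    """Return True with probability `rate` (hypothetical helper)."""
    # rng.random() is in [0.0, 1.0): rate=0.0 never passes, rate=1.0 always passes.
    return rng.random() < rate


rng = random.Random(42)  # seeded only to make the simulation repeatable
trials = 10_000
attempts = sum(should_attempt_extraction(0.3, rng) for _ in range(trials))
sampled_fraction = attempts / trials
print(f"extraction attempted on {sampled_fraction:.1%} of messages")
```

Over many messages the sampled fraction settles close to the configured rate, while any individual message is still skipped or processed at random.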

---

## Extractability Checks

Before a message is sent to the AI, it must pass each of the following filters:

### Minimum Length
```python
MIN_MESSAGE_LENGTH = 20
if len(content) < MIN_MESSAGE_LENGTH:
    return False
```

### Alpha Ratio
```python
# Must be at least 50% alphabetic characters
alpha_ratio = sum(c.isalpha() for c in content) / len(content)
if alpha_ratio < 0.5:
    return False
```

### Command Detection
```python
# Skip command-like messages
if content.startswith(("!", "/", "?", ".")):
    return False
```

### Short Phrase Filter
```python
short_phrases = [
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow"
]
if content.lower().strip() in short_phrases:
    return False
```
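
The four checks above can be combined into a single predicate. The sketch below is one possible arrangement; the function name `is_extractable`, the constant names, and the order of the checks are assumptions, not the project's actual code.

```python
MIN_MESSAGE_LENGTH = 20
MIN_ALPHA_RATIO = 0.5
COMMAND_PREFIXES = ("!", "/", "?", ".")
SHORT_PHRASES = frozenset({
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow",
})


def is_extractable(content: str) -> bool:
    """Return True if a message passes every pre-extraction filter."""
    text = content.strip()
    if len(text) < MIN_MESSAGE_LENGTH:
        return False  # too short to carry a fact
    if text.startswith(COMMAND_PREFIXES):
        return False  # looks like a bot command
    if text.lower() in SHORT_PHRASES:
        return False  # bare greeting or acknowledgement
    # Length is at least MIN_MESSAGE_LENGTH here, so the division is safe.
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    return alpha_ratio >= MIN_ALPHA_RATIO


print(is_extractable("I just started learning Japanese!"))
print(is_extractable("!remember I'm allergic to peanuts"))
```

With a 20-character minimum, every phrase in the list is already rejected by the length check, so in this arrangement the phrase filter only matters if the minimum is lowered.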

---

## AI Extraction Prompt

The system sends the AI a prompt that lists the user's already-known facts, followed by the extraction rules and the expected output format:

```
You are a fact extraction assistant. Extract factual information
about the user from their message.

ALREADY KNOWN FACTS:
- [hobby] loves hiking
- [work] works as senior engineer at Google

RULES:
1. Only extract CONCRETE facts, not opinions or transient states
2. Skip if the fact is already known (listed above)
3. Skip greetings, questions, or meta-conversation
4. Skip vague statements like "I like stuff" - be specific
5. Focus on: hobbies, work, family, preferences, locations, events, relationships
6. Keep fact content concise (under 100 characters)
7. Maximum 3 facts per message

OUTPUT FORMAT:
Return a JSON array of facts, or empty array [] if no extractable facts.
```
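
The `ALREADY KNOWN FACTS` section of the prompt can be generated from stored facts. The following is a minimal sketch: `format_known_facts` is a hypothetical helper, facts are passed as plain `(type, content)` tuples, and the `(none)` placeholder for an empty list is an assumption.

```python
def format_known_facts(facts: list[tuple[str, str]]) -> str:
    """Render (fact_type, content) pairs in the '- [type] content' style."""
    header = "ALREADY KNOWN FACTS:"
    if not facts:
        # Assumption: an explicit placeholder is emitted when nothing is known.
        return f"{header}\n(none)"
    lines = [f"- [{fact_type}] {content}" for fact_type, content in facts]
    return "\n".join([header, *lines])


known = [
    ("hobby", "loves hiking"),
    ("work", "works as senior engineer at Google"),
]
print(format_known_facts(known))
```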

### Example Input/Output

**Input:** "I just got promoted to senior engineer at Google last week!"

**Output:**
```json
[
  {
    "type": "work",
    "content": "works as senior engineer at Google",
    "confidence": 1.0,
    "importance": 0.8,
    "temporal": "present"
  },
  {
    "type": "event",
    "content": "recently got promoted",
    "confidence": 1.0,
    "importance": 0.7,
    "temporal": "past"
  }
]
```

**Input:** "hey what's up"

**Output:**
```json
[]
```
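
Because the model's reply is free text, it is worth parsing defensively before anything is stored. The sketch below shows one way to do that; `parse_extraction_response` is a hypothetical helper, and the limits it enforces (3 facts, content under 100 characters) are taken from the prompt rules above.

```python
import json

FACT_TYPES = ("hobby", "work", "family", "preference",
              "location", "event", "relationship", "general")
MAX_FACTS_PER_MESSAGE = 3
MAX_CONTENT_LENGTH = 100


def parse_extraction_response(raw: str) -> list[dict]:
    """Parse the AI reply and drop entries that break the prompt rules."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # not JSON at all: treat as "no facts"
    if not isinstance(data, list):
        return []
    facts = []
    for item in data:
        if not isinstance(item, dict):
            continue
        content = str(item.get("content", "")).strip()
        if item.get("type") not in FACT_TYPES or not content:
            continue
        if len(content) >= MAX_CONTENT_LENGTH:
            continue  # rule 6: content must stay under 100 characters
        facts.append({**item, "content": content})
    return facts[:MAX_FACTS_PER_MESSAGE]  # rule 7: at most 3 facts


reply = '[{"type": "work", "content": "works as senior engineer at Google", "confidence": 1.0, "importance": 0.8, "temporal": "present"}]'
print(parse_extraction_response(reply))
```

Returning an empty list on malformed output keeps a bad reply from interrupting message handling; whether to log such replies is left open here.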

---

## Deduplication

Before saving, facts are checked for duplicates:

### 1. Exact Match
```python
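# existing_content is assumed here to be a collection of lowercased fact
# strings, so `in` is a membership test rather than a substring test
if new_content.lower() in existing_content:
    return True  # Is duplicate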
```

### 2. Substring Check
```python
# If one contains the other (for facts > 10 chars)
if len(new_lower) > 10 and len(existing) > 10:
    if new_lower in existing or existing in new_lower:
        return True
```

### 3. Word Overlap (70% threshold)
```python
new_words = set(new_lower.split())
existing_words = set(existing.split())

if len(new_words) > 2 and len(existing_words) > 2:
    overlap = len(new_words & existing_words)
    min_len = min(len(new_words), len(existing_words))
    if overlap / min_len > 0.7:
        return True
```

**Examples:**
- "loves hiking" vs "loves hiking" → **Duplicate** (exact)
- "works as engineer at Google" vs "engineer at Google" → **Duplicate** (substring)
- "has two younger sisters" vs "has two younger brothers" → **Duplicate** (3 of 4 words overlap = 75%, above the 70% threshold), even though the two facts differ
- "loves hiking" vs "enjoys cooking" → **Not duplicate**
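
The three checks can be combined into one function and run against the examples above. This is a sketch under stated assumptions: `is_duplicate` is a hypothetical name, and existing facts are passed as a plain list of strings rather than database rows.

```python
def is_duplicate(new_content: str, existing_contents: list[str]) -> bool:
    """Apply the exact, substring, and word-overlap checks in order."""
    new_lower = new_content.lower().strip()
    new_words = set(new_lower.split())
    for existing in (e.lower().strip() for e in existing_contents):
        # 1. Exact match
        if new_lower == existing:
            return True
        # 2. Substring check, only for facts longer than 10 characters
        if len(new_lower) > 10 and len(existing) > 10:
            if new_lower in existing or existing in new_lower:
                return True
        # 3. Word overlap above 70% of the shorter fact's word count
        existing_words = set(existing.split())
        if len(new_words) > 2 and len(existing_words) > 2:
            overlap = len(new_words & existing_words)
            if overlap / min(len(new_words), len(existing_words)) > 0.7:
                return True
    return False


print(is_duplicate("works as engineer at Google", ["engineer at Google"]))
print(is_duplicate("loves hiking", ["enjoys cooking"]))
```

Dividing by the shorter fact's word count makes the check aggressive: facts that differ in a single word can still be treated as duplicates, as the sisters/brothers example shows.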

---

## Database Schema

### UserFact Table

| Column | Type | Description |
|--------|------|-------------|
| `id` | Integer | Primary key |
| `user_id` | Integer | Foreign key to users |
| `fact_type` | String | Category (hobby, work, etc.) |
| `fact_content` | String | The fact content |
| `confidence` | Float | Extraction confidence (0-1) |
| `source` | String | "auto_extraction" or "manual" |
| `is_active` | Boolean | Whether fact is still valid |
| `learned_at` | DateTime | When fact was learned |
| `category` | String | Same as fact_type |
| `importance` | Float | Importance level (0-1) |
| `temporal_relevance` | String | past/present/future/timeless |
| `extracted_from_message_id` | BigInteger | Discord message ID |
| `extraction_context` | String | First 200 chars of source message |
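
To make the column list concrete, here is a sketch that creates an equivalent table in an in-memory SQLite database and stores one fact. The table name `user_facts`, the SQLite type mapping, and the defaults are assumptions for illustration; the project's actual model definition is not shown on this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_facts (
        id INTEGER PRIMARY KEY,
        user_id INTEGER,
        fact_type TEXT,
        fact_content TEXT,
        confidence REAL,
        source TEXT,
        is_active INTEGER DEFAULT 1,               -- Boolean stored as 0/1
        learned_at TEXT DEFAULT CURRENT_TIMESTAMP, -- DateTime stored as text
        category TEXT,
        importance REAL,
        temporal_relevance TEXT,
        extracted_from_message_id INTEGER,
        extraction_context TEXT
    )
""")

message = "I just got promoted to senior engineer at Google last week!"
conn.execute(
    """
    INSERT INTO user_facts (
        user_id, fact_type, fact_content, confidence, source, category,
        importance, temporal_relevance, extracted_from_message_id,
        extraction_context
    ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """,
    (1, "work", "works as senior engineer at Google", 1.0, "auto_extraction",
     "work", 0.8, "present", 123456789, message[:200]),  # first 200 chars only
)

active = conn.execute(
    "SELECT fact_type, fact_content FROM user_facts "
    "WHERE user_id = ? AND is_active = 1",
    (1,),
).fetchall()
print(active)
```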

---

## API Reference

### FactExtractionService

```python
# Signatures only; method bodies are elided.
class FactExtractionService:
    MIN_MESSAGE_LENGTH = 20
    MAX_FACTS_PER_MESSAGE = 3

    def __init__(self, session: AsyncSession, ai_service=None) -> None: ...

    async def maybe_extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]:
        """Rate-limited extraction."""
        ...

    async def extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]:
        """Direct extraction (no rate limiting)."""
        ...
```

---

## Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `FACT_EXTRACTION_ENABLED` | `true` | Enable/disable fact extraction |
| `FACT_EXTRACTION_RATE` | `0.3` | Probability of extraction (0-1) |

---

## Example Usage

```python
from daemon_boyfriend.services.fact_extraction_service import FactExtractionService

async with get_session() as session:
    fact_service = FactExtractionService(session, ai_service)

    # Rate-limited extraction (recommended for normal use)
    new_facts = await fact_service.maybe_extract_facts(
        user=user,
        message_content="I just started learning Japanese!",
        discord_message_id=123456789
    )

    for fact in new_facts:
        print(f"Learned: [{fact.fact_type}] {fact.fact_content}")

    # Direct extraction (skips rate limiting)
    facts = await fact_service.extract_facts(
        user=user,
        message_content="I work at Microsoft as a PM"
    )

    await session.commit()
```

---

## Manual Fact Addition

Users can also add facts manually:

### !remember Command
```
User: !remember I'm allergic to peanuts

Bot: Got it! I'll remember that you're allergic to peanuts.
```

These facts have:
- `source = "manual"` instead of `"auto_extraction"`
- `confidence = 1.0` (user stated directly)
- `importance = 0.8` (user wanted it remembered)

### Admin Command
```
Admin: !teachbot @user Works night shifts

Bot: Got it! I've noted that about @user.
```

---

## Fact Retrieval

Facts are used in AI prompts for context:

```python
# Build user context including facts
async def build_user_context(user: User) -> str:
    facts = await get_active_facts(user)

    # Collect the lines first, then join once instead of repeated concatenation
    lines = [
        f"User: {user.custom_name or user.discord_name}",
        "Known facts:",
    ]
    lines.extend(f"- {fact.fact_content}" for fact in facts)

    return "\n".join(lines) + "\n"
```

### Example Context
```
User: Alex
Known facts:
- works as senior engineer at Google
- loves hiking on weekends
- has two cats named Luna and Stella
- prefers dark roast coffee
- speaks English and Japanese
```

---

## Design Considerations

### Why Third Person?

Facts are stored in third person ("loves hiking" not "I love hiking"):
- Easier to inject into prompts
- Consistent format
- Works in any context

### Why Rate Limit?

- Not every message contains facts
- AI API calls are expensive
- Quality over quantity
- Natural learning pace

### Why Deduplication?

- Prevents redundant storage
- Keeps fact list clean
- Reduces noise in prompts
- Stores each fact only once, which limits how much personal data is retained

### Privacy Considerations

- Facts can be viewed with `!whatdoyouknow`
- Facts can be deleted with `!forgetme`
- Extraction context is stored (can be audited)
- Source message ID is stored (for reference)