loyal_companion/docs/living-ai/fact-extraction.md
2026-01-13 17:20:52 +00:00


# Fact Extraction System Deep Dive
The fact extraction system autonomously learns facts about users from their conversations with the bot.
## Overview
```
┌──────────────────────────────────────┐
│       Fact Extraction Pipeline       │
└──────────────────────────────────────┘
                    │
                    ▼
┌──────────────────────────────────────┐
│          Rate Limiter (30%)          │
│    Only process ~30% of messages     │
└──────────────────────────────────────┘
                    │
                    ▼
┌──────────────────────────────────────┐
│         Extractability Check         │
│  - Min 20 chars                      │
│  - Not a command                     │
│  - Not just greetings                │
│  - Has enough text content           │
└──────────────────────────────────────┘
                    │
                    ▼
┌──────────────────────────────────────┐
│          AI Fact Extraction          │
│      Extracts structured facts       │
└──────────────────────────────────────┘
                    │
                    ▼
┌──────────────────────────────────────┐
│            Deduplication             │
│  - Exact match check                 │
│  - Substring check                   │
│  - Word overlap check (70%)          │
└──────────────────────────────────────┘
                    │
                    ▼
┌──────────────────────────────────────┐
│         Validation & Storage         │
│       Save valid, unique facts       │
└──────────────────────────────────────┘
```
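The stages in the diagram can be read as a chain of early exits. The following is a minimal sketch of that composition, not the project's actual implementation: `run_pipeline` and its parameters are hypothetical names, facts are reduced to plain strings, and each stage is passed in as a callable so the control flow stays visible.

```python
import random
from typing import Callable


def run_pipeline(
    message: str,
    known_facts: list[str],
    *,
    rate: float,
    is_extractable: Callable[[str], bool],
    extract: Callable[[str], list[str]],
    is_duplicate: Callable[[str, list[str]], bool],
    draw: Callable[[], float] = random.random,
) -> list[str]:
    """Hypothetical composition of the five stages shown in the diagram."""
    # Stage 1: rate limiter, skip most messages.
    if draw() >= rate:
        return []
    # Stage 2: cheap local checks before any AI call is made.
    if not is_extractable(message):
        return []
    # Stage 3: AI extraction (stubbed here as a plain callable).
    candidates = extract(message)
    # Stages 4 and 5: keep only facts that are not duplicates of known facts
    # or of facts already accepted from this same message.
    accepted: list[str] = []
    for fact in candidates:
        if not is_duplicate(fact, known_facts + accepted):
            accepted.append(fact)
    return accepted
```

Ordering the cheap checks before the AI call means a skipped or filtered message costs no API request.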
---
## Fact Types
| Type | Description | Examples |
|------|-------------|----------|
| `hobby` | Activities, interests, pastimes | "loves hiking", "plays guitar" |
| `work` | Job, career, professional life | "works as a software engineer at Google" |
| `family` | Family members, relationships | "has two younger sisters" |
| `preference` | Likes, dislikes, preferences | "prefers dark roast coffee" |
| `location` | Places they live, visit, are from | "lives in Amsterdam" |
| `event` | Important life events | "recently got married" |
| `relationship` | Personal relationships | "has a girlfriend named Sarah" |
| `general` | Other facts that don't fit | "speaks three languages" |
---
## Fact Attributes
Each extracted fact has:
| Attribute | Type | Description |
|-----------|------|-------------|
| `type` | string | One of the fact types above |
| `content` | string | The fact itself (third person) |
| `confidence` | float | How certain the extraction is |
| `importance` | float | How significant the fact is |
| `temporal` | string | Time relevance |
### Confidence Levels
| Level | Value | When to Use |
|-------|-------|-------------|
| Implied | 0.6 | Fact is suggested but not stated |
| Stated | 0.8 | Fact is clearly mentioned |
| Explicit | 1.0 | User directly stated the fact |
### Importance Levels
| Level | Value | Description |
|-------|-------|-------------|
| Trivial | 0.3 | Minor detail |
| Normal | 0.5 | Standard fact |
| Significant | 0.8 | Important information |
| Very Important | 1.0 | Major life fact |
### Temporal Relevance
| Value | Description | Example |
|-------|-------------|---------|
| `past` | Happened before | "used to live in Paris" |
| `present` | Currently true | "works at Microsoft" |
| `future` | Planned/expected | "getting married next month" |
| `timeless` | Always true | "was born in Japan" |
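
The attributes above map naturally onto a small typed record. This is a sketch only: the class names `FactType`, `Temporal`, and `ExtractedFact` are hypothetical, and the defaults (0.8 confidence, 0.5 importance, `present`) are assumptions chosen from the level tables, not values confirmed by the source. The 0 to 1 range check follows the ranges given in the database schema.

```python
from dataclasses import dataclass
from enum import Enum


class FactType(str, Enum):
    HOBBY = "hobby"
    WORK = "work"
    FAMILY = "family"
    PREFERENCE = "preference"
    LOCATION = "location"
    EVENT = "event"
    RELATIONSHIP = "relationship"
    GENERAL = "general"


class Temporal(str, Enum):
    PAST = "past"
    PRESENT = "present"
    FUTURE = "future"
    TIMELESS = "timeless"


@dataclass(frozen=True)
class ExtractedFact:
    type: FactType
    content: str                            # third person, e.g. "loves hiking"
    confidence: float = 0.8                 # assumed default: "Stated"
    importance: float = 0.5                 # assumed default: "Normal"
    temporal: Temporal = Temporal.PRESENT   # assumed default

    def __post_init__(self) -> None:
        # Both scores are documented as floats between 0 and 1.
        for name in ("confidence", "importance"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")
```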
---
## Rate Limiting
To limit AI API calls, extraction is attempted on only a random sample of messages:
```python
# Only attempt extraction on ~30% of messages
if random.random() > settings.fact_extraction_rate:
    return []  # Skip this message
```
**Configuration:**
- `FACT_EXTRACTION_RATE` = 0.3 (default)
- Can be adjusted from 0.0 (disabled) to 1.0 (every message)
**Why Rate Limit?**
- Reduces AI API costs
- Not every message contains facts
- Prevents redundant extractions
- Spreads learning over time
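
The gate can also be written as a small testable helper. This is a sketch under assumptions: `should_attempt_extraction` is a hypothetical name, and the optional `rng` parameter exists only so the behaviour can be checked with a seeded generator. It expresses the same check positively and uses `<`, so a rate of 0.0 never fires and a rate of 1.0 always fires, since `random()` returns values in the half-open range [0.0, 1.0).

```python
import random
from typing import Optional


def should_attempt_extraction(rate: float, rng: Optional[random.Random] = None) -> bool:
    """Return True for roughly `rate` of all calls (hypothetical helper)."""
    draw = rng.random() if rng is not None else random.random()
    # draw is in [0.0, 1.0): rate=0.0 is never exceeded, rate=1.0 always is.
    return draw < rate
```

Over many messages the fraction of `True` results approaches the configured rate; any single message is still a coin flip, so a fact mentioned only once may be missed.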
---
## Extractability Checks
Before a message is sent to the AI, it must pass the following filters:
### Minimum Length
```python
MIN_MESSAGE_LENGTH = 20
if len(content) < MIN_MESSAGE_LENGTH:
    return False
```
### Alpha Ratio
```python
# Must be at least 50% alphabetic characters
alpha_ratio = sum(c.isalpha() for c in content) / len(content)
if alpha_ratio < 0.5:
    return False
```
### Command Detection
```python
# Skip command-like messages
if content.startswith(("!", "/", "?", ".")):
    return False
```
### Short Phrase Filter
```python
short_phrases = [
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow"
]
if content.lower().strip() in short_phrases:
    return False
```
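
Taken together, the four checks might be combined into a single predicate. This is a sketch, not the project's code: the function name `is_extractable`, the constant names, and the order of the checks are assumptions. The length check runs first so the alpha ratio never divides by zero.

```python
MIN_MESSAGE_LENGTH = 20
COMMAND_PREFIXES = ("!", "/", "?", ".")
SHORT_PHRASES = {
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow",
}


def is_extractable(content: str) -> bool:
    """Return True if a message is worth sending to the AI (hypothetical helper)."""
    text = content.strip()
    # Minimum length: also guarantees len(text) > 0 for the ratio below.
    if len(text) < MIN_MESSAGE_LENGTH:
        return False
    # At least half of the characters must be alphabetic.
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    if alpha_ratio < 0.5:
        return False
    # Command-like messages are handled elsewhere.
    if text.startswith(COMMAND_PREFIXES):
        return False
    # Greetings and one-word reactions carry no facts.
    if text.lower() in SHORT_PHRASES:
        return False
    return True
```

With a 20-character minimum, every phrase in the list is already rejected by the length check, so in this ordering the phrase filter only matters if the minimum is lowered.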
---
## AI Extraction Prompt
The system sends the AI a prompt that lists the facts already known about the user, followed by the extraction rules and the expected output format:
```
You are a fact extraction assistant. Extract factual information
about the user from their message.
ALREADY KNOWN FACTS:
- [hobby] loves hiking
- [work] works as senior engineer at Google
RULES:
1. Only extract CONCRETE facts, not opinions or transient states
2. Skip if the fact is already known (listed above)
3. Skip greetings, questions, or meta-conversation
4. Skip vague statements like "I like stuff" - be specific
5. Focus on: hobbies, work, family, preferences, locations, events, relationships
6. Keep fact content concise (under 100 characters)
7. Maximum 3 facts per message
OUTPUT FORMAT:
Return a JSON array of facts, or empty array [] if no extractable facts.
```
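
One way to assemble this prompt is to fill a template with the user's current facts. The sketch below is illustrative: `PROMPT_TEMPLATE`, `build_extraction_prompt`, and the `(none)` placeholder for users without known facts are assumptions, not names taken from the codebase.

```python
PROMPT_TEMPLATE = """\
You are a fact extraction assistant. Extract factual information
about the user from their message.
ALREADY KNOWN FACTS:
{known_facts}
RULES:
1. Only extract CONCRETE facts, not opinions or transient states
2. Skip if the fact is already known (listed above)
3. Skip greetings, questions, or meta-conversation
4. Skip vague statements like "I like stuff" - be specific
5. Focus on: hobbies, work, family, preferences, locations, events, relationships
6. Keep fact content concise (under 100 characters)
7. Maximum 3 facts per message
OUTPUT FORMAT:
Return a JSON array of facts, or empty array [] if no extractable facts.
"""


def build_extraction_prompt(known_facts: list[tuple[str, str]]) -> str:
    """Render the prompt from (fact_type, fact_content) pairs (hypothetical helper)."""
    if known_facts:
        lines = "\n".join(f"- [{fact_type}] {content}" for fact_type, content in known_facts)
    else:
        lines = "(none)"  # assumed placeholder for a user with no stored facts
    return PROMPT_TEMPLATE.format(known_facts=lines)
```

Listing known facts in the prompt lets the model skip repeats before the deduplication stage has to catch them.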
### Example Input/Output
**Input:** "I just got promoted to senior engineer at Google last week!"
**Output:**
```json
[
  {
    "type": "work",
    "content": "works as senior engineer at Google",
    "confidence": 1.0,
    "importance": 0.8,
    "temporal": "present"
  },
  {
    "type": "event",
    "content": "recently got promoted",
    "confidence": 1.0,
    "importance": 0.7,
    "temporal": "past"
  }
]
```
**Input:** "hey what's up"
**Output:**
```json
[]
```
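
Model output should not be trusted blindly, so the JSON is worth validating before storage. The following sketch shows one defensive way to do that. `parse_facts`, `_score`, and the fallback values (0.8 confidence, 0.5 importance, `present`) are assumptions; the limits of 3 facts and 100 characters come from the prompt rules.

```python
import json

FACT_TYPES = {"hobby", "work", "family", "preference",
              "location", "event", "relationship", "general"}
TEMPORAL_VALUES = {"past", "present", "future", "timeless"}
MAX_FACTS_PER_MESSAGE = 3
MAX_CONTENT_LENGTH = 100


def _score(value, default: float) -> float:
    """Coerce a score to a float clamped to [0, 1], or fall back to default."""
    try:
        return min(max(float(value), 0.0), 1.0)
    except (TypeError, ValueError):
        return default


def parse_facts(raw: str) -> list[dict]:
    """Parse the model's reply into validated fact dicts (hypothetical helper)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed reply: treat as "no facts"
    if not isinstance(data, list):
        return []
    facts: list[dict] = []
    for item in data:
        if not isinstance(item, dict):
            continue
        fact_type, content = item.get("type"), item.get("content")
        if not (isinstance(fact_type, str) and fact_type in FACT_TYPES):
            continue
        if not isinstance(content, str) or not content.strip():
            continue
        if len(content.strip()) > MAX_CONTENT_LENGTH:
            continue
        temporal = item.get("temporal")
        facts.append({
            "type": fact_type,
            "content": content.strip(),
            "confidence": _score(item.get("confidence"), 0.8),  # assumed default
            "importance": _score(item.get("importance"), 0.5),  # assumed default
            "temporal": temporal if isinstance(temporal, str)
            and temporal in TEMPORAL_VALUES else "present",     # assumed default
        })
        if len(facts) == MAX_FACTS_PER_MESSAGE:
            break
    return facts
```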
---
## Deduplication
Before saving, facts are checked for duplicates:
### 1. Exact Match
```python
if new_content.lower() in existing_content:
    return True  # Is duplicate
```
### 2. Substring Check
```python
# If one contains the other (for facts > 10 chars)
if len(new_lower) > 10 and len(existing) > 10:
    if new_lower in existing or existing in new_lower:
        return True
```
### 3. Word Overlap (70% threshold)
```python
new_words = set(new_lower.split())
existing_words = set(existing.split())
if len(new_words) > 2 and len(existing_words) > 2:
    overlap = len(new_words & existing_words)
    min_len = min(len(new_words), len(existing_words))
    if overlap / min_len > 0.7:
        return True
```
**Examples:**
- "loves hiking" vs "loves hiking" → **Duplicate** (exact)
- "works as engineer at Google" vs "engineer at Google" → **Duplicate** (substring)
- "has two younger sisters" vs "has two younger brothers" → **Duplicate** (3 of 4 words overlap = 75%, above the 70% threshold)
- "loves hiking" vs "enjoys cooking" → **Not duplicate**
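
The three checks can be combined into one function and run against the examples above. This is a sketch: the name `is_duplicate` and the assumption that existing facts arrive as a list of strings are not confirmed by the source, and the exact-match step is written here as string equality.

```python
def is_duplicate(new_content: str, existing_facts: list[str]) -> bool:
    """Return True if new_content duplicates any existing fact (hypothetical helper)."""
    new_lower = new_content.lower().strip()
    new_words = set(new_lower.split())
    for raw in existing_facts:
        existing = raw.lower().strip()
        # 1. Exact match (case-insensitive).
        if new_lower == existing:
            return True
        # 2. Substring check, only for facts longer than 10 characters.
        if len(new_lower) > 10 and len(existing) > 10:
            if new_lower in existing or existing in new_lower:
                return True
        # 3. Word overlap, only when both facts have more than 2 distinct words.
        existing_words = set(existing.split())
        if len(new_words) > 2 and len(existing_words) > 2:
            overlap = len(new_words & existing_words)
            min_len = min(len(new_words), len(existing_words))
            if overlap / min_len > 0.7:
                return True
    return False
```

The overlap rule compares words, not meaning, which is why the sisters/brothers pair counts as a duplicate even though the two facts differ.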
---
## Database Schema
### UserFact Table
| Column | Type | Description |
|--------|------|-------------|
| `id` | Integer | Primary key |
| `user_id` | Integer | Foreign key to users |
| `fact_type` | String | Category (hobby, work, etc.) |
| `fact_content` | String | The fact content |
| `confidence` | Float | Extraction confidence (0-1) |
| `source` | String | "auto_extraction" or "manual" |
| `is_active` | Boolean | Whether fact is still valid |
| `learned_at` | DateTime | When fact was learned |
| `category` | String | Same as fact_type |
| `importance` | Float | Importance level (0-1) |
| `temporal_relevance` | String | past/present/future/timeless |
| `extracted_from_message_id` | BigInteger | Discord message ID |
| `extraction_context` | String | First 200 chars of source message |
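
To make the column layout concrete, the sketch below mirrors the table in SQLite using only the standard library. It is a stand-in for illustration, not the project's model definition: the table name `user_facts`, the SQLite type mapping, and the `save_fact` helper are all assumptions, and the foreign key constraint on `user_id` is omitted.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

SCHEMA = """
CREATE TABLE user_facts (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,            -- foreign key to users (not enforced here)
    fact_type TEXT NOT NULL,
    fact_content TEXT NOT NULL,
    confidence REAL,
    source TEXT,
    is_active INTEGER DEFAULT 1,         -- boolean stored as 0/1
    learned_at TEXT,                     -- ISO 8601 timestamp
    category TEXT,                       -- same as fact_type
    importance REAL,
    temporal_relevance TEXT,
    extracted_from_message_id INTEGER,
    extraction_context TEXT
)
"""


def save_fact(conn: sqlite3.Connection, user_id: int, fact: dict,
              message: str, message_id: Optional[int] = None) -> int:
    """Insert one auto-extracted fact and return its row id (hypothetical helper)."""
    cur = conn.execute(
        "INSERT INTO user_facts (user_id, fact_type, fact_content, confidence,"
        " source, learned_at, category, importance, temporal_relevance,"
        " extracted_from_message_id, extraction_context)"
        " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            user_id, fact["type"], fact["content"], fact["confidence"],
            "auto_extraction", datetime.now(timezone.utc).isoformat(),
            fact["type"], fact["importance"], fact["temporal"],
            message_id,
            message[:200],  # only the first 200 chars of the source message
        ),
    )
    return cur.lastrowid
```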
---
## API Reference
### FactExtractionService
```python
class FactExtractionService:
    MIN_MESSAGE_LENGTH = 20
    MAX_FACTS_PER_MESSAGE = 3
    def __init__(
        self,
        session: AsyncSession,
        ai_service=None,
    ) -> None: ...
    async def maybe_extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]:
        """Rate-limited extraction."""
    async def extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]:
        """Direct extraction (no rate limiting)."""
```
---
## Configuration
| Variable | Default | Description |
|----------|---------|-------------|
| `FACT_EXTRACTION_ENABLED` | `true` | Enable/disable fact extraction |
| `FACT_EXTRACTION_RATE` | `0.3` | Probability of extraction (0-1) |
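
A sketch of how these two variables might be read, assuming they are supplied as environment variables. The `FactExtractionSettings` class, the `load_settings` function, the accepted truthy strings, and the clamping of the rate to the 0 to 1 range are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class FactExtractionSettings:
    fact_extraction_enabled: bool = True
    fact_extraction_rate: float = 0.3


def load_settings(env: Optional[Mapping[str, str]] = None) -> FactExtractionSettings:
    """Build settings from environment-style variables (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumed convention: these strings count as "enabled".
    enabled = env.get("FACT_EXTRACTION_ENABLED", "true").strip().lower() in {
        "1", "true", "yes", "on",
    }
    # Clamp to the documented 0-1 range instead of failing on out-of-range input.
    rate = min(max(float(env.get("FACT_EXTRACTION_RATE", "0.3")), 0.0), 1.0)
    return FactExtractionSettings(enabled, rate)
```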
---
## Example Usage
```python
from daemon_boyfriend.services.fact_extraction_service import FactExtractionService
async with get_session() as session:
    fact_service = FactExtractionService(session, ai_service)
    # Rate-limited extraction (recommended for normal use)
    new_facts = await fact_service.maybe_extract_facts(
        user=user,
        message_content="I just started learning Japanese!",
        discord_message_id=123456789
    )
    for fact in new_facts:
        print(f"Learned: [{fact.fact_type}] {fact.fact_content}")
    # Direct extraction (skips rate limiting)
    facts = await fact_service.extract_facts(
        user=user,
        message_content="I work at Microsoft as a PM"
    )
    await session.commit()
```
---
## Manual Fact Addition
Users can also add facts manually:
### !remember Command
```
User: !remember I'm allergic to peanuts
Bot: Got it! I'll remember that you're allergic to peanuts.
```
These facts have:
- `source = "manual"` instead of `"auto_extraction"`
- `confidence = 1.0` (user stated directly)
- `importance = 0.8` (user wanted it remembered)
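
The attribute values above can be illustrated with a small parser for the `!remember` command. This is a hypothetical sketch: `parse_remember_command` is not a name from the codebase, the `general` fact type is an assumed default, and the text is stored exactly as typed, since the source does not say whether manual facts are rewritten into third person.

```python
from typing import Optional


def parse_remember_command(message: str) -> Optional[dict]:
    """Turn '!remember <text>' into a manual fact record (hypothetical helper)."""
    parts = message.strip().split(maxsplit=1)
    # Require the exact command word followed by some text to remember.
    if len(parts) < 2 or parts[0].lower() != "!remember":
        return None
    return {
        "fact_type": "general",        # assumed default category
        "fact_content": parts[1].strip(),
        "source": "manual",            # instead of "auto_extraction"
        "confidence": 1.0,             # user stated it directly
        "importance": 0.8,             # user wanted it remembered
    }
```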
### Admin Command
```
Admin: !teachbot @user Works night shifts
Bot: Got it! I've noted that about @user.
```
---
## Fact Retrieval
Facts are used in AI prompts for context:
```python
# Build user context including facts
async def build_user_context(user: User) -> str:
    facts = await get_active_facts(user)
    context = f"User: {user.custom_name or user.discord_name}\n"
    context += "Known facts:\n"
    for fact in facts:
        context += f"- {fact.fact_content}\n"
    return context
```
### Example Context
```
User: Alex
Known facts:
- works as senior engineer at Google
- loves hiking on weekends
- has two cats named Luna and Stella
- prefers dark roast coffee
- speaks English and Japanese
```
---
## Design Considerations
### Why Third Person?
Facts are stored in third person ("loves hiking" not "I love hiking"):
- Easier to inject into prompts
- Consistent format
- Works in any context
### Why Rate Limit?
- Not every message contains facts
- AI API calls are expensive
- Quality over quantity
- Natural learning pace
### Why Deduplication?
- Prevents redundant storage
- Keeps fact list clean
- Reduces noise in prompts
- Respects user privacy (one fact = one entry)
### Privacy Considerations
- Facts can be viewed with `!whatdoyouknow`
- Facts can be deleted with `!forgetme`
- Extraction context is stored (can be audited)
- Source message ID is stored (for reference)