added technical documentation
docs/living-ai/fact-extraction.md
# Fact Extraction System Deep Dive

The fact extraction system autonomously learns facts about users from their conversations with the bot.

## Overview

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                           Fact Extraction Pipeline                           │
└──────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
                    ┌──────────────────────────────────────┐
                    │          Rate Limiter (30%)          │
                    │    Only process ~30% of messages     │
                    └──────────────────────────────────────┘
                                       │
                                       ▼
                    ┌──────────────────────────────────────┐
                    │         Extractability Check         │
                    │  - Min 20 chars                      │
                    │  - Not a command                     │
                    │  - Not just greetings                │
                    │  - Has enough text content           │
                    └──────────────────────────────────────┘
                                       │
                                       ▼
                    ┌──────────────────────────────────────┐
                    │          AI Fact Extraction          │
                    │      Extracts structured facts       │
                    └──────────────────────────────────────┘
                                       │
                                       ▼
                    ┌──────────────────────────────────────┐
                    │            Deduplication             │
                    │  - Exact match check                 │
                    │  - Substring check                   │
                    │  - Word overlap check (70%)          │
                    └──────────────────────────────────────┘
                                       │
                                       ▼
                    ┌──────────────────────────────────────┐
                    │         Validation & Storage         │
                    │       Save valid, unique facts       │
                    └──────────────────────────────────────┘
```

---

## Fact Types

| Type | Description | Examples |
|------|-------------|----------|
| `hobby` | Activities, interests, pastimes | "loves hiking", "plays guitar" |
| `work` | Job, career, professional life | "works as a software engineer at Google" |
| `family` | Family members, relationships | "has two younger sisters" |
| `preference` | Likes, dislikes, preferences | "prefers dark roast coffee" |
| `location` | Places they live, visit, are from | "lives in Amsterdam" |
| `event` | Important life events | "recently got married" |
| `relationship` | Personal relationships | "has a girlfriend named Sarah" |
| `general` | Other facts that don't fit | "speaks three languages" |

---

## Fact Attributes

Each extracted fact has:

| Attribute | Type | Description |
|-----------|------|-------------|
| `type` | string | One of the fact types above |
| `content` | string | The fact itself (third person) |
| `confidence` | float | How certain the extraction is |
| `importance` | float | How significant the fact is |
| `temporal` | string | Time relevance |

### Confidence Levels

| Level | Value | When to Use |
|-------|-------|-------------|
| Implied | 0.6 | Fact is suggested but not stated |
| Stated | 0.8 | Fact is clearly mentioned |
| Explicit | 1.0 | User directly stated the fact |

### Importance Levels

| Level | Value | Description |
|-------|-------|-------------|
| Trivial | 0.3 | Minor detail |
| Normal | 0.5 | Standard fact |
| Significant | 0.8 | Important information |
| Very Important | 1.0 | Major life fact |

### Temporal Relevance

| Value | Description | Example |
|-------|-------------|---------|
| `past` | Happened before | "used to live in Paris" |
| `present` | Currently true | "works at Microsoft" |
| `future` | Planned/expected | "getting married next month" |
| `timeless` | Always true | "was born in Japan" |
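
The attribute tables above can be captured in a small validated record. The following is a minimal sketch, not the project's actual model: the class name `ExtractedFact` and the constants `FACT_TYPES` and `TEMPORAL_VALUES` are hypothetical, and the 0 to 1 range for scores is taken from the schema description later in this document.

```python
from dataclasses import dataclass

# Allowed values, copied from the tables above.
FACT_TYPES = {
    "hobby", "work", "family", "preference",
    "location", "event", "relationship", "general",
}
TEMPORAL_VALUES = {"past", "present", "future", "timeless"}


@dataclass(frozen=True)
class ExtractedFact:
    """Hypothetical container for one extracted fact."""

    type: str
    content: str
    confidence: float
    importance: float
    temporal: str

    def __post_init__(self) -> None:
        # Reject values that fall outside the documented vocabularies.
        if self.type not in FACT_TYPES:
            raise ValueError(f"unknown fact type: {self.type!r}")
        if self.temporal not in TEMPORAL_VALUES:
            raise ValueError(f"unknown temporal value: {self.temporal!r}")
        if not self.content.strip():
            raise ValueError("content must not be empty")
        for name in ("confidence", "importance"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1")


# Field names match the JSON keys, so a parsed fact can be unpacked directly.
fact = ExtractedFact(**{
    "type": "work",
    "content": "works at Microsoft",
    "confidence": 0.8,
    "importance": 0.5,
    "temporal": "present",
})
```

Validating at construction time means a malformed AI response fails early instead of reaching the database.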

---

## Rate Limiting

To limit AI API calls, extraction is attempted on only a random sample of messages:

```python
# Only attempt extraction on ~30% of messages
if random.random() > settings.fact_extraction_rate:
    return []  # Skip this message
```

**Configuration:**
- `FACT_EXTRACTION_RATE` = 0.3 (default)
- Can be adjusted from 0.0 (disabled) to 1.0 (every message)
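
The sampling behaviour can be checked in isolation. This is a minimal sketch under stated assumptions: `should_attempt_extraction` is a hypothetical helper, not part of the service, and it uses a strict `<` comparison so that a rate of 0.0 never attempts extraction and a rate of 1.0 always does.

```python
import random


def should_attempt_extraction(rate: float, rng: random.Random) -> bool:
    """Return True for roughly `rate` of all calls.

    rng.random() yields values in [0.0, 1.0), so a rate of 0.0 is never
    satisfied and a rate of 1.0 always is.
    """
    return rng.random() < rate


# Simulate 10,000 messages with a seeded generator for repeatability.
rng = random.Random(42)
total = 10_000
attempted = sum(should_attempt_extraction(0.3, rng) for _ in range(total))
print(f"extraction attempted on {attempted / total:.1%} of messages")
```

Passing the generator in as a parameter keeps the helper deterministic under test while production code can pass a shared `random.Random()` instance.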

**Why Rate Limit?**
- Reduces AI API costs
- Not every message contains facts
- Prevents redundant extractions
- Spreads learning over time

---

## Extractability Checks

Before a message is sent to the AI, it must pass the following filters:

### Minimum Length
```python
MIN_MESSAGE_LENGTH = 20
if len(content) < MIN_MESSAGE_LENGTH:
    return False
```

### Alpha Ratio
```python
# Must be at least 50% alphabetic characters
alpha_ratio = sum(c.isalpha() for c in content) / len(content)
if alpha_ratio < 0.5:
    return False
```

### Command Detection
```python
# Skip command-like messages
if content.startswith(("!", "/", "?", ".")):
    return False
```

### Short Phrase Filter
```python
short_phrases = [
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow"
]
if content.lower().strip() in short_phrases:
    return False
```
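
The four filters above can be combined into a single predicate. The sketch below is illustrative only: the function name `is_extractable`, the constant names other than `MIN_MESSAGE_LENGTH`, the order of the checks, and stripping whitespace before every check are all assumptions rather than the service's confirmed behaviour.

```python
MIN_MESSAGE_LENGTH = 20
MIN_ALPHA_RATIO = 0.5                    # hypothetical name for the 50% rule
COMMAND_PREFIXES = ("!", "/", "?", ".")  # hypothetical name
SHORT_PHRASES = frozenset({
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow",
})


def is_extractable(content: str) -> bool:
    """Return True if a message is worth sending to the AI."""
    text = content.strip()
    # Length check runs first, which also guards the division below.
    if len(text) < MIN_MESSAGE_LENGTH:
        return False
    if text.startswith(COMMAND_PREFIXES):
        return False
    if text.lower() in SHORT_PHRASES:
        return False
    # Share of alphabetic characters in the whole message.
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    return alpha_ratio >= MIN_ALPHA_RATIO


for sample in (
    "I just started learning Japanese!",
    "hey",
    "!remember I'm allergic to peanuts",
    "1234567890 1234567890 12345",
):
    print(f"{sample!r}: {is_extractable(sample)}")
```

Because every listed short phrase is shorter than 20 characters, the length check already rejects them when it runs first; the phrase filter only matters if the checks are ordered differently or the minimum length is lowered.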

---

## AI Extraction Prompt

The system sends a structured prompt to the AI. The known-facts list is shown here with sample entries:

```
You are a fact extraction assistant. Extract factual information
about the user from their message.

ALREADY KNOWN FACTS:
- [hobby] loves hiking
- [work] works as senior engineer at Google

RULES:
1. Only extract CONCRETE facts, not opinions or transient states
2. Skip if the fact is already known (listed above)
3. Skip greetings, questions, or meta-conversation
4. Skip vague statements like "I like stuff" - be specific
5. Focus on: hobbies, work, family, preferences, locations, events, relationships
6. Keep fact content concise (under 100 characters)
7. Maximum 3 facts per message

OUTPUT FORMAT:
Return a JSON array of facts, or empty array [] if no extractable facts.
```

### Example Input/Output

**Input:** "I just got promoted to senior engineer at Google last week!"

**Output:**
```json
[
  {
    "type": "work",
    "content": "works as senior engineer at Google",
    "confidence": 1.0,
    "importance": 0.8,
    "temporal": "present"
  },
  {
    "type": "event",
    "content": "recently got promoted",
    "confidence": 1.0,
    "importance": 0.7,
    "temporal": "past"
  }
]
```

**Input:** "hey what's up"

**Output:**
```json
[]
```
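
The AI's reply is free text, so it has to be parsed defensively before anything is stored. The following is a rough sketch of one way to do that, assuming the reply is the raw JSON array described above; `parse_facts` and `REQUIRED_KEYS` are hypothetical names, and treating malformed output as "no facts" is an assumption about the desired failure mode.

```python
import json

MAX_FACTS_PER_MESSAGE = 3
REQUIRED_KEYS = {"type", "content", "confidence", "importance", "temporal"}


def parse_facts(raw: str) -> list[dict]:
    """Parse the model's reply into at most three well-formed fact dicts."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # Not JSON at all: treat as "no facts"
    if not isinstance(data, list):
        return []  # The prompt asks for an array, nothing else
    # Keep only objects that carry every expected attribute.
    facts = [
        item for item in data
        if isinstance(item, dict) and REQUIRED_KEYS.issubset(item)
    ]
    return facts[:MAX_FACTS_PER_MESSAGE]


reply = """[
  {"type": "work", "content": "works as senior engineer at Google",
   "confidence": 1.0, "importance": 0.8, "temporal": "present"},
  {"type": "event", "content": "recently got promoted"}
]"""
facts = parse_facts(reply)
print(len(facts), "valid fact(s)")
```

The second object in the sample reply lacks three attributes, so only the first survives filtering.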

---

## Deduplication

Before saving, each new fact is compared against the user's existing facts:

### 1. Exact Match
```python
# existing_content is assumed to be a collection of lowercased fact
# strings; on a single string, `in` would be a substring test instead.
if new_content.lower() in existing_content:
    return True  # Is duplicate
```

### 2. Substring Check
```python
# If one contains the other (for facts > 10 chars)
if len(new_lower) > 10 and len(existing) > 10:
    if new_lower in existing or existing in new_lower:
        return True
```

### 3. Word Overlap (70% threshold)
```python
new_words = set(new_lower.split())
existing_words = set(existing.split())

if len(new_words) > 2 and len(existing_words) > 2:
    overlap = len(new_words & existing_words)
    min_len = min(len(new_words), len(existing_words))
    if overlap / min_len > 0.7:
        return True
```

**Examples:**
- "loves hiking" vs "loves hiking" → **Duplicate** (exact)
- "works as engineer at Google" vs "engineer at Google" → **Duplicate** (substring)
- "has two younger sisters" vs "has two younger brothers" → **Duplicate** (3 of 4 words overlap = 75%, above the 70% threshold)
- "loves hiking" vs "enjoys cooking" → **Not duplicate**
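
The three checks can be run in sequence against every stored fact. This is a sketch of how they might fit together, reusing the thresholds shown above; the function name `is_duplicate` and its signature are hypothetical, and the exact-match step is written as a plain equality test.

```python
def is_duplicate(new_content: str, existing_contents: list[str]) -> bool:
    """Return True if new_content duplicates any already stored fact."""
    new_lower = new_content.lower().strip()
    new_words = set(new_lower.split())

    for existing in (e.lower().strip() for e in existing_contents):
        # 1. Exact match
        if new_lower == existing:
            return True
        # 2. Substring check, only for facts longer than 10 characters
        if len(new_lower) > 10 and len(existing) > 10:
            if new_lower in existing or existing in new_lower:
                return True
        # 3. Word overlap above 70% of the shorter fact's word count
        existing_words = set(existing.split())
        if len(new_words) > 2 and len(existing_words) > 2:
            overlap = len(new_words & existing_words)
            min_len = min(len(new_words), len(existing_words))
            if overlap / min_len > 0.7:
                return True
    return False


known = ["loves hiking", "works as engineer at Google", "has two younger sisters"]
for candidate in (
    "loves hiking",
    "engineer at Google",
    "has two younger brothers",
    "enjoys cooking",
):
    print(f"{candidate!r}: duplicate={is_duplicate(candidate, known)}")
```

Note that the word-overlap rule treats "sisters" and "brothers" as the same fact because the other three words match, so a stricter threshold or a type-aware comparison may be worth considering if such near-misses matter.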

---

## Database Schema

### UserFact Table

| Column | Type | Description |
|--------|------|-------------|
| `id` | Integer | Primary key |
| `user_id` | Integer | Foreign key to users |
| `fact_type` | String | Category (hobby, work, etc.) |
| `fact_content` | String | The fact content |
| `confidence` | Float | Extraction confidence (0-1) |
| `source` | String | "auto_extraction" or "manual" |
| `is_active` | Boolean | Whether fact is still valid |
| `learned_at` | DateTime | When fact was learned |
| `category` | String | Same as fact_type |
| `importance` | Float | Importance level (0-1) |
| `temporal_relevance` | String | past/present/future/timeless |
| `extracted_from_message_id` | BigInteger | Discord message ID |
| `extraction_context` | String | First 200 chars of source message |
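
To make the column list concrete, here is an illustrative sketch that mirrors it in an in-memory SQLite database using only the standard library. It is not the project's schema definition: the table name `user_facts`, the `NOT NULL` constraints, and the defaults are assumptions, and the foreign key to the users table is omitted because that table is not described here.

```python
import sqlite3

# Hypothetical DDL mirroring the documented columns.
SCHEMA = """
CREATE TABLE user_facts (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    fact_type TEXT NOT NULL,
    fact_content TEXT NOT NULL,
    confidence REAL,
    source TEXT,
    is_active INTEGER DEFAULT 1,
    learned_at TEXT DEFAULT CURRENT_TIMESTAMP,
    category TEXT,
    importance REAL,
    temporal_relevance TEXT,
    extracted_from_message_id INTEGER,
    extraction_context TEXT
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

message = "I just got promoted to senior engineer at Google last week!"
conn.execute(
    """
    INSERT INTO user_facts (
        user_id, fact_type, fact_content, confidence, source, category,
        importance, temporal_relevance, extracted_from_message_id,
        extraction_context
    ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """,
    (
        1, "work", "works as senior engineer at Google", 1.0,
        "auto_extraction", "work", 0.8, "present", 123456789,
        message[:200],  # Only the first 200 characters are kept
    ),
)

# Fetch the facts that are still considered valid for this user.
active = conn.execute(
    "SELECT fact_type, fact_content FROM user_facts "
    "WHERE user_id = ? AND is_active = 1",
    (1,),
).fetchall()
print(active)
```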

---

## API Reference

### FactExtractionService

```python
class FactExtractionService:
    MIN_MESSAGE_LENGTH = 20
    MAX_FACTS_PER_MESSAGE = 3

    def __init__(
        self,
        session: AsyncSession,
        ai_service=None,
    ): ...

    async def maybe_extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]:
        """Rate-limited extraction."""
        ...

    async def extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]:
        """Direct extraction (no rate limiting)."""
        ...
```

---

## Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `FACT_EXTRACTION_ENABLED` | `true` | Enable/disable fact extraction |
| `FACT_EXTRACTION_RATE` | `0.3` | Probability of extraction (0-1) |
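
As a rough illustration of how these two variables could be read from the environment, consider the sketch below. The helper `load_fact_extraction_config` is hypothetical, as are the set of strings accepted as "true" and the clamping of out-of-range rates; the project's own settings loader may behave differently.

```python
import os
from collections.abc import Mapping

TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings of "enabled"


def load_fact_extraction_config(
    env: Mapping[str, str] = os.environ,
) -> tuple[bool, float]:
    """Return (enabled, rate) using the documented defaults."""
    enabled = env.get("FACT_EXTRACTION_ENABLED", "true").strip().lower() in TRUTHY
    rate = float(env.get("FACT_EXTRACTION_RATE", "0.3"))
    # Clamp to the documented 0-1 range instead of failing.
    rate = min(1.0, max(0.0, rate))
    return enabled, rate


print(load_fact_extraction_config({}))
print(load_fact_extraction_config({
    "FACT_EXTRACTION_ENABLED": "false",
    "FACT_EXTRACTION_RATE": "1.5",
}))
```

Accepting the environment as a parameter makes the loader easy to exercise with a plain dictionary.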

---

## Example Usage

```python
from daemon_boyfriend.services.fact_extraction_service import FactExtractionService

# get_session, ai_service, and user are provided by the application.
async with get_session() as session:
    fact_service = FactExtractionService(session, ai_service)

    # Rate-limited extraction (recommended for normal use)
    new_facts = await fact_service.maybe_extract_facts(
        user=user,
        message_content="I just started learning Japanese!",
        discord_message_id=123456789
    )

    for fact in new_facts:
        print(f"Learned: [{fact.fact_type}] {fact.fact_content}")

    # Direct extraction (skips rate limiting)
    facts = await fact_service.extract_facts(
        user=user,
        message_content="I work at Microsoft as a PM"
    )

    await session.commit()
```

---

## Manual Fact Addition

Users can also add facts manually:

### !remember Command
```
User: !remember I'm allergic to peanuts

Bot: Got it! I'll remember that you're allergic to peanuts.
```

These facts have:
- `source = "manual"` instead of `"auto_extraction"`
- `confidence = 1.0` (user stated directly)
- `importance = 0.8` (user wanted it remembered)

### Admin Command
```
Admin: !teachbot @user Works night shifts

Bot: Got it! I've noted that about @user.
```

---

## Fact Retrieval

Facts are used in AI prompts for context:

```python
# Build user context including facts
async def build_user_context(user: User) -> str:
    facts = await get_active_facts(user)

    context = f"User: {user.custom_name or user.discord_name}\n"
    context += "Known facts:\n"

    for fact in facts:
        context += f"- {fact.fact_content}\n"

    return context
```

### Example Context
```
User: Alex
Known facts:
- works as senior engineer at Google
- loves hiking on weekends
- has two cats named Luna and Stella
- prefers dark roast coffee
- speaks English and Japanese
```

---

## Design Considerations

### Why Third Person?

Facts are stored in third person ("loves hiking" not "I love hiking"):
- Easier to inject into prompts
- Consistent format
- Works in any context

### Why Rate Limit?

- Not every message contains facts
- AI API calls are expensive
- Quality over quantity
- Natural learning pace

### Why Deduplication?

- Prevents redundant storage
- Keeps fact list clean
- Reduces noise in prompts
- Respects user privacy (one fact = one entry)

### Privacy Considerations

- Facts can be viewed with `!whatdoyouknow`
- Facts can be deleted with `!forgetme`
- Extraction context is stored (can be audited)
- Source message ID is stored (for reference)