# Fact Extraction System Deep Dive

The fact extraction system autonomously learns facts about users from their conversations with the bot.

## Overview

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                           Fact Extraction Pipeline                           │
└──────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
                    ┌──────────────────────────────────────┐
                    │          Rate Limiter (30%)          │
                    │    Only process ~30% of messages     │
                    └──────────────────────────────────────┘
                                       │
                                       ▼
                    ┌──────────────────────────────────────┐
                    │         Extractability Check         │
                    │  - Min 20 chars                      │
                    │  - Not a command                     │
                    │  - Not just greetings                │
                    │  - Has enough text content           │
                    └──────────────────────────────────────┘
                                       │
                                       ▼
                    ┌──────────────────────────────────────┐
                    │          AI Fact Extraction          │
                    │      Extracts structured facts       │
                    └──────────────────────────────────────┘
                                       │
                                       ▼
                    ┌──────────────────────────────────────┐
                    │            Deduplication             │
                    │  - Exact match check                 │
                    │  - Substring check                   │
                    │  - Word overlap check (70%)          │
                    └──────────────────────────────────────┘
                                       │
                                       ▼
                    ┌──────────────────────────────────────┐
                    │         Validation & Storage         │
                    │       Save valid, unique facts       │
                    └──────────────────────────────────────┘
```

---

## Fact Types

| Type | Description | Examples |
|------|-------------|----------|
| `hobby` | Activities, interests, pastimes | "loves hiking", "plays guitar" |
| `work` | Job, career, professional life | "works as a software engineer at Google" |
| `family` | Family members, relationships | "has two younger sisters" |
| `preference` | Likes, dislikes, preferences | "prefers dark roast coffee" |
| `location` | Places they live, visit, are from | "lives in Amsterdam" |
| `event` | Important life events | "recently got married" |
| `relationship` | Personal relationships | "has a girlfriend named Sarah" |
| `general` | Other facts that don't fit | "speaks three languages" |

---

## Fact Attributes

Each extracted fact has:

| Attribute | Type | Description |
|-----------|------|-------------|
| `type` | string | One of the fact types above |
| `content` | string | The fact itself (third person) |
| `confidence` | float | How certain the extraction is |
| `importance` | float | How significant the fact is |
| `temporal` | string | Time relevance |

### Confidence Levels

| Level | Value | When to Use |
|-------|-------|-------------|
| Implied | 0.6 | Fact is suggested but not stated |
| Stated | 0.8 | Fact is clearly mentioned |
| Explicit | 1.0 | User directly stated the fact |

### Importance Levels

| Level | Value | Description |
|-------|-------|-------------|
| Trivial | 0.3 | Minor detail |
| Normal | 0.5 | Standard fact |
| Significant | 0.8 | Important information |
| Very Important | 1.0 | Major life fact |

### Temporal Relevance

| Value | Description | Example |
|-------|-------------|---------|
| `past` | Happened before | "used to live in Paris" |
| `present` | Currently true | "works at Microsoft" |
| `future` | Planned/expected | "getting married next month" |
| `timeless` | Always true | "was born in Japan" |

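The attribute tables above can be read as a small record type. The following is a minimal sketch of such a record with basic validation; the class name `ExtractedFact` and the decision to raise `ValueError` on bad input are assumptions made for illustration, not part of the documented service.

```python
from dataclasses import dataclass

# Allowed values, copied from the Fact Types and Temporal Relevance tables
FACT_TYPES = {"hobby", "work", "family", "preference",
              "location", "event", "relationship", "general"}
TEMPORAL_VALUES = {"past", "present", "future", "timeless"}


@dataclass(frozen=True)
class ExtractedFact:
    """Hypothetical container mirroring the documented fact attributes."""

    type: str
    content: str
    confidence: float
    importance: float
    temporal: str

    def __post_init__(self) -> None:
        if self.type not in FACT_TYPES:
            raise ValueError(f"unknown fact type: {self.type!r}")
        if self.temporal not in TEMPORAL_VALUES:
            raise ValueError(f"unknown temporal value: {self.temporal!r}")
        if not self.content.strip():
            raise ValueError("content must not be empty")
        # confidence and importance are both scores on a 0-1 scale
        for name in ("confidence", "importance"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")
```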
---

## Rate Limiting

To prevent excessive API calls, extraction is attempted on only a random sample of messages:

```python
# Only attempt extraction on ~30% of messages
if random.random() > settings.fact_extraction_rate:
    return []  # Skip this message
```

**Configuration:**
- `FACT_EXTRACTION_RATE` = 0.3 (default)
- Can be adjusted from 0.0 (disabled) to 1.0 (every message)

**Why Rate Limit?**
- Reduces AI API costs
- Not every message contains facts
- Prevents redundant extractions
- Spreads learning over time

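The sampling decision can be isolated into a small helper, which makes it easy to test with a seeded generator. This is a sketch only: the function name `should_attempt_extraction` and the injectable `rng` parameter are hypothetical. It uses `<` rather than the `>` comparison shown above, so that a rate of exactly 0.0 never fires and 1.0 always does.

```python
import random
from typing import Optional


def should_attempt_extraction(rate: float, rng: Optional[random.Random] = None) -> bool:
    """Return True for roughly `rate` of calls (0.0 = never, 1.0 = always)."""
    source = rng if rng is not None else random
    # random() is uniform on [0.0, 1.0), so the result is below `rate`
    # with probability `rate`
    return source.random() < rate
```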
---

## Extractability Checks

Before a message is sent to the AI, it must pass the following filters:

### Minimum Length
```python
MIN_MESSAGE_LENGTH = 20
if len(content) < MIN_MESSAGE_LENGTH:
    return False
```

### Alpha Ratio
```python
# Must be at least 50% alphabetic characters
alpha_ratio = sum(c.isalpha() for c in content) / len(content)
if alpha_ratio < 0.5:
    return False
```

### Command Detection
```python
# Skip command-like messages
if content.startswith(("!", "/", "?", ".")):
    return False
```

### Short Phrase Filter
```python
short_phrases = [
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow"
]
if content.lower().strip() in short_phrases:
    return False
```

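Taken together, the four filters above amount to a single predicate. The sketch below combines them into one function; the name `is_extractable`, the constant names other than `MIN_MESSAGE_LENGTH`, and the order of the checks are assumptions, since the snippets above do not show how the service sequences them.

```python
MIN_MESSAGE_LENGTH = 20
MIN_ALPHA_RATIO = 0.5
COMMAND_PREFIXES = ("!", "/", "?", ".")
SHORT_PHRASES = frozenset({
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow",
})


def is_extractable(content: str) -> bool:
    """Return True if the message passes all four documented filters."""
    text = content.strip()
    # Every listed phrase is shorter than 20 characters, so this check is
    # redundant with the length check; it is kept to mirror the documented filters
    if text.lower() in SHORT_PHRASES:
        return False
    if len(text) < MIN_MESSAGE_LENGTH:
        return False
    if text.startswith(COMMAND_PREFIXES):
        return False
    # len(text) >= 20 here, so the division cannot hit zero
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    return alpha_ratio >= MIN_ALPHA_RATIO
```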
---

## AI Extraction Prompt

The system sends the AI a prompt that lists the facts already known about the user, followed by the extraction rules:

```
You are a fact extraction assistant. Extract factual information
about the user from their message.

ALREADY KNOWN FACTS:
- [hobby] loves hiking
- [work] works as senior engineer at Google

RULES:
1. Only extract CONCRETE facts, not opinions or transient states
2. Skip if the fact is already known (listed above)
3. Skip greetings, questions, or meta-conversation
4. Skip vague statements like "I like stuff" - be specific
5. Focus on: hobbies, work, family, preferences, locations, events, relationships
6. Keep fact content concise (under 100 characters)
7. Maximum 3 facts per message

OUTPUT FORMAT:
Return a JSON array of facts, or empty array [] if no extractable facts.
```

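A prompt of this shape can be assembled from the user's stored facts. The sketch below is illustrative: `build_extraction_prompt`, the `(fact_type, content)` tuple input, and the `- (none)` placeholder for users without known facts are all assumptions, and the real service may build its prompt differently.

```python
PROMPT_HEADER = (
    "You are a fact extraction assistant. Extract factual information\n"
    "about the user from their message."
)

# Rules copied verbatim from the prompt shown above
RULES = [
    "Only extract CONCRETE facts, not opinions or transient states",
    "Skip if the fact is already known (listed above)",
    "Skip greetings, questions, or meta-conversation",
    'Skip vague statements like "I like stuff" - be specific',
    "Focus on: hobbies, work, family, preferences, locations, events, relationships",
    "Keep fact content concise (under 100 characters)",
    "Maximum 3 facts per message",
]

OUTPUT_FORMAT = (
    "Return a JSON array of facts, or empty array [] if no extractable facts."
)


def build_extraction_prompt(known_facts: list[tuple[str, str]]) -> str:
    """Render the prompt for a list of (fact_type, content) pairs."""
    # "- (none)" is a hypothetical placeholder for users with no stored facts
    known = "\n".join(f"- [{t}] {c}" for t, c in known_facts) or "- (none)"
    rules = "\n".join(f"{i}. {rule}" for i, rule in enumerate(RULES, start=1))
    return (
        f"{PROMPT_HEADER}\n\n"
        f"ALREADY KNOWN FACTS:\n{known}\n\n"
        f"RULES:\n{rules}\n\n"
        f"OUTPUT FORMAT:\n{OUTPUT_FORMAT}"
    )
```

Calling `build_extraction_prompt([("hobby", "loves hiking")])` yields a prompt whose known-facts section contains the single line `- [hobby] loves hiking`.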
### Example Input/Output

**Input:** "I just got promoted to senior engineer at Google last week!"

**Output:**
```json
[
  {
    "type": "work",
    "content": "works as senior engineer at Google",
    "confidence": 1.0,
    "importance": 0.8,
    "temporal": "present"
  },
  {
    "type": "event",
    "content": "recently got promoted",
    "confidence": 1.0,
    "importance": 0.7,
    "temporal": "past"
  }
]
```

**Input:** "hey what's up"

**Output:**
```json
[]
```

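Because the model's reply is free text, it is worth parsing it defensively before anything is stored. The following sketch shows one way to do that; `parse_facts` is a hypothetical helper, and its policy of silently dropping malformed entries (rather than repairing them or falling back to `general`) is an assumption, not documented behavior.

```python
import json

MAX_FACTS_PER_MESSAGE = 3
MAX_CONTENT_LENGTH = 100  # rule 6: content must stay under 100 characters
FACT_TYPES = {"hobby", "work", "family", "preference",
              "location", "event", "relationship", "general"}
TEMPORAL_VALUES = {"past", "present", "future", "timeless"}


def _is_unit_float(value) -> bool:
    """True for a real number in [0, 1]; booleans are rejected."""
    return (isinstance(value, (int, float))
            and not isinstance(value, bool)
            and 0.0 <= value <= 1.0)


def parse_facts(raw: str) -> list[dict]:
    """Parse the model's reply, keeping at most 3 well-formed facts."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []

    facts = []
    for item in data:
        if not isinstance(item, dict):
            continue
        content = item.get("content")
        if not isinstance(content, str) or not content.strip():
            continue
        if len(content) >= MAX_CONTENT_LENGTH:
            continue
        fact_type = item.get("type")
        if not isinstance(fact_type, str) or fact_type not in FACT_TYPES:
            continue
        temporal = item.get("temporal")
        if not isinstance(temporal, str) or temporal not in TEMPORAL_VALUES:
            continue
        if not (_is_unit_float(item.get("confidence"))
                and _is_unit_float(item.get("importance"))):
            continue
        facts.append(item)
        if len(facts) == MAX_FACTS_PER_MESSAGE:
            break
    return facts
```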
---

## Deduplication

Before saving, each new fact is compared against the user's existing facts:

### 1. Exact Match
```python
# Assumes `existing_content` is a collection (e.g. a set) of lowercased
# fact strings; on a plain string, `in` would be a substring test instead
if new_content.lower() in existing_content:
    return True  # Is duplicate
```

### 2. Substring Check
```python
# If one contains the other (for facts > 10 chars)
if len(new_lower) > 10 and len(existing) > 10:
    if new_lower in existing or existing in new_lower:
        return True
```

### 3. Word Overlap (70% threshold)
```python
new_words = set(new_lower.split())
existing_words = set(existing.split())

if len(new_words) > 2 and len(existing_words) > 2:
    overlap = len(new_words & existing_words)
    min_len = min(len(new_words), len(existing_words))
    if overlap / min_len > 0.7:
        return True
```

**Examples:**
- "loves hiking" vs "loves hiking" → **Duplicate** (exact)
- "works as engineer at Google" vs "engineer at Google" → **Duplicate** (substring)
- "has two younger sisters" vs "has two younger brothers" → **Duplicate** (3 of 4 words shared = 75% overlap)
- "loves hiking" vs "enjoys cooking" → **Not duplicate**

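The three checks above can be combined into one function and run against the listed examples. This is a sketch under stated assumptions: the name `is_duplicate` and the list-of-strings input are hypothetical, and the exact-match step is written as a direct equality test on lowercased, stripped text.

```python
def is_duplicate(new_content: str, existing_contents: list[str]) -> bool:
    """Return True if `new_content` duplicates any existing fact."""
    new_lower = new_content.lower().strip()
    new_words = set(new_lower.split())

    for existing in existing_contents:
        existing_lower = existing.lower().strip()

        # 1. Exact match
        if new_lower == existing_lower:
            return True

        # 2. Substring check, only for facts longer than 10 characters
        if len(new_lower) > 10 and len(existing_lower) > 10:
            if new_lower in existing_lower or existing_lower in new_lower:
                return True

        # 3. Word overlap above 70% of the shorter fact's word count
        existing_words = set(existing_lower.split())
        if len(new_words) > 2 and len(existing_words) > 2:
            overlap = len(new_words & existing_words)
            min_len = min(len(new_words), len(existing_words))
            if overlap / min_len > 0.7:
                return True

    return False
```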
---

## Database Schema

### UserFact Table

| Column | Type | Description |
|--------|------|-------------|
| `id` | Integer | Primary key |
| `user_id` | Integer | Foreign key to users |
| `fact_type` | String | Category (hobby, work, etc.) |
| `fact_content` | String | The fact content |
| `confidence` | Float | Extraction confidence (0-1) |
| `source` | String | "auto_extraction" or "manual" |
| `is_active` | Boolean | Whether fact is still valid |
| `learned_at` | DateTime | When fact was learned |
| `category` | String | Same as fact_type |
| `importance` | Float | Importance level (0-1) |
| `temporal_relevance` | String | past/present/future/timeless |
| `extracted_from_message_id` | BigInteger | Discord message ID |
| `extraction_context` | String | First 200 chars of source message |

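To make the column list concrete, here is a rough rendering of the table using the standard-library `sqlite3` module. It is a sketch, not the project's actual model: the table name `user_facts`, the `NOT NULL` and `CHECK` constraints, the defaults, and the `save_fact` helper are all assumptions, and the foreign key to the users table is omitted.

```python
import sqlite3

SCHEMA = """
CREATE TABLE user_facts (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    fact_type TEXT NOT NULL,
    fact_content TEXT NOT NULL,
    confidence REAL CHECK (confidence BETWEEN 0 AND 1),
    source TEXT CHECK (source IN ('auto_extraction', 'manual')),
    is_active INTEGER NOT NULL DEFAULT 1,
    learned_at TEXT DEFAULT CURRENT_TIMESTAMP,
    category TEXT,
    importance REAL CHECK (importance BETWEEN 0 AND 1),
    temporal_relevance TEXT
        CHECK (temporal_relevance IN ('past', 'present', 'future', 'timeless')),
    extracted_from_message_id INTEGER,
    extraction_context TEXT
)
"""


def save_fact(conn: sqlite3.Connection, user_id: int, fact: dict,
              message_id: int, message_text: str) -> None:
    """Insert one auto-extracted fact; `category` mirrors `fact_type`."""
    conn.execute(
        "INSERT INTO user_facts (user_id, fact_type, fact_content, confidence,"
        " source, category, importance, temporal_relevance,"
        " extracted_from_message_id, extraction_context)"
        " VALUES (?, ?, ?, ?, 'auto_extraction', ?, ?, ?, ?, ?)",
        (user_id, fact["type"], fact["content"], fact["confidence"],
         fact["type"], fact["importance"], fact["temporal"],
         message_id, message_text[:200]),  # keep only the first 200 chars
    )
```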
---

## API Reference

### FactExtractionService

```python
class FactExtractionService:
    MIN_MESSAGE_LENGTH = 20
    MAX_FACTS_PER_MESSAGE = 3

    def __init__(
        self,
        session: AsyncSession,
        ai_service=None
    ): ...

    async def maybe_extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]:
        """Rate-limited extraction."""

    async def extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]:
        """Direct extraction (no rate limiting)."""
```

---

## Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `FACT_EXTRACTION_ENABLED` | `true` | Enable/disable fact extraction |
| `FACT_EXTRACTION_RATE` | `0.3` | Probability of extraction (0-1) |

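As an illustration of how these two variables might be read from the environment, consider the sketch below. The `FactExtractionSettings` class, the `load_settings` function, and the set of strings accepted as "true" are assumptions; the project's real settings loader is not shown in this document.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings for an enabled flag


@dataclass(frozen=True)
class FactExtractionSettings:
    fact_extraction_enabled: bool = True
    fact_extraction_rate: float = 0.3


def load_settings(env: Optional[Mapping[str, str]] = None) -> FactExtractionSettings:
    """Read both variables from `env` (defaults to os.environ)."""
    source = os.environ if env is None else env
    enabled = source.get("FACT_EXTRACTION_ENABLED", "true").strip().lower() in TRUTHY
    rate = float(source.get("FACT_EXTRACTION_RATE", "0.3"))
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"FACT_EXTRACTION_RATE must be between 0 and 1, got {rate}")
    return FactExtractionSettings(enabled, rate)
```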
---

## Example Usage

```python
from daemon_boyfriend.services.fact_extraction_service import FactExtractionService

# Runs inside a coroutine; `get_session`, `ai_service`, and `user`
# are provided by the surrounding application
async with get_session() as session:
    fact_service = FactExtractionService(session, ai_service)

    # Rate-limited extraction (recommended for normal use)
    new_facts = await fact_service.maybe_extract_facts(
        user=user,
        message_content="I just started learning Japanese!",
        discord_message_id=123456789
    )

    for fact in new_facts:
        print(f"Learned: [{fact.fact_type}] {fact.fact_content}")

    # Direct extraction (skips rate limiting)
    facts = await fact_service.extract_facts(
        user=user,
        message_content="I work at Microsoft as a PM"
    )

    await session.commit()
```

---

## Manual Fact Addition

Users can also add facts manually:

### !remember Command
```
User: !remember I'm allergic to peanuts

Bot: Got it! I'll remember that you're allergic to peanuts.
```

These facts have:
- `source = "manual"` instead of `"auto_extraction"`
- `confidence = 1.0` (user stated directly)
- `importance = 0.8` (user wanted it remembered)

### Admin Command
```
Admin: !teachbot @user Works night shifts

Bot: Got it! I've noted that about @user.
```

---

## Fact Retrieval

Facts are used in AI prompts for context:

```python
# Build user context including facts
async def build_user_context(user: User) -> str:
    facts = await get_active_facts(user)

    context = f"User: {user.custom_name or user.discord_name}\n"
    context += "Known facts:\n"

    for fact in facts:
        context += f"- {fact.fact_content}\n"

    return context
```

### Example Context
```
User: Alex
Known facts:
- works as senior engineer at Google
- loves hiking on weekends
- has two cats named Luna and Stella
- prefers dark roast coffee
- speaks English and Japanese
```

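The same rendering can be tried without a database by using small stand-in objects. In this sketch the `StubUser` and `StubFact` dataclasses and the synchronous `render_user_context` function are hypothetical stand-ins for the real `User` and `UserFact` models, and filtering on `is_active` replaces the `get_active_facts` lookup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubUser:
    discord_name: str
    custom_name: Optional[str] = None


@dataclass
class StubFact:
    fact_content: str
    is_active: bool = True


def render_user_context(user: StubUser, facts: list[StubFact]) -> str:
    """Render the context block from in-memory facts, skipping inactive ones."""
    lines = [f"User: {user.custom_name or user.discord_name}", "Known facts:"]
    lines += [f"- {fact.fact_content}" for fact in facts if fact.is_active]
    return "\n".join(lines) + "\n"
```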
---

## Design Considerations

### Why Third Person?

Facts are stored in third person ("loves hiking" not "I love hiking"):
- Easier to inject into prompts
- Consistent format
- Works in any context

### Why Rate Limit?

- Not every message contains facts
- AI API calls are expensive
- Quality over quantity
- Natural learning pace

### Why Deduplication?

- Prevents redundant storage
- Keeps fact list clean
- Reduces noise in prompts
- Respects user privacy (one fact = one entry)

### Privacy Considerations

- Facts can be viewed with `!whatdoyouknow`
- Facts can be deleted with `!forgetme`
- Extraction context is stored (can be audited)
- Source message ID is stored (for reference)