Fact Extraction System Deep Dive
The fact extraction system autonomously learns facts about users from their conversations with the bot.
Overview
┌──────────────────────────────────────────────────────────────────────────────┐
│ Fact Extraction Pipeline │
└──────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Rate Limiter (30%) │
│ Only process ~30% of messages │
└──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Extractability Check │
│ - Min 20 chars │
│ - Not a command │
│ - Not just greetings │
│ - Has enough text content │
└──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ AI Fact Extraction │
│ Extracts structured facts │
└──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Deduplication │
│ - Exact match check │
│ - Substring check │
│ - Word overlap check (70%) │
└──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Validation & Storage │
│ Save valid, unique facts │
└──────────────────────────────────────┘
Fact Types
| Type | Description | Examples |
|---|---|---|
| `hobby` | Activities, interests, pastimes | "loves hiking", "plays guitar" |
| `work` | Job, career, professional life | "works as a software engineer at Google" |
| `family` | Family members, relationships | "has two younger sisters" |
| `preference` | Likes, dislikes, preferences | "prefers dark roast coffee" |
| `location` | Places they live, visit, are from | "lives in Amsterdam" |
| `event` | Important life events | "recently got married" |
| `relationship` | Personal relationships | "has a girlfriend named Sarah" |
| `general` | Other facts that don't fit | "speaks three languages" |
Fact Attributes
Each extracted fact has:
| Attribute | Type | Description |
|---|---|---|
| `type` | string | One of the fact types above |
| `content` | string | The fact itself (third person) |
| `confidence` | float | How certain the extraction is |
| `importance` | float | How significant the fact is |
| `temporal` | string | Time relevance |
Confidence Levels
| Level | Value | When to Use |
|---|---|---|
| Implied | 0.6 | Fact is suggested but not stated |
| Stated | 0.8 | Fact is clearly mentioned |
| Explicit | 1.0 | User directly stated the fact |
Importance Levels
| Level | Value | Description |
|---|---|---|
| Trivial | 0.3 | Minor detail |
| Normal | 0.5 | Standard fact |
| Significant | 0.8 | Important information |
| Very Important | 1.0 | Major life fact |
Temporal Relevance
| Value | Description | Example |
|---|---|---|
| `past` | Happened before | "used to live in Paris" |
| `present` | Currently true | "works at Microsoft" |
| `future` | Planned/expected | "getting married next month" |
| `timeless` | Always true | "was born in Japan" |
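The attribute tables above can be turned into a small validation step. The following is a minimal sketch, not the project's actual code: `ExtractedFact`, `parse_fact`, and the fallback defaults (`general`, `present`, 0.8, 0.5) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values copied from the tables above.
FACT_TYPES = {"hobby", "work", "family", "preference",
              "location", "event", "relationship", "general"}
TEMPORAL_VALUES = {"past", "present", "future", "timeless"}


@dataclass(frozen=True)
class ExtractedFact:
    type: str
    content: str
    confidence: float
    importance: float
    temporal: str


def parse_fact(raw: dict) -> Optional[ExtractedFact]:
    """Return a validated fact, or None if the raw dict is unusable.

    Hypothetical helper: the fallback defaults are assumptions.
    """
    content = str(raw.get("content", "")).strip()
    if not content:
        return None
    fact_type = raw.get("type", "general")
    if not isinstance(fact_type, str) or fact_type not in FACT_TYPES:
        fact_type = "general"  # assumed fallback for unknown types
    temporal = raw.get("temporal", "present")
    if not isinstance(temporal, str) or temporal not in TEMPORAL_VALUES:
        temporal = "present"  # assumed fallback for unknown values
    try:
        confidence = float(raw.get("confidence", 0.8))
        importance = float(raw.get("importance", 0.5))
    except (TypeError, ValueError):
        return None
    # Clamp both scores into the documented 0-1 range.
    confidence = min(max(confidence, 0.0), 1.0)
    importance = min(max(importance, 0.0), 1.0)
    return ExtractedFact(fact_type, content, confidence, importance, temporal)
```

Rejecting facts with empty content while repairing out-of-range scores is one possible policy; a stricter variant could drop any fact with an unknown type or temporal value.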
Rate Limiting
To prevent excessive API calls and ensure quality:
# Only attempt extraction on ~30% of messages
# (>= rather than > so that a rate of 0.0 always skips)
if random.random() >= settings.fact_extraction_rate:
    return []  # Skip this message
Configuration:
- `FACT_EXTRACTION_RATE=0.3` (default)
- Can be adjusted from 0.0 (disabled) to 1.0 (every message)
Why Rate Limit?
- Reduces AI API costs
- Not every message contains facts
- Prevents redundant extractions
- Spreads learning over time
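The sampling behaviour can be checked empirically. This sketch assumes a hypothetical helper, `should_extract`, that takes the rate and an optional random source. It uses a strict `<` comparison so that a rate of 0.0 disables extraction entirely and a rate of 1.0 processes every message.

```python
import random
from typing import Optional


def should_extract(rate: float, rng: Optional[random.Random] = None) -> bool:
    """Return True for roughly `rate` of calls (hypothetical helper)."""
    source = rng if rng is not None else random
    # random() is in [0.0, 1.0), so rate=0.0 never passes and rate=1.0 always does.
    return source.random() < rate


rng = random.Random(42)  # seeded only to make the demo repeatable
attempts = 10_000
hits = sum(should_extract(0.3, rng) for _ in range(attempts))
print(f"extraction attempted on {hits / attempts:.1%} of messages")
```

Passing the random source in as a parameter is a design choice for testability; the sampling itself is a single comparison.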
Extractability Checks
Before a message is sent to the AI, it must pass four filters:
Minimum Length
MIN_MESSAGE_LENGTH = 20
if len(content) < MIN_MESSAGE_LENGTH:
    return False
Alpha Ratio
# Must be at least 50% alphabetic characters
alpha_ratio = sum(c.isalpha() for c in content) / len(content)
if alpha_ratio < 0.5:
    return False
Command Detection
# Skip command-like messages
if content.startswith(("!", "/", "?", ".")):
    return False
Short Phrase Filter
short_phrases = [
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow",
]
if content.lower().strip() in short_phrases:
    return False
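Taken together, the four checks above could be combined into a single predicate. This is a sketch under assumptions: the function name `is_extractable`, the order of the checks, and stripping whitespace before measuring length are illustrative choices, not confirmed details of the service.

```python
MIN_MESSAGE_LENGTH = 20
MIN_ALPHA_RATIO = 0.5
COMMAND_PREFIXES = ("!", "/", "?", ".")
SHORT_PHRASES = frozenset({
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow",
})


def is_extractable(content: str) -> bool:
    """Return True if a message is worth sending to the AI (hypothetical helper)."""
    text = content.strip()  # assumption: surrounding whitespace is ignored
    if len(text) < MIN_MESSAGE_LENGTH:
        return False
    if text.startswith(COMMAND_PREFIXES):
        return False
    if text.lower() in SHORT_PHRASES:
        return False
    # Share of alphabetic characters; len(text) >= 20 here, so no division by zero.
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    return alpha_ratio >= MIN_ALPHA_RATIO
```

With a 20-character minimum, every listed short phrase is already rejected by the length check, so the phrase filter only matters if the minimum is lowered.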
AI Extraction Prompt
The system sends the AI a prompt that lists the facts already known about the user, followed by the extraction rules:
You are a fact extraction assistant. Extract factual information
about the user from their message.
ALREADY KNOWN FACTS:
- [hobby] loves hiking
- [work] works as senior engineer at Google
RULES:
1. Only extract CONCRETE facts, not opinions or transient states
2. Skip if the fact is already known (listed above)
3. Skip greetings, questions, or meta-conversation
4. Skip vague statements like "I like stuff" - be specific
5. Focus on: hobbies, work, family, preferences, locations, events, relationships
6. Keep fact content concise (under 100 characters)
7. Maximum 3 facts per message
OUTPUT FORMAT:
Return a JSON array of facts, or empty array [] if no extractable facts.
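A prompt of this shape could be assembled as follows. This is a minimal sketch: `build_extraction_prompt`, the `(none)` placeholder, and the trailing `USER MESSAGE` section are assumptions, since the source does not show how the message itself is attached.

```python
RULES = [
    "Only extract CONCRETE facts, not opinions or transient states",
    "Skip if the fact is already known (listed above)",
    "Skip greetings, questions, or meta-conversation",
    'Skip vague statements like "I like stuff" - be specific',
    "Focus on: hobbies, work, family, preferences, locations, events, relationships",
    "Keep fact content concise (under 100 characters)",
    "Maximum 3 facts per message",
]


def build_extraction_prompt(known_facts: list[tuple[str, str]], message: str) -> str:
    """Assemble the prompt text (hypothetical helper, layout assumed)."""
    if known_facts:
        known = "\n".join(
            f"- [{fact_type}] {content}" for fact_type, content in known_facts
        )
    else:
        known = "(none)"  # assumed placeholder when nothing is known yet
    rules = "\n".join(f"{i}. {rule}" for i, rule in enumerate(RULES, start=1))
    return (
        "You are a fact extraction assistant. Extract factual information\n"
        "about the user from their message.\n"
        f"ALREADY KNOWN FACTS:\n{known}\n"
        f"RULES:\n{rules}\n"
        "OUTPUT FORMAT:\n"
        "Return a JSON array of facts, or empty array [] if no extractable facts.\n"
        # Assumed section: where the user message goes is not shown in the source.
        f"USER MESSAGE:\n{message}"
    )
```

Listing the known facts in the prompt lets the model skip duplicates before the deduplication step runs, which keeps responses shorter.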
Example Input/Output
Input: "I just got promoted to senior engineer at Google last week!"
Output:
[
  {
    "type": "work",
    "content": "works as senior engineer at Google",
    "confidence": 1.0,
    "importance": 0.8,
    "temporal": "present"
  },
  {
    "type": "event",
    "content": "recently got promoted",
    "confidence": 1.0,
    "importance": 0.7,
    "temporal": "past"
  }
]
Input: "hey what's up"
Output:
[]
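Model output is not guaranteed to be well-formed, so the reply is worth parsing defensively. The sketch below assumes a hypothetical `parse_extraction_response` helper. Treating unparseable output as "no facts" and dropping over-long content are illustrative policies, not confirmed behaviour.

```python
import json

MAX_FACTS_PER_MESSAGE = 3
MAX_CONTENT_LENGTH = 100  # mirrors prompt rule 6; the exact cutoff is an assumption


def parse_extraction_response(response_text: str) -> list[dict]:
    """Parse the model's reply into at most three fact dicts (hypothetical helper)."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return []  # treat unparseable output as "no facts"
    if not isinstance(data, list):
        return []
    facts = [
        item
        for item in data
        if isinstance(item, dict)
        and isinstance(item.get("content"), str)
        and 0 < len(item["content"].strip()) <= MAX_CONTENT_LENGTH
    ]
    # Enforce the per-message cap even if the model ignored rule 7.
    return facts[:MAX_FACTS_PER_MESSAGE]
```

Enforcing the cap in code as well as in the prompt means a model that ignores the instructions cannot flood the fact store.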
Deduplication
Before saving, facts are checked for duplicates:
1. Exact Match
# Case-insensitive equality; a plain `in` on strings would be a substring test
if new_content.lower() == existing_content.lower():
    return True  # Is duplicate
2. Substring Check
# If one contains the other (for facts > 10 chars)
if len(new_lower) > 10 and len(existing) > 10:
    if new_lower in existing or existing in new_lower:
        return True
3. Word Overlap (70% threshold)
new_words = set(new_lower.split())
existing_words = set(existing.split())
if len(new_words) > 2 and len(existing_words) > 2:
    overlap = len(new_words & existing_words)
    min_len = min(len(new_words), len(existing_words))
    if overlap / min_len > 0.7:
        return True
Examples:
- "loves hiking" vs "loves hiking" → Duplicate (exact)
- "works as engineer at Google" vs "engineer at Google" → Duplicate (substring)
- "has two younger sisters" vs "has two younger brothers" → Duplicate (3 of 4 words shared, 75% overlap, above the 70% threshold)
- "loves hiking" vs "enjoys cooking" → Not duplicate
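The three checks can be combined into one function that compares a new fact against every stored fact. This is a sketch: the name `is_duplicate` and the list-of-strings interface are assumptions, while the thresholds come from the snippets above.

```python
OVERLAP_THRESHOLD = 0.7
MIN_SUBSTRING_LENGTH = 10


def is_duplicate(new_content: str, existing_contents: list[str]) -> bool:
    """Return True if new_content duplicates any stored fact (hypothetical helper)."""
    new_lower = new_content.lower().strip()
    new_words = set(new_lower.split())
    for raw in existing_contents:
        existing = raw.lower().strip()
        # 1. Exact match (case-insensitive)
        if new_lower == existing:
            return True
        # 2. Substring check, only for facts longer than 10 characters
        if len(new_lower) > MIN_SUBSTRING_LENGTH and len(existing) > MIN_SUBSTRING_LENGTH:
            if new_lower in existing or existing in new_lower:
                return True
        # 3. Word overlap, only when both facts have more than two distinct words
        existing_words = set(existing.split())
        if len(new_words) > 2 and len(existing_words) > 2:
            overlap = len(new_words & existing_words)
            min_len = min(len(new_words), len(existing_words))
            if overlap / min_len > OVERLAP_THRESHOLD:
                return True
    return False


print(is_duplicate("has two younger sisters", ["has two younger brothers"]))
print(is_duplicate("loves hiking", ["enjoys cooking"]))
```

Dividing by the smaller word set makes the check lenient toward short facts: the sisters/brothers pair shares 3 of 4 words and is treated as a duplicate even though the two facts differ in meaning.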
Database Schema
UserFact Table
| Column | Type | Description |
|---|---|---|
| `id` | Integer | Primary key |
| `user_id` | Integer | Foreign key to users |
| `fact_type` | String | Category (hobby, work, etc.) |
| `fact_content` | String | The fact content |
| `confidence` | Float | Extraction confidence (0-1) |
| `source` | String | "auto_extraction" or "manual" |
| `is_active` | Boolean | Whether fact is still valid |
| `learned_at` | DateTime | When fact was learned |
| `category` | String | Same as fact_type |
| `importance` | Float | Importance level (0-1) |
| `temporal_relevance` | String | past/present/future/timeless |
| `extracted_from_message_id` | BigInteger | Discord message ID |
| `extraction_context` | String | First 200 chars of source message |
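To make the column layout concrete, here is a sketch that creates an equivalent table in an in-memory SQLite database using only the standard library. The table name `user_facts`, the `NOT NULL` constraints, and the defaults are assumptions, and the foreign key to users is omitted. The service itself takes an `AsyncSession`, so the real model is presumably defined through an ORM rather than raw SQL.

```python
import sqlite3

# Column names follow the table above; constraints and defaults are assumed.
SCHEMA = """
CREATE TABLE user_facts (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    fact_type TEXT NOT NULL,
    fact_content TEXT NOT NULL,
    confidence REAL,
    source TEXT,
    is_active INTEGER DEFAULT 1,
    learned_at TEXT DEFAULT CURRENT_TIMESTAMP,
    category TEXT,
    importance REAL,
    temporal_relevance TEXT,
    extracted_from_message_id INTEGER,
    extraction_context TEXT
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

message = "I just got promoted to senior engineer at Google last week!"
conn.execute(
    "INSERT INTO user_facts (user_id, fact_type, fact_content, confidence, source,"
    " category, importance, temporal_relevance, extracted_from_message_id,"
    " extraction_context) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (1, "work", "works as senior engineer at Google", 1.0, "auto_extraction",
     "work", 0.8, "present", 123456789, message[:200]),  # context: first 200 chars
)

# Only active facts are read back.
rows = conn.execute(
    "SELECT fact_type, fact_content FROM user_facts"
    " WHERE user_id = ? AND is_active = 1",
    (1,),
).fetchall()
print(rows)
```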
API Reference
FactExtractionService
class FactExtractionService:
    MIN_MESSAGE_LENGTH = 20
    MAX_FACTS_PER_MESSAGE = 3

    def __init__(
        self,
        session: AsyncSession,
        ai_service=None,
    ) -> None: ...

    async def maybe_extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]:
        """Rate-limited extraction."""

    async def extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]:
        """Direct extraction (no rate limiting)."""
Configuration
| Variable | Default | Description |
|---|---|---|
| `FACT_EXTRACTION_ENABLED` | `true` | Enable/disable fact extraction |
| `FACT_EXTRACTION_RATE` | `0.3` | Probability of extraction (0-1) |
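One way these two variables might be read from the environment is sketched below. `FactExtractionSettings` and `load_settings` are hypothetical names, and the accepted truthy strings and the clamping of the rate are assumptions rather than documented behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping

TRUTHY = {"1", "true", "yes", "on"}  # assumed set of accepted "enabled" values


@dataclass(frozen=True)
class FactExtractionSettings:
    enabled: bool = True
    rate: float = 0.3


def load_settings(env: Mapping[str, str] = os.environ) -> FactExtractionSettings:
    """Read the two variables from a mapping (hypothetical helper)."""
    enabled = env.get("FACT_EXTRACTION_ENABLED", "true").strip().lower() in TRUTHY
    # float() raises ValueError on malformed input; that is left to the caller.
    rate = float(env.get("FACT_EXTRACTION_RATE", "0.3"))
    rate = min(max(rate, 0.0), 1.0)  # keep the probability inside 0-1
    return FactExtractionSettings(enabled=enabled, rate=rate)


print(load_settings({"FACT_EXTRACTION_RATE": "0.5"}))
```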
Example Usage
from daemon_boyfriend.services.fact_extraction_service import FactExtractionService

async with get_session() as session:
    fact_service = FactExtractionService(session, ai_service)

    # Rate-limited extraction (recommended for normal use)
    new_facts = await fact_service.maybe_extract_facts(
        user=user,
        message_content="I just started learning Japanese!",
        discord_message_id=123456789,
    )
    for fact in new_facts:
        print(f"Learned: [{fact.fact_type}] {fact.fact_content}")

    # Direct extraction (skips rate limiting)
    facts = await fact_service.extract_facts(
        user=user,
        message_content="I work at Microsoft as a PM",
    )
    await session.commit()
Manual Fact Addition
Users can also add facts manually:
!remember Command
User: !remember I'm allergic to peanuts
Bot: Got it! I'll remember that you're allergic to peanuts.
These facts have:
- `source = "manual"` instead of `"auto_extraction"`
- `confidence = 1.0` (user stated directly)
- `importance = 0.8` (user wanted it remembered)
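A manual fact could be built from the command text roughly as follows. This is a sketch only: `parse_remember_command` is a hypothetical helper, and it stores the text exactly as typed. Whether and how the bot rewrites the text into third person is not shown in the source.

```python
from typing import Optional

REMEMBER_COMMAND = "!remember"


def parse_remember_command(message: str) -> Optional[dict]:
    """Return the fields for a manual fact, or None if this is not a valid command."""
    parts = message.strip().split(maxsplit=1)
    # Require the exact command word followed by some text.
    if len(parts) < 2 or parts[0].lower() != REMEMBER_COMMAND:
        return None
    return {
        "fact_content": parts[1].strip(),
        "source": "manual",   # instead of "auto_extraction"
        "confidence": 1.0,    # user stated it directly
        "importance": 0.8,    # user wanted it remembered
    }


print(parse_remember_command("!remember I'm allergic to peanuts"))
```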
Admin Command
Admin: !teachbot @user Works night shifts
Bot: Got it! I've noted that about @user.
Fact Retrieval
Facts are used in AI prompts for context:
# Build user context including facts
async def build_user_context(user: User) -> str:
    facts = await get_active_facts(user)
    context = f"User: {user.custom_name or user.discord_name}\n"
    context += "Known facts:\n"
    for fact in facts:
        context += f"- {fact.fact_content}\n"
    return context
Example Context
User: Alex
Known facts:
- works as senior engineer at Google
- loves hiking on weekends
- has two cats named Luna and Stella
- prefers dark roast coffee
- speaks English and Japanese
Design Considerations
Why Third Person?
Facts are stored in third person ("loves hiking" not "I love hiking"):
- Easier to inject into prompts
- Consistent format
- Works in any context
Why Rate Limit?
- Not every message contains facts
- AI API calls are expensive
- Quality over quantity
- Natural learning pace
Why Deduplication?
- Prevents redundant storage
- Keeps fact list clean
- Reduces noise in prompts
- Respects user privacy (one fact = one entry)
Privacy Considerations
- Facts can be viewed with `!whatdoyouknow`
- Facts can be deleted with `!forgetme`
- Extraction context is stored (can be audited)
- Source message ID is stored (for reference)