# Fact Extraction System Deep Dive

The fact extraction system autonomously learns facts about users from their conversations with the bot.

## Overview

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                           Fact Extraction Pipeline                           │
└──────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
                    ┌──────────────────────────────────────┐
                    │          Rate Limiter (30%)          │
                    │    Only process ~30% of messages     │
                    └──────────────────────────────────────┘
                                       │
                                       ▼
                    ┌──────────────────────────────────────┐
                    │         Extractability Check         │
                    │  - Min 20 chars                      │
                    │  - Not a command                     │
                    │  - Not just greetings                │
                    │  - Has enough text content           │
                    └──────────────────────────────────────┘
                                       │
                                       ▼
                    ┌──────────────────────────────────────┐
                    │          AI Fact Extraction          │
                    │      Extracts structured facts       │
                    └──────────────────────────────────────┘
                                       │
                                       ▼
                    ┌──────────────────────────────────────┐
                    │            Deduplication             │
                    │  - Exact match check                 │
                    │  - Substring check                   │
                    │  - Word overlap check (70%)          │
                    └──────────────────────────────────────┘
                                       │
                                       ▼
                    ┌──────────────────────────────────────┐
                    │         Validation & Storage         │
                    │       Save valid, unique facts       │
                    └──────────────────────────────────────┘
```

---

## Fact Types

| Type | Description | Examples |
|------|-------------|----------|
| `hobby` | Activities, interests, pastimes | "loves hiking", "plays guitar" |
| `work` | Job, career, professional life | "works as a software engineer at Google" |
| `family` | Family members, relationships | "has two younger sisters" |
| `preference` | Likes, dislikes, preferences | "prefers dark roast coffee" |
| `location` | Places they live, visit, are from | "lives in Amsterdam" |
| `event` | Important life events | "recently got married" |
| `relationship` | Personal relationships | "has a girlfriend named Sarah" |
| `general` | Other facts that don't fit | "speaks three languages" |

---

## Fact Attributes

Each extracted fact has:

| Attribute | Type | Description |
|-----------|------|-------------|
| `type` | string | One of the fact types above |
| `content` | string | The fact itself (third person) |
| `confidence` | float | How certain the extraction is |
| `importance` | float | How significant the fact is |
| `temporal` | string | Time relevance |

### Confidence Levels

| Level | Value | When to Use |
|-------|-------|-------------|
| Implied | 0.6 | Fact is suggested but not stated |
| Stated | 0.8 | Fact is clearly mentioned |
| Explicit | 1.0 | User directly stated the fact |

### Importance Levels

| Level | Value | Description |
|-------|-------|-------------|
| Trivial | 0.3 | Minor detail |
| Normal | 0.5 | Standard fact |
| Significant | 0.8 | Important information |
| Very Important | 1.0 | Major life fact |

### Temporal Relevance

| Value | Description | Example |
|-------|-------------|---------|
| `past` | Happened before | "used to live in Paris" |
| `present` | Currently true | "works at Microsoft" |
| `future` | Planned/expected | "getting married next month" |
| `timeless` | Always true | "was born in Japan" |

---

## Rate Limiting

To limit AI API calls, extraction is attempted on only a random sample of messages:

```python
import random

# Attempt extraction on roughly 30% of messages.
# Using >= means a rate of 0.0 skips every message (random.random() can return 0.0).
if random.random() >= settings.fact_extraction_rate:
    return []  # Skip this message
```

**Configuration:**
- `FACT_EXTRACTION_RATE` = 0.3 (default)
- Can be adjusted from 0.0 (disabled) to 1.0 (every message)

**Why Rate Limit?**
- Reduces AI API costs
- Not every message contains facts
- Prevents redundant extractions
- Spreads learning over time

---

## Extractability Checks

Before a message is sent to the AI, it must pass the following filters:

### Minimum Length

```python
MIN_MESSAGE_LENGTH = 20

if len(content) < MIN_MESSAGE_LENGTH:
    return False
```

### Alpha Ratio

```python
# Must be at least 50% alphabetic characters
alpha_ratio = sum(c.isalpha() for c in content) / len(content)
if alpha_ratio < 0.5:
    return False
```

### Command Detection

```python
# Skip command-like messages
if content.startswith(("!", "/", "?", ".")):
    return False
```

### Short Phrase Filter

```python
short_phrases = [
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow"
]

# Compare the whole message (case-insensitive, whitespace-trimmed)
if content.lower().strip() in short_phrases:
    return False
```

Every phrase in this list is shorter than `MIN_MESSAGE_LENGTH`. If the length check runs first, as listed here, this filter only catches such phrases when surrounding whitespace pads the message to 20 characters or more.

---

## AI Extraction Prompt

The system sends the following prompt to the AI, with the user's already known facts filled in:

```
You are a fact extraction assistant. Extract factual information about the user from their message.

ALREADY KNOWN FACTS:
- [hobby] loves hiking
- [work] works as senior engineer at Google

RULES:
1. Only extract CONCRETE facts, not opinions or transient states
2. Skip if the fact is already known (listed above)
3. Skip greetings, questions, or meta-conversation
4. Skip vague statements like "I like stuff" - be specific
5. Focus on: hobbies, work, family, preferences, locations, events, relationships
6. Keep fact content concise (under 100 characters)
7. Maximum 3 facts per message

OUTPUT FORMAT:
Return a JSON array of facts, or empty array [] if no extractable facts.
```

### Example Input/Output

**Input:** "I just got promoted to senior engineer at Google last week!"

**Output:**
```json
[
  {
    "type": "work",
    "content": "works as senior engineer at Google",
    "confidence": 1.0,
    "importance": 0.8,
    "temporal": "present"
  },
  {
    "type": "event",
    "content": "recently got promoted",
    "confidence": 1.0,
    "importance": 0.7,
    "temporal": "past"
  }
]
```

**Input:** "hey what's up"

**Output:**
```json
[]
```

---

## Deduplication

Before saving, each new fact is compared against the user's existing facts:

### 1. Exact Match

```python
# `existing_content` must be a collection of lowercased fact strings here;
# if it were a single string, `in` would perform a substring test instead.
if new_content.lower() in existing_content:
    return True  # Is duplicate
```

### 2. Substring Check

```python
# If one contains the other (for facts > 10 chars)
if len(new_lower) > 10 and len(existing) > 10:
    if new_lower in existing or existing in new_lower:
        return True
```

### 3. Word Overlap (70% threshold)

```python
new_words = set(new_lower.split())
existing_words = set(existing.split())

# Only compare facts with more than two distinct words each
if len(new_words) > 2 and len(existing_words) > 2:
    overlap = len(new_words & existing_words)
    min_len = min(len(new_words), len(existing_words))
    if overlap / min_len > 0.7:
        return True
```

**Examples:**
- "loves hiking" vs "loves hiking" → **Duplicate** (exact)
- "works as engineer at Google" vs "engineer at Google" → **Duplicate** (substring)
- "has two younger sisters" vs "has two younger brothers" → **Duplicate** (3 of 4 words shared = 75%, above the 70% threshold)
- "loves hiking" vs "enjoys cooking" → **Not duplicate**

The third example shows a limitation of the word overlap heuristic: the two facts describe different family members, yet the second one is discarded as a duplicate.

---

## Database Schema

### UserFact Table

| Column | Type | Description |
|--------|------|-------------|
| `id` | Integer | Primary key |
| `user_id` | Integer | Foreign key to users |
| `fact_type` | String | Category (hobby, work, etc.) |
| `fact_content` | String | The fact content |
| `confidence` | Float | Extraction confidence (0-1) |
| `source` | String | "auto_extraction" or "manual" |
| `is_active` | Boolean | Whether fact is still valid |
| `learned_at` | DateTime | When fact was learned |
| `category` | String | Same as fact_type |
| `importance` | Float | Importance level (0-1) |
| `temporal_relevance` | String | past/present/future/timeless |
| `extracted_from_message_id` | BigInteger | Discord message ID |
| `extraction_context` | String | First 200 chars of source message |

---

## API Reference

### FactExtractionService

```python
class FactExtractionService:
    MIN_MESSAGE_LENGTH = 20
    MAX_FACTS_PER_MESSAGE = 3

    def __init__(self, session: AsyncSession, ai_service=None) -> None: ...

    # Rate-limited extraction
    async def maybe_extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]: ...

    # Direct extraction (no rate limiting)
    async def extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]: ...
```

---

## Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `FACT_EXTRACTION_ENABLED` | `true` | Enable/disable fact extraction |
| `FACT_EXTRACTION_RATE` | `0.3` | Probability of extraction (0-1) |

---

## Example Usage

```python
from daemon_boyfriend.services.fact_extraction_service import FactExtractionService

# Run inside an async function; `get_session`, `ai_service`, and `user`
# come from the surrounding application code.
async with get_session() as session:
    fact_service = FactExtractionService(session, ai_service)

    # Rate-limited extraction (recommended for normal use)
    new_facts = await fact_service.maybe_extract_facts(
        user=user,
        message_content="I just started learning Japanese!",
        discord_message_id=123456789,
    )

    for fact in new_facts:
        print(f"Learned: [{fact.fact_type}] {fact.fact_content}")

    # Direct extraction (skips rate limiting)
    facts = await fact_service.extract_facts(
        user=user,
        message_content="I work at Microsoft as a PM",
    )

    await session.commit()
```

---

## Manual Fact Addition

Users can also add facts manually:

### !remember Command

```
User: !remember I'm allergic to peanuts
Bot: Got it! I'll remember that you're allergic to peanuts.
```

These facts have:
- `source = "manual"` instead of `"auto_extraction"`
- `confidence = 1.0` (user stated directly)
- `importance = 0.8` (user wanted it remembered)

### Admin Command

```
Admin: !teachbot @user Works night shifts
Bot: Got it! I've noted that about @user.
```

---

## Fact Retrieval

Stored facts are injected into AI prompts as user context:

```python
# Build user context including facts
async def build_user_context(user: User) -> str:
    facts = await get_active_facts(user)

    context = f"User: {user.custom_name or user.discord_name}\n"
    context += "Known facts:\n"

    for fact in facts:
        context += f"- {fact.fact_content}\n"

    return context
```

### Example Context

```
User: Alex
Known facts:
- works as senior engineer at Google
- loves hiking on weekends
- has two cats named Luna and Stella
- prefers dark roast coffee
- speaks English and Japanese
```

---

## Design Considerations

### Why Third Person?

Facts are stored in third person ("loves hiking" not "I love hiking"):
- Easier to inject into prompts
- Consistent format
- Works in any context

### Why Rate Limit?

- Not every message contains facts
- AI API calls are expensive
- Quality over quantity
- Natural learning pace

### Why Deduplication?

- Prevents redundant storage
- Keeps fact list clean
- Reduces noise in prompts
- Respects user privacy (one fact = one entry)

### Privacy Considerations

- Facts can be viewed with `!whatdoyouknow`
- Facts can be deleted with `!forgetme`
- Extraction context is stored (can be audited)
- Source message ID is stored (for reference)
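
---

## Worked Sketch: Filtering and Deduplication

The extractability checks and the three deduplication checks described in this document can be combined into a small standalone sketch. This is an illustration under stated assumptions, not the service's actual code: the function names `is_extractable` and `is_duplicate` are hypothetical, existing facts are passed as a plain list of strings rather than loaded from the `UserFact` table, and the exact-match step uses `==` on lowercased, trimmed strings.

```python
MIN_MESSAGE_LENGTH = 20
COMMAND_PREFIXES = ("!", "/", "?", ".")
SHORT_PHRASES = frozenset({
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye", "thanks",
    "thank you", "ok", "okay", "yes", "no", "yeah", "nah", "lol",
    "lmao", "haha", "hehe", "nice", "cool", "wow",
})


def is_extractable(content: str) -> bool:
    """Hypothetical helper: apply the four extractability filters in order."""
    if len(content) < MIN_MESSAGE_LENGTH:
        return False
    # At least half of the characters must be alphabetic.
    alpha_ratio = sum(c.isalpha() for c in content) / len(content)
    if alpha_ratio < 0.5:
        return False
    if content.startswith(COMMAND_PREFIXES):
        return False
    if content.lower().strip() in SHORT_PHRASES:
        return False
    return True


def is_duplicate(new_content: str, existing_facts: list[str]) -> bool:
    """Hypothetical helper: exact, substring, then word-overlap check."""
    new_lower = new_content.lower().strip()
    new_words = set(new_lower.split())
    for fact in existing_facts:
        existing = fact.lower().strip()
        # 1. Exact match (assumption: compared as whole strings)
        if new_lower == existing:
            return True
        # 2. Substring check, only for facts longer than 10 characters
        if len(new_lower) > 10 and len(existing) > 10:
            if new_lower in existing or existing in new_lower:
                return True
        # 3. Word overlap above 70% of the shorter fact's word count
        existing_words = set(existing.split())
        if len(new_words) > 2 and len(existing_words) > 2:
            overlap = len(new_words & existing_words)
            if overlap / min(len(new_words), len(existing_words)) > 0.7:
                return True
    return False


if __name__ == "__main__":
    known = [
        "loves hiking",
        "works as engineer at Google",
        "has two younger sisters",
    ]
    for candidate in ("engineer at Google", "has two younger brothers", "enjoys cooking"):
        print(candidate, "->", is_duplicate(candidate, known))
```

Running the sketch against the examples from the Deduplication section reproduces the listed outcomes, including the "sisters" vs "brothers" false positive. Raising the overlap threshold or comparing only facts of the same type would be possible ways to reduce such cases, though neither is described in the source.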