loyal_companion/docs/living-ai/fact-extraction.md
2026-01-13 17:20:52 +00:00

Fact Extraction System Deep Dive

The fact extraction system autonomously learns facts about users from their conversations with the bot.

Overview

┌──────────────────────────────────────────────────────────────────────────────┐
│                          Fact Extraction Pipeline                             │
└──────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
                    ┌──────────────────────────────────────┐
                    │         Rate Limiter (30%)           │
                    │  Only process ~30% of messages       │
                    └──────────────────────────────────────┘
                                    │
                                    ▼
                    ┌──────────────────────────────────────┐
                    │      Extractability Check            │
                    │  - Min 20 chars                      │
                    │  - Not a command                     │
                    │  - Not just greetings                │
                    │  - Has enough text content           │
                    └──────────────────────────────────────┘
                                    │
                                    ▼
                    ┌──────────────────────────────────────┐
                    │      AI Fact Extraction              │
                    │  Extracts structured facts           │
                    └──────────────────────────────────────┘
                                    │
                                    ▼
                    ┌──────────────────────────────────────┐
                    │      Deduplication                   │
                    │  - Exact match check                 │
                    │  - Substring check                   │
                    │  - Word overlap check (70%)          │
                    └──────────────────────────────────────┘
                                    │
                                    ▼
                    ┌──────────────────────────────────────┐
                    │      Validation & Storage            │
                    │  Save valid, unique facts            │
                    └──────────────────────────────────────┘

Fact Types

| Type | Description | Examples |
|------|-------------|----------|
| hobby | Activities, interests, pastimes | "loves hiking", "plays guitar" |
| work | Job, career, professional life | "works as a software engineer at Google" |
| family | Family members, relationships | "has two younger sisters" |
| preference | Likes, dislikes, preferences | "prefers dark roast coffee" |
| location | Places they live, visit, are from | "lives in Amsterdam" |
| event | Important life events | "recently got married" |
| relationship | Personal relationships | "has a girlfriend named Sarah" |
| general | Other facts that don't fit | "speaks three languages" |

Fact Attributes

Each extracted fact has:

| Attribute | Type | Description |
|-----------|------|-------------|
| type | string | One of the fact types above |
| content | string | The fact itself (third person) |
| confidence | float | How certain the extraction is |
| importance | float | How significant the fact is |
| temporal | string | Time relevance |

Confidence Levels

| Level | Value | When to Use |
|-------|-------|-------------|
| Implied | 0.6 | Fact is suggested but not stated |
| Stated | 0.8 | Fact is clearly mentioned |
| Explicit | 1.0 | User directly stated the fact |

Importance Levels

| Level | Value | Description |
|-------|-------|-------------|
| Trivial | 0.3 | Minor detail |
| Normal | 0.5 | Standard fact |
| Significant | 0.8 | Important information |
| Very Important | 1.0 | Major life fact |

Temporal Relevance

| Value | Description | Example |
|-------|-------------|---------|
| past | Happened before | "used to live in Paris" |
| present | Currently true | "works at Microsoft" |
| future | Planned/expected | "getting married next month" |
| timeless | Always true | "was born in Japan" |
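
The attributes above can be captured in a small validated record. This is a minimal sketch: the `ExtractedFact` name, the default values, and the validation rules are assumptions made for illustration, not the project's actual model.

```python
from dataclasses import dataclass

FACT_TYPES = {
    "hobby", "work", "family", "preference",
    "location", "event", "relationship", "general",
}
TEMPORAL_VALUES = {"past", "present", "future", "timeless"}


@dataclass(frozen=True)
class ExtractedFact:
    """Hypothetical container for one extracted fact."""

    type: str
    content: str
    confidence: float = 0.8   # assumed default: "Stated"
    importance: float = 0.5   # assumed default: "Normal"
    temporal: str = "present"  # assumed default

    def __post_init__(self) -> None:
        # Reject values outside the documented vocabularies and ranges.
        if self.type not in FACT_TYPES:
            raise ValueError(f"unknown fact type: {self.type!r}")
        if self.temporal not in TEMPORAL_VALUES:
            raise ValueError(f"unknown temporal value: {self.temporal!r}")
        for name in ("confidence", "importance"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within 0-1, got {value}")
```

Validating at construction time means a malformed AI response fails early instead of reaching storage.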

Rate Limiting

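To limit AI API calls, extraction is attempted on only a random subset of messages:

import random

# Only attempt extraction on ~30% of messages.
# `>=` ensures a rate of 0.0 never triggers extraction, since random.random() can return 0.0.
if random.random() >= settings.fact_extraction_rate:
    return []  # Skip this message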

Configuration:

  • FACT_EXTRACTION_RATE = 0.3 (default)
  • Can be adjusted from 0.0 (disabled) to 1.0 (every message)

Why Rate Limit?

  • Reduces AI API costs
  • Not every message contains facts
  • Prevents redundant extractions
  • Spreads learning over time

Extractability Checks

Before a message is sent to the AI, it must pass all of the following filters:

Minimum Length

MIN_MESSAGE_LENGTH = 20
if len(content) < MIN_MESSAGE_LENGTH:
    return False

Alpha Ratio

# Must be at least 50% alphabetic characters
alpha_ratio = sum(c.isalpha() for c in content) / len(content)
if alpha_ratio < 0.5:
    return False

Command Detection

# Skip command-like messages
if content.startswith(("!", "/", "?", ".")):
    return False

Short Phrase Filter

short_phrases = [
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow"
]
if content.lower().strip() in short_phrases:
    return False
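
The four checks can be combined into a single predicate. The sketch below follows the order shown in the pipeline diagram; the function name `is_extractable` and the constant names other than `MIN_MESSAGE_LENGTH` are assumptions.

```python
MIN_MESSAGE_LENGTH = 20
MIN_ALPHA_RATIO = 0.5
COMMAND_PREFIXES = ("!", "/", "?", ".")
SHORT_PHRASES = frozenset({
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow",
})


def is_extractable(content: str) -> bool:
    """Return True if the message is worth sending to the AI."""
    text = content.strip()
    if len(text) < MIN_MESSAGE_LENGTH:
        return False  # too short to carry a fact
    if text.startswith(COMMAND_PREFIXES):
        return False  # bot command, not conversation
    if text.lower() in SHORT_PHRASES:
        return False  # greeting or filler
    # Share of alphabetic characters; len(text) > 0 is guaranteed
    # by the length check above.
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    return alpha_ratio >= MIN_ALPHA_RATIO


if __name__ == "__main__":
    for sample in ("I just started learning Japanese!", "hey what's up", "!remember I'm allergic to peanuts"):
        print(repr(sample), is_extractable(sample))
```

Every phrase in the list is shorter than 20 characters, so when the length check runs first it already rejects them; the phrase filter only matters if the minimum length is lowered.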

AI Extraction Prompt

The system sends the following prompt to the AI, with the list of already known facts filled in for the current user:

You are a fact extraction assistant. Extract factual information 
about the user from their message.

ALREADY KNOWN FACTS:
- [hobby] loves hiking
- [work] works as senior engineer at Google

RULES:
1. Only extract CONCRETE facts, not opinions or transient states
2. Skip if the fact is already known (listed above)
3. Skip greetings, questions, or meta-conversation
4. Skip vague statements like "I like stuff" - be specific
5. Focus on: hobbies, work, family, preferences, locations, events, relationships
6. Keep fact content concise (under 100 characters)
7. Maximum 3 facts per message

OUTPUT FORMAT:
Return a JSON array of facts, or empty array [] if no extractable facts.
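
A prompt like the one above can be assembled from the user's stored facts. This is a minimal sketch: `build_extraction_prompt`, the `(fact_type, fact_content)` tuple input, the `- (none)` placeholder, and the trailing `USER MESSAGE:` section are assumptions, since the document does not show how the message itself is attached.

```python
RULES = [
    "Only extract CONCRETE facts, not opinions or transient states",
    "Skip if the fact is already known (listed above)",
    "Skip greetings, questions, or meta-conversation",
    'Skip vague statements like "I like stuff" - be specific',
    "Focus on: hobbies, work, family, preferences, locations, events, relationships",
    "Keep fact content concise (under 100 characters)",
    "Maximum 3 facts per message",
]


def build_extraction_prompt(known_facts: list[tuple[str, str]], message: str) -> str:
    """Assemble the prompt from (fact_type, fact_content) pairs and the new message."""
    known = "\n".join(f"- [{t}] {c}" for t, c in known_facts) or "- (none)"
    rules = "\n".join(f"{i}. {rule}" for i, rule in enumerate(RULES, start=1))
    return (
        "You are a fact extraction assistant. Extract factual information\n"
        "about the user from their message.\n\n"
        f"ALREADY KNOWN FACTS:\n{known}\n\n"
        f"RULES:\n{rules}\n\n"
        "OUTPUT FORMAT:\n"
        "Return a JSON array of facts, or empty array [] if no extractable facts.\n\n"
        # Assumption: the user message is appended under its own heading.
        f"USER MESSAGE:\n{message}"
    )


if __name__ == "__main__":
    print(build_extraction_prompt([("hobby", "loves hiking")], "I moved to Amsterdam"))
```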

Example Input/Output

Input: "I just got promoted to senior engineer at Google last week!"

Output:

[
  {
    "type": "work",
    "content": "works as senior engineer at Google",
    "confidence": 1.0,
    "importance": 0.8,
    "temporal": "present"
  },
  {
    "type": "event",
    "content": "recently got promoted",
    "confidence": 1.0,
    "importance": 0.7,
    "temporal": "past"
  }
]

Input: "hey what's up"

Output:

[]
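
Because the AI's reply is free text, it is worth parsing it defensively. The sketch below assumes a hypothetical `parse_extraction_response` helper; dropping over-long facts (rather than truncating them) and the fallback scores are illustrative choices, not documented behaviour.

```python
import json

MAX_FACTS_PER_MESSAGE = 3
MAX_CONTENT_LENGTH = 100  # rule 6: content must be under 100 characters
REQUIRED_KEYS = {"type", "content"}


def _score(value, default: float) -> float:
    """Coerce a score to a float within 0-1, falling back to `default`."""
    try:
        return min(1.0, max(0.0, float(value)))
    except (TypeError, ValueError):
        return default


def parse_extraction_response(raw: str) -> list[dict]:
    """Parse the AI's JSON reply into at most MAX_FACTS_PER_MESSAGE fact dicts."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed reply: treat as "no facts"
    if not isinstance(data, list):
        return []
    facts = []
    for item in data:
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            continue
        content = str(item["content"]).strip()
        if not content or len(content) >= MAX_CONTENT_LENGTH:
            continue
        facts.append({
            "type": str(item["type"]),
            "content": content,
            "confidence": _score(item.get("confidence"), 0.8),
            "importance": _score(item.get("importance"), 0.5),
            "temporal": str(item.get("temporal", "present")),
        })
    return facts[:MAX_FACTS_PER_MESSAGE]
```

Returning an empty list on any parse failure keeps the caller's logic identical to the "no extractable facts" case.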

Deduplication

Before saving, facts are checked for duplicates:

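1. Exact Match

# existing_content is assumed to be a collection (list or set) of the
# lowercased contents of the user's stored facts, so `in` tests membership.
if new_content.lower() in existing_content:
    return True  # Is duplicate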

2. Substring Check

# If one contains the other (for facts > 10 chars)
if len(new_lower) > 10 and len(existing) > 10:
    if new_lower in existing or existing in new_lower:
        return True

3. Word Overlap (70% threshold)

new_words = set(new_lower.split())
existing_words = set(existing.split())

if len(new_words) > 2 and len(existing_words) > 2:
    overlap = len(new_words & existing_words)
    min_len = min(len(new_words), len(existing_words))
    if overlap / min_len > 0.7:
        return True

Examples:

  • "loves hiking" vs "loves hiking" → Duplicate (exact)
  • "works as engineer at Google" vs "engineer at Google" → Duplicate (substring)
  • "has two younger sisters" vs "has two younger brothers" → Duplicate (3 of 4 words overlap, 75% > 70%), even though the two facts differ
  • "loves hiking" vs "enjoys cooking" → Not duplicate
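
The three checks can be chained into one function that compares a candidate against every stored fact. This is a sketch: the `is_duplicate` name and the list-of-strings input are assumptions, while the thresholds come from the snippets above.

```python
OVERLAP_THRESHOLD = 0.7
MIN_SUBSTRING_LENGTH = 10
MIN_WORDS_FOR_OVERLAP = 2


def is_duplicate(new_content: str, existing_contents: list[str]) -> bool:
    """Return True if new_content duplicates any already stored fact."""
    new_lower = new_content.lower().strip()
    new_words = set(new_lower.split())
    for existing in existing_contents:
        existing_lower = existing.lower().strip()
        # 1. Exact match (case-insensitive)
        if new_lower == existing_lower:
            return True
        # 2. One fact contains the other (only for facts > 10 chars)
        if len(new_lower) > MIN_SUBSTRING_LENGTH and len(existing_lower) > MIN_SUBSTRING_LENGTH:
            if new_lower in existing_lower or existing_lower in new_lower:
                return True
        # 3. Word overlap relative to the shorter fact
        existing_words = set(existing_lower.split())
        if len(new_words) > MIN_WORDS_FOR_OVERLAP and len(existing_words) > MIN_WORDS_FOR_OVERLAP:
            overlap = len(new_words & existing_words)
            if overlap / min(len(new_words), len(existing_words)) > OVERLAP_THRESHOLD:
                return True
    return False
```

Word overlap is a purely lexical measure, so facts that differ in a single key word can be rejected as duplicates; the threshold trades such false positives against storing near-identical rephrasings.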

Database Schema

UserFact Table

| Column | Type | Description |
|--------|------|-------------|
| id | Integer | Primary key |
| user_id | Integer | Foreign key to users |
| fact_type | String | Category (hobby, work, etc.) |
| fact_content | String | The fact content |
| confidence | Float | Extraction confidence (0-1) |
| source | String | "auto_extraction" or "manual" |
| is_active | Boolean | Whether fact is still valid |
| learned_at | DateTime | When fact was learned |
| category | String | Same as fact_type |
| importance | Float | Importance level (0-1) |
| temporal_relevance | String | past/present/future/timeless |
| extracted_from_message_id | BigInteger | Discord message ID |
| extraction_context | String | First 200 chars of source message |
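
To make the column list concrete, the sketch below mirrors it in plain SQL using the standard-library `sqlite3` module. The service takes an `AsyncSession`, so the real table is presumably defined as an ORM model; the table name `user_facts`, the defaults, the nullability, and the helper functions here are all assumptions, and the foreign key is omitted because no users table is defined in this sketch.

```python
import sqlite3

SCHEMA = """
CREATE TABLE user_facts (
    id                        INTEGER PRIMARY KEY,
    user_id                   INTEGER NOT NULL,
    fact_type                 TEXT    NOT NULL,
    fact_content              TEXT    NOT NULL,
    confidence                REAL    DEFAULT 1.0,
    source                    TEXT    DEFAULT 'auto_extraction',
    is_active                 INTEGER DEFAULT 1,
    learned_at                TEXT    DEFAULT CURRENT_TIMESTAMP,
    category                  TEXT,
    importance                REAL    DEFAULT 0.5,
    temporal_relevance        TEXT    DEFAULT 'timeless',
    extracted_from_message_id INTEGER,
    extraction_context        TEXT
)
"""


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(SCHEMA)


def save_fact(conn, user_id, fact_type, content, source_message, message_id=None) -> None:
    """Insert one fact; category mirrors fact_type, context keeps 200 chars."""
    conn.execute(
        "INSERT INTO user_facts (user_id, fact_type, category, fact_content,"
        " extracted_from_message_id, extraction_context) VALUES (?, ?, ?, ?, ?, ?)",
        (user_id, fact_type, fact_type, content, message_id, source_message[:200]),
    )


def active_facts(conn, user_id) -> list[str]:
    """Return the contents of all facts that are still marked active."""
    rows = conn.execute(
        "SELECT fact_content FROM user_facts"
        " WHERE user_id = ? AND is_active = 1 ORDER BY id",
        (user_id,),
    )
    return [row[0] for row in rows]
```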

API Reference

FactExtractionService

# Signatures only; method bodies are omitted.
class FactExtractionService:
    MIN_MESSAGE_LENGTH = 20
    MAX_FACTS_PER_MESSAGE = 3

    def __init__(
        self,
        session: AsyncSession,
        ai_service=None,
    ) -> None: ...

    async def maybe_extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]:
        """Rate-limited extraction."""

    async def extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]:
        """Direct extraction (no rate limiting)."""

Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| FACT_EXTRACTION_ENABLED | true | Enable/disable fact extraction |
| FACT_EXTRACTION_RATE | 0.3 | Probability of extraction (0-1) |

Example Usage

from daemon_boyfriend.services.fact_extraction_service import FactExtractionService

async with get_session() as session:
    fact_service = FactExtractionService(session, ai_service)
    
    # Rate-limited extraction (recommended for normal use)
    new_facts = await fact_service.maybe_extract_facts(
        user=user,
        message_content="I just started learning Japanese!",
        discord_message_id=123456789
    )
    
    for fact in new_facts:
        print(f"Learned: [{fact.fact_type}] {fact.fact_content}")
    
    # Direct extraction (skips rate limiting)
    facts = await fact_service.extract_facts(
        user=user,
        message_content="I work at Microsoft as a PM"
    )
    
    await session.commit()

Manual Fact Addition

Users can also add facts manually:

!remember Command

User: !remember I'm allergic to peanuts

Bot: Got it! I'll remember that you're allergic to peanuts.

These facts have:

  • source = "manual" instead of "auto_extraction"
  • confidence = 1.0 (user stated directly)
  • importance = 0.8 (user wanted it remembered)

Admin Command

Admin: !teachbot @user Works night shifts

Bot: Got it! I've noted that about @user.

Fact Retrieval

Facts are used in AI prompts for context:

# Build user context including facts
async def build_user_context(user: User) -> str:
    facts = await get_active_facts(user)
    
    context = f"User: {user.custom_name or user.discord_name}\n"
    context += "Known facts:\n"
    
    for fact in facts:
        context += f"- {fact.fact_content}\n"
    
    return context

Example Context

User: Alex
Known facts:
- works as senior engineer at Google
- loves hiking on weekends
- has two cats named Luna and Stella
- prefers dark roast coffee
- speaks English and Japanese

Design Considerations

Why Third Person?

Facts are stored in third person ("loves hiking" not "I love hiking"):

  • Easier to inject into prompts
  • Consistent format
  • Works in any context

Why Rate Limit?

  • Not every message contains facts
  • AI API calls are expensive
  • Quality over quantity
  • Natural learning pace

Why Deduplication?

  • Prevents redundant storage
  • Keeps fact list clean
  • Reduces noise in prompts
  • Respects user privacy (one fact = one entry)

Privacy Considerations

  • Facts can be viewed with !whatdoyouknow
  • Facts can be deleted with !forgetme
  • Extraction context is stored (can be audited)
  • Source message ID is stored (for reference)