Hiddenden/loyal_companion

latte b29822efc7 added technical documentation

2026-01-13 17:20:52 +00:00

Fact Extraction System Deep Dive

The fact extraction system autonomously learns facts about users from their conversations with the bot.

Overview

┌──────────────────────────────────────────────────────────────────────────────┐
│                          Fact Extraction Pipeline                             │
└──────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
                    ┌──────────────────────────────────────┐
                    │         Rate Limiter (30%)           │
                    │  Only process ~30% of messages       │
                    └──────────────────────────────────────┘
                                    │
                                    ▼
                    ┌──────────────────────────────────────┐
                    │      Extractability Check            │
                    │  - Min 20 chars                      │
                    │  - Not a command                     │
                    │  - Not just greetings                │
                    │  - Has enough text content           │
                    └──────────────────────────────────────┘
                                    │
                                    ▼
                    ┌──────────────────────────────────────┐
                    │      AI Fact Extraction              │
                    │  Extracts structured facts           │
                    └──────────────────────────────────────┘
                                    │
                                    ▼
                    ┌──────────────────────────────────────┐
                    │      Deduplication                   │
                    │  - Exact match check                 │
                    │  - Substring check                   │
                    │  - Word overlap check (70%)          │
                    └──────────────────────────────────────┘
                                    │
                                    ▼
                    ┌──────────────────────────────────────┐
                    │      Validation & Storage            │
                    │  Save valid, unique facts            │
                    └──────────────────────────────────────┘

Fact Types

| Type | Description | Examples |
|------|-------------|----------|
| `hobby` | Activities, interests, pastimes | "loves hiking", "plays guitar" |
| `work` | Job, career, professional life | "works as a software engineer at Google" |
| `family` | Family members, relationships | "has two younger sisters" |
| `preference` | Likes, dislikes, preferences | "prefers dark roast coffee" |
| `location` | Places they live, visit, are from | "lives in Amsterdam" |
| `event` | Important life events | "recently got married" |
| `relationship` | Personal relationships | "has a girlfriend named Sarah" |
| `general` | Other facts that don't fit | "speaks three languages" |

Fact Attributes

Each extracted fact has:

| Attribute | Type | Description |
|-----------|------|-------------|
| `type` | string | One of the fact types above |
| `content` | string | The fact itself (third person) |
| `confidence` | float | How certain the extraction is |
| `importance` | float | How significant the fact is |
| `temporal` | string | Time relevance |

Confidence Levels

| Level | Value | When to Use |
|-------|-------|-------------|
| Implied | 0.6 | Fact is suggested but not stated |
| Stated | 0.8 | Fact is clearly mentioned |
| Explicit | 1.0 | User directly stated the fact |

Importance Levels

| Level | Value | Description |
|-------|-------|-------------|
| Trivial | 0.3 | Minor detail |
| Normal | 0.5 | Standard fact |
| Significant | 0.8 | Important information |
| Very Important | 1.0 | Major life fact |

Temporal Relevance

| Value | Description | Example |
|-------|-------------|---------|
| `past` | Happened before | "used to live in Paris" |
| `present` | Currently true | "works at Microsoft" |
| `future` | Planned/expected | "getting married next month" |
| `timeless` | Always true | "was born in Japan" |
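
The attributes above can be modelled as a small validated record. This is a minimal sketch, not the project's actual class: the name `ExtractedFact`, the default values, and the choice to raise `ValueError` on bad input are all assumptions made for illustration.

```python
from dataclasses import dataclass

FACT_TYPES = frozenset({
    "hobby", "work", "family", "preference",
    "location", "event", "relationship", "general",
})
TEMPORAL_VALUES = frozenset({"past", "present", "future", "timeless"})


@dataclass(frozen=True)
class ExtractedFact:
    """Hypothetical container for one extracted fact."""

    type: str
    content: str
    confidence: float = 0.8   # assumed default: "Stated"
    importance: float = 0.5   # assumed default: "Normal"
    temporal: str = "present"  # assumed default

    def __post_init__(self) -> None:
        # Reject values outside the documented vocabularies and ranges.
        if self.type not in FACT_TYPES:
            raise ValueError(f"unknown fact type: {self.type!r}")
        if self.temporal not in TEMPORAL_VALUES:
            raise ValueError(f"unknown temporal value: {self.temporal!r}")
        for name in ("confidence", "importance"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")
```

Validating at construction time keeps malformed AI output from reaching storage; whether the real service rejects or repairs such values is not stated in this document.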

Rate Limiting

To limit AI API calls, extraction is attempted on only a random sample of messages:

# Only attempt extraction on ~30% of messages
if random.random() > settings.fact_extraction_rate:
    return []  # Skip this message

Configuration:

- `FACT_EXTRACTION_RATE = 0.3` (default)
- Can be adjusted from 0.0 (disabled) to 1.0 (every message)

Why Rate Limit?

- Reduces AI API costs
- Not every message contains facts
- Prevents redundant extractions
- Spreads learning over time

Extractability Checks

Before a message is sent to the AI, it must pass each of the following filters:

Minimum Length

MIN_MESSAGE_LENGTH = 20
if len(content) < MIN_MESSAGE_LENGTH:
    return False

Alpha Ratio

# Must be at least 50% alphabetic characters
alpha_ratio = sum(c.isalpha() for c in content) / len(content)
if alpha_ratio < 0.5:
    return False

Command Detection

# Skip command-like messages
if content.startswith(("!", "/", "?", ".")):
    return False

Short Phrase Filter

short_phrases = [
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow"
]
if content.lower().strip() in short_phrases:
    return False
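
The four filters can be combined into a single predicate. The sketch below is one possible arrangement: the function name `is_extractable`, the order of the checks, and stripping whitespace before measuring length are assumptions, while the thresholds and phrase list come from the snippets above.

```python
MIN_MESSAGE_LENGTH = 20
MIN_ALPHA_RATIO = 0.5
COMMAND_PREFIXES = ("!", "/", "?", ".")
SHORT_PHRASES = frozenset({
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow",
})


def is_extractable(content: str) -> bool:
    """Return True if the message is worth sending to the AI."""
    text = content.strip()
    # Cheap checks first: length, command prefix, bare greeting.
    if len(text) < MIN_MESSAGE_LENGTH:
        return False
    if text.startswith(COMMAND_PREFIXES):
        return False
    if text.lower() in SHORT_PHRASES:
        return False
    # The length check above guarantees len(text) > 0 here.
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    return alpha_ratio >= MIN_ALPHA_RATIO
```

With a 20-character minimum, every entry in the short-phrase list is already rejected by the length check, so that filter only matters if the minimum is lowered.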

AI Extraction Prompt

The system sends the AI a prompt of the following form, with the user's already-known facts filled in:

You are a fact extraction assistant. Extract factual information 
about the user from their message.

ALREADY KNOWN FACTS:
- [hobby] loves hiking
- [work] works as senior engineer at Google

RULES:
1. Only extract CONCRETE facts, not opinions or transient states
2. Skip if the fact is already known (listed above)
3. Skip greetings, questions, or meta-conversation
4. Skip vague statements like "I like stuff" - be specific
5. Focus on: hobbies, work, family, preferences, locations, events, relationships
6. Keep fact content concise (under 100 characters)
7. Maximum 3 facts per message

OUTPUT FORMAT:
Return a JSON array of facts, or empty array [] if no extractable facts.
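
The "ALREADY KNOWN FACTS" section changes per user, so the prompt has to be assembled at request time. The following is a rough sketch of that assembly; the function names, the `- (none)` placeholder, and the way the user's message is appended under a `USER MESSAGE:` label are assumptions, since this document does not show how the real service attaches the message.

```python
PROMPT_HEADER = (
    "You are a fact extraction assistant. Extract factual information\n"
    "about the user from their message."
)


def format_known_facts(known_facts: list[tuple[str, str]]) -> str:
    """Render (type, content) pairs as '- [type] content' lines."""
    if not known_facts:
        return "- (none)"  # assumed placeholder when nothing is known yet
    return "\n".join(f"- [{fact_type}] {content}" for fact_type, content in known_facts)


def build_extraction_prompt(
    known_facts: list[tuple[str, str]],
    rules: str,
    message: str,
) -> str:
    """Join the prompt sections with blank lines between them."""
    sections = [
        PROMPT_HEADER,
        "ALREADY KNOWN FACTS:\n" + format_known_facts(known_facts),
        "RULES:\n" + rules,
        "USER MESSAGE:\n" + message,  # assumed label, not from the source
    ]
    return "\n\n".join(sections)
```

Listing known facts in the prompt lets the model skip repeats before the separate deduplication step runs.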

Example Input/Output

Input: "I just got promoted to senior engineer at Google last week!"

Output:

[
  {
    "type": "work",
    "content": "works as senior engineer at Google",
    "confidence": 1.0,
    "importance": 0.8,
    "temporal": "present"
  },
  {
    "type": "event",
    "content": "recently got promoted",
    "confidence": 1.0,
    "importance": 0.7,
    "temporal": "past"
  }
]

Input: "hey what's up"

Output:

[]
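
Model output is not guaranteed to be well-formed, so the response is worth validating before anything is stored. The sketch below shows one defensive way to do that; the function names, the fallback defaults (0.8, 0.5, "present"), and the decision to silently drop malformed entries are assumptions rather than the service's documented behavior.

```python
import json

VALID_TYPES = frozenset({
    "hobby", "work", "family", "preference",
    "location", "event", "relationship", "general",
})
VALID_TEMPORAL = frozenset({"past", "present", "future", "timeless"})
MAX_FACTS_PER_MESSAGE = 3


def _unit_float(value: object, default: float) -> float:
    """Coerce to float and clamp to [0, 1]; fall back to default on failure."""
    try:
        number = float(value)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return default
    return min(1.0, max(0.0, number))


def parse_extracted_facts(raw: str) -> list[dict]:
    """Parse the AI response into at most MAX_FACTS_PER_MESSAGE clean facts."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []

    facts: list[dict] = []
    for item in data:
        if len(facts) >= MAX_FACTS_PER_MESSAGE:
            break
        if not isinstance(item, dict):
            continue
        fact_type = item.get("type")
        content = str(item.get("content") or "").strip()
        if not isinstance(fact_type, str) or fact_type not in VALID_TYPES or not content:
            continue
        temporal = item.get("temporal")
        if not isinstance(temporal, str) or temporal not in VALID_TEMPORAL:
            temporal = "present"  # assumed fallback
        facts.append({
            "type": fact_type,
            "content": content,
            "confidence": _unit_float(item.get("confidence"), 0.8),
            "importance": _unit_float(item.get("importance"), 0.5),
            "temporal": temporal,
        })
    return facts
```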

Deduplication

Before saving, facts are checked for duplicates:

1. Exact Match

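# existing_content is presumably a collection (e.g. a set) of lowercased
# fact strings, so `in` tests for an exact match rather than a substring
if new_content.lower() in existing_content:
    return True  # Is duplicate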

2. Substring Check

# If one contains the other (for facts > 10 chars)
if len(new_lower) > 10 and len(existing) > 10:
    if new_lower in existing or existing in new_lower:
        return True

3. Word Overlap (70% threshold)

new_words = set(new_lower.split())
existing_words = set(existing.split())

if len(new_words) > 2 and len(existing_words) > 2:
    overlap = len(new_words & existing_words)
    min_len = min(len(new_words), len(existing_words))
    if overlap / min_len > 0.7:
        return True
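
The three checks can be combined into one function that compares a candidate against every stored fact. This is a sketch under assumptions: the name `is_duplicate`, the list-of-strings input, and normalizing both sides with `lower().strip()` are illustrative choices, while the length and overlap thresholds are taken from the snippets above.

```python
MIN_SUBSTRING_LENGTH = 10
MIN_WORDS_FOR_OVERLAP = 2
OVERLAP_THRESHOLD = 0.7


def is_duplicate(new_content: str, existing_contents: list[str]) -> bool:
    """Return True if new_content duplicates any already-stored fact."""
    new_lower = new_content.lower().strip()
    new_words = set(new_lower.split())

    for raw in existing_contents:
        existing = raw.lower().strip()

        # 1. Exact match
        if new_lower == existing:
            return True

        # 2. Substring check, only for facts longer than 10 characters
        if len(new_lower) > MIN_SUBSTRING_LENGTH and len(existing) > MIN_SUBSTRING_LENGTH:
            if new_lower in existing or existing in new_lower:
                return True

        # 3. Word overlap relative to the shorter fact
        existing_words = set(existing.split())
        if len(new_words) > MIN_WORDS_FOR_OVERLAP and len(existing_words) > MIN_WORDS_FOR_OVERLAP:
            overlap = len(new_words & existing_words)
            if overlap / min(len(new_words), len(existing_words)) > OVERLAP_THRESHOLD:
                return True

    return False
```

Because the overlap is measured against the shorter fact, two four-word facts that differ in a single word score 3/4 = 0.75 and are treated as duplicates, even when that one word changes the meaning.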

Examples:

- "loves hiking" vs "loves hiking" → Duplicate (exact)
- "works as engineer at Google" vs "engineer at Google" → Duplicate (substring)
- "has two younger sisters" vs "has two younger brothers" → Duplicate (3 of 4 words shared = 75%, above the 70% threshold). Note that this is a false positive of the heuristic: the two facts differ in meaning.
- "loves hiking" vs "enjoys cooking" → Not duplicate

Database Schema

UserFact Table

| Column | Type | Description |
|--------|------|-------------|
| `id` | Integer | Primary key |
| `user_id` | Integer | Foreign key to users |
| `fact_type` | String | Category (hobby, work, etc.) |
| `fact_content` | String | The fact content |
| `confidence` | Float | Extraction confidence (0-1) |
| `source` | String | "auto_extraction" or "manual" |
| `is_active` | Boolean | Whether fact is still valid |
| `learned_at` | DateTime | When fact was learned |
| `category` | String | Same as fact_type |
| `importance` | Float | Importance level (0-1) |
| `temporal_relevance` | String | past/present/future/timeless |
| `extracted_from_message_id` | BigInteger | Discord message ID |
| `extraction_context` | String | First 200 chars of source message |

API Reference

FactExtractionService

class FactExtractionService:
    MIN_MESSAGE_LENGTH = 20
    MAX_FACTS_PER_MESSAGE = 3

    def __init__(
        self,
        session: AsyncSession,
        ai_service=None,
    ) -> None: ...

    async def maybe_extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]:
        """Rate-limited extraction."""
        ...

    async def extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]:
        """Direct extraction (no rate limiting)."""
        ...

Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `FACT_EXTRACTION_ENABLED` | `true` | Enable/disable fact extraction |
| `FACT_EXTRACTION_RATE` | `0.3` | Probability of extraction (0-1) |

Example Usage

from daemon_boyfriend.services.fact_extraction_service import FactExtractionService

async with get_session() as session:
    fact_service = FactExtractionService(session, ai_service)
    
    # Rate-limited extraction (recommended for normal use)
    new_facts = await fact_service.maybe_extract_facts(
        user=user,
        message_content="I just started learning Japanese!",
        discord_message_id=123456789
    )
    
    for fact in new_facts:
        print(f"Learned: [{fact.fact_type}] {fact.fact_content}")
    
    # Direct extraction (skips rate limiting)
    facts = await fact_service.extract_facts(
        user=user,
        message_content="I work at Microsoft as a PM"
    )
    
    await session.commit()

Manual Fact Addition

Users can also add facts manually:

!remember Command

User: !remember I'm allergic to peanuts

Bot: Got it! I'll remember that you're allergic to peanuts.

These facts have:

- `source = "manual"` instead of `"auto_extraction"`
- `confidence = 1.0` (user stated directly)
- `importance = 0.8` (user wanted it remembered)

Admin Command

Admin: !teachbot @user Works night shifts

Bot: Got it! I've noted that about @user.

Fact Retrieval

Facts are used in AI prompts for context:

# Build user context including facts
async def build_user_context(user: User) -> str:
    facts = await get_active_facts(user)
    
    context = f"User: {user.custom_name or user.discord_name}\n"
    context += "Known facts:\n"
    
    for fact in facts:
        context += f"- {fact.fact_content}\n"
    
    return context

Example Context

User: Alex
Known facts:
- works as senior engineer at Google
- loves hiking on weekends
- has two cats named Luna and Stella
- prefers dark roast coffee
- speaks English and Japanese

Design Considerations

Why Third Person?

Facts are stored in third person ("loves hiking" not "I love hiking"):

- Easier to inject into prompts
- Consistent format
- Works in any context

Why Rate Limit?

- Not every message contains facts
- AI API calls are expensive
- Quality over quantity
- Natural learning pace

Why Deduplication?

- Prevents redundant storage
- Keeps fact list clean
- Reduces noise in prompts
- Respects user privacy (one fact = one entry)

Privacy Considerations

- Facts can be viewed with `!whatdoyouknow`
- Facts can be deleted with `!forgetme`
- Extraction context is stored (can be audited)
- Source message ID is stored (for reference)
