Fact Extraction System Deep Dive
The fact extraction system autonomously learns facts about users from their conversations with the bot.
Overview
┌──────────────────────────────────────────────────────────────────────────────┐
│ Fact Extraction Pipeline │
└──────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Rate Limiter (30%) │
│ Only process ~30% of messages │
└──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Extractability Check │
│ - Min 20 chars │
│ - Not a command │
│ - Not just greetings │
│ - Has enough text content │
└──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ AI Fact Extraction │
│ Extracts structured facts │
└──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Deduplication │
│ - Exact match check │
│ - Substring check │
│ - Word overlap check (70%) │
└──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Validation & Storage │
│ Save valid, unique facts │
└──────────────────────────────────────┘
Fact Types
| Type | Description | Examples |
|---|---|---|
| `hobby` | Activities, interests, pastimes | "loves hiking", "plays guitar" |
| `work` | Job, career, professional life | "works as a software engineer at Google" |
| `family` | Family members, relationships | "has two younger sisters" |
| `preference` | Likes, dislikes, preferences | "prefers dark roast coffee" |
| `location` | Places they live, visit, are from | "lives in Amsterdam" |
| `event` | Important life events | "recently got married" |
| `relationship` | Personal relationships | "has a girlfriend named Sarah" |
| `general` | Other facts that don't fit | "speaks three languages" |
Fact Attributes
Each extracted fact has:
| Attribute | Type | Description |
|---|---|---|
| `type` | string | One of the fact types above |
| `content` | string | The fact itself (third person) |
| `confidence` | float | How certain the extraction is |
| `importance` | float | How significant the fact is |
| `temporal` | string | Time relevance |
Confidence Levels
| Level | Value | When to Use |
|---|---|---|
| Implied | 0.6 | Fact is suggested but not stated |
| Stated | 0.8 | Fact is clearly mentioned |
| Explicit | 1.0 | User directly stated the fact |
Importance Levels
| Level | Value | Description |
|---|---|---|
| Trivial | 0.3 | Minor detail |
| Normal | 0.5 | Standard fact |
| Significant | 0.8 | Important information |
| Very Important | 1.0 | Major life fact |
Temporal Relevance
| Value | Description | Example |
|---|---|---|
| `past` | Happened before | "used to live in Paris" |
| `present` | Currently true | "works at Microsoft" |
| `future` | Planned/expected | "getting married next month" |
| `timeless` | Always true | "was born in Japan" |
Rate Limiting
To prevent excessive API calls and ensure quality:
# Only attempt extraction on ~30% of messages
if random.random() > settings.fact_extraction_rate:
    return []  # Skip this message
Configuration:
- `FACT_EXTRACTION_RATE=0.3` (default)
- Can be adjusted from 0.0 (disabled) to 1.0 (every message)
Why Rate Limit?
- Reduces AI API costs
- Not every message contains facts
- Prevents redundant extractions
- Spreads learning over time
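The gate can be tried in isolation. The following is a minimal sketch, not the project's code: `should_attempt_extraction` and its `rng` parameter are hypothetical names introduced here so the random draw can be seeded. It compares with `<` on the draw, so a rate of 0.0 never passes and a rate of 1.0 always passes.

```python
from __future__ import annotations

import random


def should_attempt_extraction(rate: float, rng: random.Random | None = None) -> bool:
    """Return True for roughly `rate` of all calls (hypothetical helper)."""
    source = rng or random
    # random() draws from [0.0, 1.0), so `draw < rate` passes with probability `rate`.
    return source.random() < rate


# Seeded run: the observed share of processed messages should sit near the rate.
rng = random.Random(42)
attempts = 10_000
processed = sum(should_attempt_extraction(0.3, rng) for _ in range(attempts))
print(f"processed {processed / attempts:.1%} of messages")
```

Passing a seeded `random.Random` keeps tests deterministic while production code can fall back to the module-level generator.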
Extractability Checks
Before sending to AI, messages are filtered:
Minimum Length
MIN_MESSAGE_LENGTH = 20
if len(content) < MIN_MESSAGE_LENGTH:
    return False
Alpha Ratio
# Must be at least 50% alphabetic characters
alpha_ratio = sum(c.isalpha() for c in content) / len(content)
if alpha_ratio < 0.5:
    return False
Command Detection
# Skip command-like messages
if content.startswith(("!", "/", "?", ".")):
    return False
Short Phrase Filter
short_phrases = [
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow",
]
if content.lower().strip() in short_phrases:
    return False
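The four checks can be combined into one predicate. This is a sketch under stated assumptions: the function name `is_extractable`, the order of the checks, and stripping whitespace before measuring length are choices made here, not confirmed by the source. In this order the short-phrase filter is effectively redundant, because every listed phrase is shorter than 20 characters and is already rejected by the length check.

```python
MIN_MESSAGE_LENGTH = 20
COMMAND_PREFIXES = ("!", "/", "?", ".")
SHORT_PHRASES = frozenset({
    "hi", "hello", "hey", "yo", "sup", "bye", "goodbye",
    "thanks", "thank you", "ok", "okay", "yes", "no",
    "yeah", "nah", "lol", "lmao", "haha", "hehe",
    "nice", "cool", "wow",
})


def is_extractable(content: str) -> bool:
    """Return True if a message is worth sending to the AI (hypothetical helper)."""
    content = content.strip()  # assumption: surrounding whitespace is ignored
    if len(content) < MIN_MESSAGE_LENGTH:
        return False
    if content.startswith(COMMAND_PREFIXES):
        return False
    # Safe division: the length check above guarantees len(content) >= 20.
    alpha_ratio = sum(c.isalpha() for c in content) / len(content)
    if alpha_ratio < 0.5:
        return False
    if content.lower() in SHORT_PHRASES:
        return False
    return True


for message in ("I just started learning Japanese!", "hey what's up", "!remember I'm allergic to peanuts"):
    print(f"{message!r}: {is_extractable(message)}")
```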
AI Extraction Prompt
The system sends the AI a prompt that lists the already known facts, the extraction rules, and the required output format:
You are a fact extraction assistant. Extract factual information
about the user from their message.
ALREADY KNOWN FACTS:
- [hobby] loves hiking
- [work] works as senior engineer at Google
RULES:
1. Only extract CONCRETE facts, not opinions or transient states
2. Skip if the fact is already known (listed above)
3. Skip greetings, questions, or meta-conversation
4. Skip vague statements like "I like stuff" - be specific
5. Focus on: hobbies, work, family, preferences, locations, events, relationships
6. Keep fact content concise (under 100 characters)
7. Maximum 3 facts per message
OUTPUT FORMAT:
Return a JSON array of facts, or empty array [] if no extractable facts.
Example Input/Output
Input: "I just got promoted to senior engineer at Google last week!"
Output:
[
  {
    "type": "work",
    "content": "works as senior engineer at Google",
    "confidence": 1.0,
    "importance": 0.8,
    "temporal": "present"
  },
  {
    "type": "event",
    "content": "recently got promoted",
    "confidence": 1.0,
    "importance": 0.7,
    "temporal": "past"
  }
]
Input: "hey what's up"
Output:
[]
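Model output is not guaranteed to be well-formed, so the JSON array needs checking before anything is stored. The sketch below is one possible validation step and is not taken from the project: `parse_facts` is a hypothetical name, and silently dropping invalid entries (rather than repairing them) is an assumption. The allowed types, temporal values, and the cap of 3 facts come from the tables and rules in this document.

```python
import json

FACT_TYPES = {"hobby", "work", "family", "preference",
              "location", "event", "relationship", "general"}
TEMPORAL_VALUES = {"past", "present", "future", "timeless"}
MAX_FACTS_PER_MESSAGE = 3


def parse_facts(raw: str) -> list[dict]:
    """Parse the AI response and keep only well-formed facts (hypothetical helper)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []

    facts: list[dict] = []
    for item in data:
        if not isinstance(item, dict):
            continue
        fact_type = item.get("type")
        temporal = item.get("temporal")
        content = str(item.get("content", "")).strip()
        if not isinstance(fact_type, str) or fact_type not in FACT_TYPES:
            continue
        if not isinstance(temporal, str) or temporal not in TEMPORAL_VALUES:
            continue
        if not content:
            continue
        try:
            confidence = float(item.get("confidence"))
            importance = float(item.get("importance"))
        except (TypeError, ValueError):
            continue
        # Both scores are documented as values between 0 and 1.
        if not (0.0 <= confidence <= 1.0 and 0.0 <= importance <= 1.0):
            continue
        facts.append({"type": fact_type, "content": content,
                      "confidence": confidence, "importance": importance,
                      "temporal": temporal})
        if len(facts) == MAX_FACTS_PER_MESSAGE:
            break
    return facts
```

Returning an empty list on any parse failure mirrors the documented "empty array if no extractable facts" behaviour, so callers need only one code path.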
Deduplication
Before saving, facts are checked for duplicates:
1. Exact Match
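# existing_content is presumably a collection of lowercased fact strings;
# against a single string, `in` would be a substring test, not an exact match
if new_content.lower() in existing_content:
    return True  # Is duplicate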
2. Substring Check
# If one contains the other (for facts > 10 chars)
if len(new_lower) > 10 and len(existing) > 10:
    if new_lower in existing or existing in new_lower:
        return True
3. Word Overlap (70% threshold)
new_words = set(new_lower.split())
existing_words = set(existing.split())
if len(new_words) > 2 and len(existing_words) > 2:
    overlap = len(new_words & existing_words)
    min_len = min(len(new_words), len(existing_words))
    if overlap / min_len > 0.7:
        return True
Examples:
- "loves hiking" vs "loves hiking" → Duplicate (exact)
- "works as engineer at Google" vs "engineer at Google" → Duplicate (substring)
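- "has two younger sisters" vs "has two younger brothers" → Duplicate (3 of 4 words shared = 75%, above the 70% threshold); note that the overlap check treats these two distinct facts as one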
- "loves hiking" vs "enjoys cooking" → Not duplicate
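The three checks and the four examples can be reproduced with one function. This is a sketch that assumes the checks run in the listed order against each known fact; `is_duplicate` and its parameters are hypothetical names, and the real service may structure this differently.

```python
def is_duplicate(new_content: str, existing_contents: list[str]) -> bool:
    """Return True if new_content matches any known fact (hypothetical helper)."""
    new_lower = new_content.lower().strip()
    new_words = set(new_lower.split())

    for existing in existing_contents:
        existing = existing.lower().strip()

        # 1. Exact match
        if new_lower == existing:
            return True

        # 2. Substring check, only for facts longer than 10 characters
        if len(new_lower) > 10 and len(existing) > 10:
            if new_lower in existing or existing in new_lower:
                return True

        # 3. Word overlap relative to the shorter fact, threshold 70%
        existing_words = set(existing.split())
        if len(new_words) > 2 and len(existing_words) > 2:
            overlap = len(new_words & existing_words)
            min_len = min(len(new_words), len(existing_words))
            if overlap / min_len > 0.7:
                return True

    return False


pairs = [
    ("loves hiking", "loves hiking"),
    ("works as engineer at Google", "engineer at Google"),
    ("has two younger sisters", "has two younger brothers"),
    ("loves hiking", "enjoys cooking"),
]
for new, known in pairs:
    print(f"{new!r} vs {known!r}: {is_duplicate(new, [known])}")
```

Because the overlap ratio is measured against the shorter fact, short facts that differ in a single word can still exceed the threshold, as the sisters/brothers pair shows.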
Database Schema
UserFact Table
| Column | Type | Description |
|---|---|---|
| `id` | Integer | Primary key |
| `user_id` | Integer | Foreign key to users |
| `fact_type` | String | Category (hobby, work, etc.) |
| `fact_content` | String | The fact content |
| `confidence` | Float | Extraction confidence (0-1) |
| `source` | String | "auto_extraction" or "manual" |
| `is_active` | Boolean | Whether fact is still valid |
| `learned_at` | DateTime | When fact was learned |
| `category` | String | Same as fact_type |
| `importance` | Float | Importance level (0-1) |
| `temporal_relevance` | String | past/present/future/timeless |
| `extracted_from_message_id` | BigInteger | Discord message ID |
| `extraction_context` | String | First 200 chars of source message |
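For readers who want to handle rows outside the ORM, the columns can be mirrored in a plain dataclass. This is an illustrative sketch only: `UserFactRecord` is a hypothetical name, every default value is an assumption, and the actual table is defined by the project's model, which is not shown here.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class UserFactRecord:
    """Plain-Python mirror of the UserFact columns (hypothetical, not the ORM model)."""
    user_id: int
    fact_type: str
    fact_content: str
    confidence: float = 1.0                # assumed default
    source: str = "auto_extraction"        # or "manual"
    is_active: bool = True
    learned_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    category: str = ""                     # filled from fact_type below
    importance: float = 0.5                # assumed default
    temporal_relevance: str = "present"    # assumed default
    extracted_from_message_id: int | None = None
    extraction_context: str | None = None
    id: int | None = None                  # assigned by the database

    def __post_init__(self) -> None:
        # The schema documents category as "same as fact_type".
        if not self.category:
            self.category = self.fact_type
        # Only the first 200 characters of the source message are kept.
        if self.extraction_context is not None:
            self.extraction_context = self.extraction_context[:200]


record = UserFactRecord(
    user_id=1,
    fact_type="work",
    fact_content="works at Microsoft",
    extraction_context="I work at Microsoft as a PM",
)
print(record.category, record.source)
```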
API Reference
FactExtractionService
class FactExtractionService:
    MIN_MESSAGE_LENGTH = 20
    MAX_FACTS_PER_MESSAGE = 3

    def __init__(
        self,
        session: AsyncSession,
        ai_service=None,
    ) -> None: ...

    # Rate-limited extraction
    async def maybe_extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]: ...

    # Direct extraction (no rate limiting)
    async def extract_facts(
        self,
        user: User,
        message_content: str,
        discord_message_id: int | None = None,
    ) -> list[UserFact]: ...
Configuration
| Variable | Default | Description |
|---|---|---|
| `FACT_EXTRACTION_ENABLED` | `true` | Enable/disable fact extraction |
| `FACT_EXTRACTION_RATE` | `0.3` | Probability of extraction (0-1) |
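As an illustration of how these two variables might be read, the sketch below parses them from the environment. It is not the project's settings code: `load_fact_extraction_config` is a hypothetical name, the accepted spellings of "true" are an assumption, and clamping the rate into the documented 0-1 range is a choice made here.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

TRUTHY = {"1", "true", "yes", "on"}  # assumption: accepted spellings of "enabled"


def load_fact_extraction_config(env: Mapping[str, str] | None = None) -> tuple[bool, float]:
    """Return (enabled, rate) from environment-style variables (hypothetical helper)."""
    env = os.environ if env is None else env
    enabled = env.get("FACT_EXTRACTION_ENABLED", "true").strip().lower() in TRUTHY
    rate = float(env.get("FACT_EXTRACTION_RATE", "0.3"))
    # Keep the probability inside the documented 0-1 range.
    rate = min(1.0, max(0.0, rate))
    return enabled, rate


print(load_fact_extraction_config({}))
print(load_fact_extraction_config({"FACT_EXTRACTION_ENABLED": "false", "FACT_EXTRACTION_RATE": "1.5"}))
```

Accepting a mapping instead of reading `os.environ` directly makes the defaults easy to test without touching the real environment.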
Example Usage
from loyal_companion.services.fact_extraction_service import FactExtractionService

async with get_session() as session:
    fact_service = FactExtractionService(session, ai_service)

    # Rate-limited extraction (recommended for normal use)
    new_facts = await fact_service.maybe_extract_facts(
        user=user,
        message_content="I just started learning Japanese!",
        discord_message_id=123456789,
    )
    for fact in new_facts:
        print(f"Learned: [{fact.fact_type}] {fact.fact_content}")

    # Direct extraction (skips rate limiting)
    facts = await fact_service.extract_facts(
        user=user,
        message_content="I work at Microsoft as a PM",
    )
    await session.commit()
Manual Fact Addition
Users can also add facts manually:
!remember Command
User: !remember I'm allergic to peanuts
Bot: Got it! I'll remember that you're allergic to peanuts.
These facts have:
- `source = "manual"` instead of `"auto_extraction"`
- `confidence = 1.0` (user stated directly)
- `importance = 0.8` (user wanted it remembered)
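A sketch of how those fixed values might be attached to a manually added fact follows. `build_manual_fact` is a hypothetical helper, and defaulting the type to `general` is an assumption, since the source does not say how manual facts are categorized.

```python
def build_manual_fact(content: str, fact_type: str = "general") -> dict:
    """Return the field values for a manually added fact (hypothetical helper)."""
    return {
        "fact_type": fact_type,          # assumption: manual facts default to "general"
        "fact_content": content.strip(),
        "source": "manual",              # instead of "auto_extraction"
        "confidence": 1.0,               # user stated it directly
        "importance": 0.8,               # user wanted it remembered
    }


print(build_manual_fact("is allergic to peanuts"))
```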
Admin Command
Admin: !teachbot @user Works night shifts
Bot: Got it! I've noted that about @user.
Fact Retrieval
Facts are used in AI prompts for context:
# Build user context including facts
async def build_user_context(user: User) -> str:
    facts = await get_active_facts(user)
    context = f"User: {user.custom_name or user.discord_name}\n"
    context += "Known facts:\n"
    for fact in facts:
        context += f"- {fact.fact_content}\n"
    return context
Example Context
User: Alex
Known facts:
- works as senior engineer at Google
- loves hiking on weekends
- has two cats named Luna and Stella
- prefers dark roast coffee
- speaks English and Japanese
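The formatting step can be checked without a database. The sketch below is a synchronous stand-in that takes plain strings; `format_user_context` and its parameters are hypothetical names, and it assumes the same fallback from custom name to Discord name shown earlier.

```python
from __future__ import annotations


def format_user_context(custom_name: str | None, discord_name: str, facts: list[str]) -> str:
    """Build the prompt context block from plain strings (hypothetical helper)."""
    # Prefer the custom name; fall back to the Discord name when it is unset or empty.
    name = custom_name or discord_name
    lines = [f"User: {name}", "Known facts:"]
    lines.extend(f"- {fact}" for fact in facts)
    return "\n".join(lines) + "\n"


context = format_user_context(
    "Alex",
    "alex#0001",  # made-up Discord name for illustration
    ["works as senior engineer at Google", "loves hiking on weekends"],
)
print(context, end="")
```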
Design Considerations
Why Third Person?
Facts are stored in third person ("loves hiking" not "I love hiking"):
- Easier to inject into prompts
- Consistent format
- Works in any context
Why Rate Limit?
- Not every message contains facts
- AI API calls are expensive
- Quality over quantity
- Natural learning pace
Why Deduplication?
- Prevents redundant storage
- Keeps fact list clean
- Reduces noise in prompts
- Respects user privacy (one fact = one entry)
Privacy Considerations
- Facts can be viewed with `!whatdoyouknow`
- Facts can be deleted with `!forgetme`
- Extraction context is stored (can be audited)
- Source message ID is stored (for reference)