
Silent Observer — Brain Intelligence Integration Spec

Status: Draft v1.1 (Director Revisions)
Date: 2026-04-30
Author: Wadsworth (Chief of Staff)
Owner: Socrates (Backend Architecture)
Scope: Phase 8 Integration Layer


Executive Summary

This document specifies the integration architecture between the Silent Observer (Phase 8 — ambient chat monitoring) and the Brain Intelligence (Phase 6/7 — RAG-based knowledge retrieval). Together, they enable zero-UI household coordination where:

  • The Observer listens to family chat without interrupting
  • The Brain provides memory context for decisions
  • Icarus speaks only when there's a real conflict or missing critical variable

Non-Negotiable Constraints (Director-Level):
1. State Protection (No Auto-Writes): All extractions require Human-in-the-Loop [Confirm]
2. Asynchronous Processing: Tier 1 releases chat thread immediately; Tier 2 is async
3. The 'Redline' Speak Rule: Silence is default; speak only on temporal/resource conflicts
4. Staging Environment Only: Connects only to staging.db with Miller Family context


Confidence Field Unification

CRITICAL: Confidence scores are NEVER multiplied or blended. Each stage has independent gates with clear thresholds.

Stage Gates (Sequential, Not Multiplicative)

┌─────────────────────────────────────────────────────────────────────────────┐
                     CONFIDENCE GATE ARCHITECTURE
├─────────────────────────────────────────────────────────────────────────────┤

  Stage 1: Tripwire Confidence
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━
   • Calculated by: Regex pattern matcher (Tier 1)
   • Threshold: ≥ 0.7 to proceed to extraction
   • Logged field: tripwire_confidence (independent, 0.0-1.0)
   • If < 0.7: DROP → message not queued for Tier 2

                          ▼ tripwire_confidence ≥ 0.7

  Stage 2: Extraction Confidence
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
   • Calculated by: LLM extraction model (Tier 2)
   • Threshold: ≥ 0.6 for usable extraction
   • Logged field: extraction_confidence (independent, 0.0-1.0)
   • If < 0.6: DROP → low-quality extraction, don't query Brain

                          ▼ extraction_confidence ≥ 0.6

  Stage 3: Brain Relevance Score
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
   • Calculated by: Brain RAG retriever
   • Threshold: ≥ 0.5 to use Brain context in redline decision
   • Logged field: brain_relevance (independent, 0.0-1.0)
   • If < 0.5: Proceed without Brain context (extraction-only redline)

                          ▼ brain_relevance ≥ 0.5 (optional)

  Stage 4: Redline Decision
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━
   • Calculated by: Rule-based conflict engine (no ML)
   • Output: SILENT | SPEAK
   • Logged field: redline_decision, redline_trigger_rule
   • Does NOT use confidence scores → uses extracted entities only

└─────────────────────────────────────────────────────────────────────────────┘

Confidence Logging Schema

-- Each confidence is logged independently for debugging
CREATE TABLE observer_confidence_log (
    message_id TEXT NOT NULL,
    stage TEXT NOT NULL,                    -- 'tripwire' | 'extraction' | 'brain'
    confidence_type TEXT,                   -- 'regex_score' | 'llm_score' | 'relevance'
    score REAL,                             -- NULL when the stage produced no score (e.g. Brain query timeout)
    threshold REAL NOT NULL,
    passed BOOLEAN NOT NULL,
    decision TEXT NOT NULL,                 -- 'proceed' | 'drop' | 'proceed_no_context'
    logged_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
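
A minimal sketch of how a single gate result might be written to this table, assuming SQLite through Python's `sqlite3` module. The helper name `log_gate`, the in-memory database, and the reduced column list (constraints omitted) are illustrative assumptions, not part of the spec.

```python
import sqlite3

# Thresholds taken from the stage gates described above.
THRESHOLDS = {"tripwire": 0.7, "extraction": 0.6, "brain": 0.5}

def log_gate(conn, message_id, stage, score):
    """Evaluate one gate on its own score and record the outcome."""
    threshold = THRESHOLDS[stage]
    passed = score >= threshold
    if passed:
        decision = "proceed"
    elif stage == "brain":
        decision = "proceed_no_context"  # Brain context is optional
    else:
        decision = "drop"
    conn.execute(
        "INSERT INTO observer_confidence_log "
        "(message_id, stage, score, threshold, passed, decision) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (message_id, stage, score, threshold, passed, decision),
    )
    return decision

# Reduced copy of the table, enough to run the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observer_confidence_log ("
    "message_id TEXT, stage TEXT, score REAL, threshold REAL, "
    "passed BOOLEAN, decision TEXT)"
)
decisions = [
    log_gate(conn, "msg-1", "tripwire", 0.85),
    log_gate(conn, "msg-1", "extraction", 0.72),
    log_gate(conn, "msg-1", "brain", 0.41),
]
```

Each row records the score next to the threshold it was compared against, so a dropped message can be traced to one stage without recomputing anything.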

Anti-Pattern: NEVER Do This

# WRONG — blending scores creates debugging nightmare
final_confidence = tripwire_score * extraction_score * brain_score
if final_confidence > 0.5:  # What failed? Who knows.
    ...

# CORRECT — each gate independent, clear failure point
if tripwire_score < 0.7:
    log("TRIPWIRE_REJECT", score=tripwire_score, threshold=0.7)
    return DROP

if extraction_score < 0.6:
    log("EXTRACTION_REJECT", score=extraction_score, threshold=0.6)
    return DROP

# Brain relevance is optional — proceed with or without
use_brain = brain_relevance >= 0.5

Multi-Message Context Architecture

Status: P0 Priority — Moved to Phase 8.3 (was deferred)

Why: 60% of real coordination happens across multiple messages. Single-message extraction misses critical context.

The Problem

Message A (09:00): "Who's picking up Leo Thursday after soccer?"
Message B (09:04): "I'll do it"
                   ↑ Who is "I"? Which activity? What time?
Without Message A, Message B is unresolvable.

Context Window Specification

MULTI_MESSAGE_CONFIG = {
    "context_window_size": 3,           # Last 3 messages in thread
    "context_ttl_seconds": 300,         # Messages expire after 5 minutes
    "attribution_window": 600,          # Look back 10 min for matching context
    "max_participants": 4,              # Family group size limit
}
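
One way the window size and TTL could be applied when assembling context for a new message is sketched below. The function name `build_context_window` and the message shape (a dict with `message_id` and `sent_at`) are assumptions made for illustration.

```python
from datetime import datetime, timedelta

MULTI_MESSAGE_CONFIG = {"context_window_size": 3, "context_ttl_seconds": 300}

def build_context_window(thread_messages, now, config=MULTI_MESSAGE_CONFIG):
    """Return the newest non-expired messages of a thread, oldest first."""
    ttl = timedelta(seconds=config["context_ttl_seconds"])
    fresh = [m for m in thread_messages if now - m["sent_at"] <= ttl]
    fresh.sort(key=lambda m: m["sent_at"])
    return fresh[-config["context_window_size"]:]

now = datetime(2026, 4, 30, 9, 4)
thread = [
    {"message_id": "a", "sent_at": datetime(2026, 4, 30, 8, 50)},  # older than 5 min
    {"message_id": "b", "sent_at": datetime(2026, 4, 30, 9, 0)},
    {"message_id": "c", "sent_at": datetime(2026, 4, 30, 9, 1)},
    {"message_id": "d", "sent_at": datetime(2026, 4, 30, 9, 2)},
    {"message_id": "e", "sent_at": datetime(2026, 4, 30, 9, 3)},
]
window_ids = [m["message_id"] for m in build_context_window(thread, now)]
```

Here message `a` is dropped by the TTL and message `b` by the size limit, leaving `c`, `d`, `e`.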

Attribution Logic

import re

# First-person markers, matched on word boundaries so "me" does not match "home" or "time"
FIRST_PERSON = re.compile(r"\b(i'll|i will|i can|i'm|i am|me)\b", re.IGNORECASE)
INHERITED_FIELDS = ("dates", "times", "child", "activity")

def is_unassigned(value):
    # The extractor may report a missing assignee as None or "unspecified"
    return value is None or value == "unspecified"

def resolve_attribution(current_message, context_window):
    """
    Match "I'll do it" to the action in prior messages.
    Returns: the extraction dict; assigned_to is "unspecified" when nothing matches.
    """

    # Step 1: Extract from current message
    current = extract(current_message)
    if not is_unassigned(current.get("assigned_to")):
        # Already has assignment (e.g., "John will pick up")
        return current

    # Step 2: Look for first-person references in current message
    if not FIRST_PERSON.search(current_message.text):
        # Not an assignment message
        return current

    # Step 3: Search context window for matching coordination, newest first
    for prior in reversed(context_window):
        prior_extracted = prior.get("extraction", {})

        # Match criteria:
        # 1. Prior has missing assignment (who's/who is/can someone)
        # 2. Same temporal scope (date overlap)
        # 3. Same activity or child mentioned
        # A bare reply like "I'll do it" carries no dates, activity, or child, so the
        # three helpers must treat a missing value on the current side as compatible.
        if (
            prior_extracted.get("assigned_to") == "unspecified" and
            dates_overlap(current.get("dates"), prior_extracted.get("dates")) and
            (activities_match(current.get("activity"), prior_extracted.get("activity")) or
             children_match(current.get("child"), prior_extracted.get("child")))
        ):
            # Attribution found
            current["assigned_to"] = current_message.sender  # "I" = sender of Message B
            current["attributed_to_message_id"] = prior["message_id"]
            current["attribution_confidence"] = 0.85
            # Fill in what the reply left unsaid from the message it answers
            for field in INHERITED_FIELDS:
                if current.get(field) is None:
                    current[field] = prior_extracted.get(field)
            return current

    # No match found: extraction incomplete
    current["assigned_to"] = "unspecified"
    current["attribution_confidence"] = 0.3
    return current
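
The attribution logic calls `dates_overlap`, `activities_match`, and `children_match`, which this spec does not define. The sketch below shows one possible shape for them. The rule that a missing value counts as compatible is an assumption, chosen so that a reply with no dates or activity of its own can still match the question it answers.

```python
def dates_overlap(current_dates, prior_dates):
    """True when the two date lists share a day, or when either list is missing."""
    if not current_dates or not prior_dates:
        return True  # assumption: nothing to contradict
    return bool(set(current_dates) & set(prior_dates))

def _same_or_missing(current_value, prior_value):
    """Case-insensitive equality; a missing value on either side is compatible."""
    if current_value is None or prior_value is None:
        return True
    return str(current_value).lower() == str(prior_value).lower()

def activities_match(current_activity, prior_activity):
    return _same_or_missing(current_activity, prior_activity)

def children_match(current_child, prior_child):
    return _same_or_missing(current_child, prior_child)
```

A stricter variant could require at least one explicit match before attributing, at the cost of missing bare replies such as "sure".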

Context-Aware Tripwire

# Tripwire now operates on message thread, not just single message
def tripwire_with_context(message, context_window):
    # Score current message alone
    base_score = pattern_match_score(message.text)

    # Boost score if context suggests coordination thread
    if is_coordination_thread(context_window):
        # Prior messages contained questions about assignments
        base_score = min(1.0, base_score + 0.15)

    # Boost if current message is short response in context thread
    if (
        len(message.text.split()) <= 5 and  # "I'll do it", "yeah me", "sure"
        is_coordination_thread(context_window)
    ):
        base_score = min(1.0, base_score + 0.20)

    return base_score
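
To make the boost arithmetic concrete, here is a runnable sketch with stand-in helpers. `is_coordination_thread` and `pattern_match_score` are not defined in this spec, so the versions below (a keyword regex and a fixed 0.45 base score) are purely hypothetical. Note that a short reply inside a coordination thread satisfies both conditions, so the two boosts stack.

```python
import re
from types import SimpleNamespace

ASSIGNMENT_QUESTION = re.compile(r"\b(who's|who is|can someone|can you)\b", re.IGNORECASE)

def is_coordination_thread(context_window):
    """Hypothetical check: some prior message asks an assignment question."""
    return any(ASSIGNMENT_QUESTION.search(m.text) for m in context_window)

def pattern_match_score(text):
    """Stand-in scorer that returns a fixed base score for the demo reply."""
    return 0.45

def tripwire_with_context(message, context_window):
    base_score = pattern_match_score(message.text)
    in_thread = is_coordination_thread(context_window)
    if in_thread:
        base_score = min(1.0, base_score + 0.15)  # coordination-thread boost
    if len(message.text.split()) <= 5 and in_thread:
        base_score = min(1.0, base_score + 0.20)  # short-response boost
    return base_score

prior = SimpleNamespace(text="Who's picking up Leo Thursday after soccer?")
reply = SimpleNamespace(text="I'll do it")
score_in_thread = round(tripwire_with_context(reply, [prior]), 2)
score_alone = round(tripwire_with_context(reply, []), 2)
```

With these stand-ins the reply scores 0.45 on its own and 0.80 inside the thread, which clears the 0.7 tripwire gate.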

Database Schema Updates

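-- Track message threading for multi-message context.
-- Migration for an existing observer_messages table; a fresh install already gets
-- these columns from the CREATE TABLE statement in the Database Schema section.
ALTER TABLE observer_messages ADD COLUMN thread_id TEXT;
ALTER TABLE observer_messages ADD COLUMN context_window TEXT; -- JSON array of prior message_ids
ALTER TABLE observer_messages ADD COLUMN attribution_source TEXT; -- message_id this was resolved from

-- Index for thread lookups (IF NOT EXISTS because the full schema defines the same index)
CREATE INDEX IF NOT EXISTS idx_observer_thread ON observer_messages(thread_id, sent_at);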

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
                           FAMILY CHAT (Telegram Group)                     
                     Members: John, Sarah, Icarus (bot)                     
└─────────────────────────────────────────────────────────────────────────────┘
                                      
                                      
┌─────────────────────────────────────────────────────────────────────────────┐
  TIER 1: TRIPWIRE (Python/Regex)                                            
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                                              
   Latency: <10ms                                                            
   Resource: CPU only (Beelink)                                              
   Pattern match: dates, times, coordination keywords                        
   Context-aware: Scans last 3 messages for thread continuity              
   Immediate release of chat thread                                          
                                                                              
  Pattern Categories:                                                        
  ├── Temporal: "tomorrow", "June 4th", "next week", "Tuesday 3pm"          
  ├── Assignment: "I'll pick up", "can you cover", "who's getting"            
  ├── Pronoun Resolution: "I'll do it", "me too", "yeah" (needs context)   
  ├── Children: "Leo", "Mia", "kids", "the children"                        
  ├── Activities: "soccer", "ballet", "swim", "chess", "practice"          
  └── Conflict markers: "but", "wait", "doesn't", "conflict", "overlap"     
└─────────────────────────────────────────────────────────────────────────────┘
                                      
                    ┌─────────────────┴─────────────────┐
                                                       
              NO MATCH (95%)                    MATCH (5%)
         ┌─────────────────────┐              ┌─────────────────────┐
            Drop Silently                    Queue for Tier 2    
            Log to observer_                 (Redis/Queue)       
            messages table                   Return 200 OK to    
            (no processing)                  Telegram            
         └─────────────────────┘              └─────────────────────┘
                                                      
                                                      
┌─────────────────────────────────────────────────────────────────────────────┐
  TIER 2: ASYNC PROCESSING (8B LLM + Brain Query)                           
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━                            
   Worker: Celery/Background task (Beelink)                                  
   LLM: phi4:14b or llama3.1:8b via Gaming PC (Tailscale)                  
   Latency: 1-3s (acceptable: async)
   Brain Query: HTTPS to icarus-test.hoffdesk.com/brain/query                
   Context: Fetches last 3 messages for multi-message resolution             
└─────────────────────────────────────────────────────────────────────────────┘
                                      
                                      
┌─────────────────────────────────────────────────────────────────────────────┐
  COORDINATION STATE EXTRACTOR                                              
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━                                              
                                                                              
  Input: Chat message + Context window (last 3 messages) + Brain context   
  Output: Structured extraction with independent confidence scores            
                                                                              
  {                                                                           
    "is_coordination": true | false,                                        
    "coordination_type": "transport" | "care_coverage" |                    
                         "schedule_change" | "activity_confirm" | null,     
    "dates": ["2026-06-04", "2026-06-05"],                                
    "times": ["15:30", "16:00"],                                          
    "assigned_to": ["john" | "sarah" | "unspecified"],                     
    "child": ["leo" | "mia" | "both" | null],                              
    "activity": "soccer" | "ballet" | "swim" | "chess" | null,              
    "location": "Westside Park" | "Madison Dance Academy" | null,          
    "action_required": "confirm" | "resolve_conflict" | "notify" | null,  
    "extracted_entities": [...],                                            
    "attribution": {                                                        
      "source_message_id": "...",                                          
      "attribution_confidence": 0.85                                        
    },                                                                        
    "brain_query_context": {...} | null,                                   
    "confidence_scores": {                                                   
      "tripwire_confidence": 0.85,                                          
      "extraction_confidence": 0.72,                                        
      "brain_relevance": 0.68                                               
    }                                                                         
  }                                                                           
└─────────────────────────────────────────────────────────────────────────────┘
                                      
                                      
┌─────────────────────────────────────────────────────────────────────────────┐
  REDLINE DECISION ENGINE                                                   
  ━━━━━━━━━━━━━━━━━━━━━━━━                                                  
                                                                              
  ┌─────────────────────────────────────────────────────────────────────┐
    RULE 1: Temporal Conflict
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━
    Condition: Extracted date/time overlaps with existing Event Graph
               entry for same child + overlapping times
    Confidence gate: extraction_confidence ≥ 0.6 (already verified)

    Action: SPEAK → "⚠️ Leo has soccer at 4:30pm Thursday, but this
            message mentions a dentist appointment at 4:00pm.
            Which is correct?"
  └─────────────────────────────────────────────────────────────────────┘

  ┌─────────────────────────────────────────────────────────────────────┐
    RULE 2: Resource Conflict
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━
    Condition: Both parents assigned to conflicting tasks at same time
    Confidence gate: extraction_confidence ≥ 0.6 (already verified)

    Action: SPEAK → "🚨 John and Sarah both said they're covering
            Tuesday 3pm pickup. Who's getting the kids?"
  └─────────────────────────────────────────────────────────────────────┘

  ┌─────────────────────────────────────────────────────────────────────┐
    RULE 3: Missing Critical Variable
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    Condition: Coordination detected but 'assigned_to' or 'child'
               is missing (null or "unspecified") AND activity is
               time-sensitive (<24h)
    Confidence gate: extraction_confidence ≥ 0.6 (already verified)

    Action: SPEAK → "⏳ I see 'someone' needs to cover Thursday 3pm
            for Leo's early dismissal. Who's handling pickup?"
  └─────────────────────────────────────────────────────────────────────┘

  ┌─────────────────────────────────────────────────────────────────────┐
    RULE 4: All Other Cases
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    Condition: No conflict, no missing critical variable

    Action: SILENT → Store extraction as 'pending' with [Confirm]
            button (no chat message)
  └─────────────────────────────────────────────────────────────────────┘
└─────────────────────────────────────────────────────────────────────────────┘
                                      
                    ┌─────────────────┴─────────────────┐
                                                       
              ┌─────────┐                         ┌─────────────┐
                SILENT                              SPEAK    
              └────┬────┘                         └──────┬──────┘
                                                       
                                                       
┌──────────────────────────┐            ┌─────────────────────────────────────┐
  STORE AS PENDING                       POST TO GROUP CHAT
  ━━━━━━━━━━━━━━━━━━━━━━━━               ━━━━━━━━━━━━━━━━━━━━━━━

  Table: pending_actions                 Message includes:
  Status: 'pending'                      • Clear conflict description
  UI: Telegram inline                    • Suggested resolution
      [Confirm] button                   • [Resolve] inline button

  Human can review via:                  Example:
  • /pending command                     "⚠️ Conflict detected:
  • Dashboard                             Leo has soccer 4:30pm Tuesday,
                                          but message mentions chess club
  [Confirm] →                             at 4:00pm. Both?"
    writes to Event Graph
└──────────────────────────┘            └─────────────────────────────────────┘
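
The four rules in the diagram above can be sketched as a single rule-based function. This is an illustration under stated assumptions, not the engine itself: the field names `start` and `end`, the explicit event durations, and the `other_claims` list of earlier pending assignments are all hypothetical, and Rule 2 is reduced to its example case (two people claiming the same task).

```python
from datetime import datetime, timedelta

def _missing(value):
    return value is None or value == "unspecified"

def _overlaps(a, b):
    return a["start"] < b["end"] and b["start"] < a["end"]

def evaluate_redline(extraction, known_events, other_claims, now):
    """Return (decision, trigger_rule). Uses extracted entities only, no scores."""
    # Rule 1: overlaps a different known activity for the same child
    for event in known_events:
        if (event["child"] == extraction["child"]
                and event["activity"] != extraction["activity"]
                and _overlaps(extraction, event)):
            return "SPEAK", "temporal_conflict"
    # Rule 2: someone else already claimed the same task in the same slot
    for claim in other_claims:
        if (claim["activity"] == extraction["activity"]
                and _overlaps(extraction, claim)
                and not _missing(claim["assigned_to"])
                and not _missing(extraction["assigned_to"])
                and claim["assigned_to"] != extraction["assigned_to"]):
            return "SPEAK", "resource_conflict"
    # Rule 3: missing critical variable on an item less than 24h away
    soon = timedelta(0) <= extraction["start"] - now < timedelta(hours=24)
    if soon and (_missing(extraction["assigned_to"]) or _missing(extraction["child"])):
        return "SPEAK", "missing_variable"
    # Rule 4: everything else stays silent and is stored as pending
    return "SILENT", None

now = datetime(2026, 4, 30, 9, 0)
soccer = {"child": "leo", "activity": "soccer",
          "start": datetime(2026, 4, 30, 16, 30), "end": datetime(2026, 4, 30, 17, 30)}
dentist = {"child": "leo", "activity": "dentist", "assigned_to": "sarah",
           "start": datetime(2026, 4, 30, 16, 0), "end": datetime(2026, 4, 30, 17, 0)}
pickup = {"child": "leo", "activity": "pickup", "assigned_to": "unspecified",
          "start": datetime(2026, 4, 30, 15, 0), "end": datetime(2026, 4, 30, 15, 30)}

temporal = evaluate_redline(dentist, [soccer], [], now)
missing = evaluate_redline(pickup, [soccer], [], now)
silent = evaluate_redline({**pickup, "assigned_to": "john"}, [soccer], [], now)
resource = evaluate_redline({**pickup, "assigned_to": "sarah"}, [],
                            [{**pickup, "assigned_to": "john"}], now)
```

The function never reads a confidence score, which matches the Stage 4 description: the gates have already run by the time the rules are evaluated.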

Brevity Constraint (Cross-Reference Daedalus UX Specs)

Source: Daedalus UX Design System — Conversational Agents v2.1

Maximum Message Lengths

| Message Type | Max Length | Rationale |
|---|---|---|
| Conflict alert | 280 chars | Fits in single Telegram bubble |
| Clarification request | 200 chars | Quick to read, easy to answer |
| Confirmation summary | 140 chars | Twitter-length, scannable |

Brevity Patterns

BREVITY_TEMPLATES = {
    "temporal_conflict": "⚠️ {child} has {activity} at {time}, but message says {conflict}. Which is correct?",
    "resource_conflict": "🚨 Both {parent1} and {parent2} claim {task} on {day}. Who's covering?",
    "missing_assignment": "⏳ Someone needs to cover {child}'s {activity} on {day}. Who?",
    "missing_child": "⏳ {activity} mentioned for {day} — which child?",
}
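
A small sketch of how a template could be filled and checked against its length budget before anything is posted. The helper `render_alert` and the mapping from template name to limit (conflict alerts at 280, clarification requests at 200) are assumptions layered on the limits listed above.

```python
BREVITY_TEMPLATES = {
    "temporal_conflict": "⚠️ {child} has {activity} at {time}, but message says {conflict}. Which is correct?",
    "missing_assignment": "⏳ Someone needs to cover {child}'s {activity} on {day}. Who?",
}

# Assumed mapping of template to message type limit (characters).
MAX_LENGTH = {"temporal_conflict": 280, "missing_assignment": 200}

def render_alert(kind, **fields):
    """Fill a template and refuse to return anything over its length budget."""
    text = BREVITY_TEMPLATES[kind].format(**fields)
    limit = MAX_LENGTH[kind]
    if len(text) > limit:
        raise ValueError(f"{kind} message is {len(text)} chars, limit is {limit}")
    return text

alert = render_alert(
    "temporal_conflict",
    child="Leo", activity="soccer", time="4:30pm Thursday", conflict="dentist at 4pm",
)
```

Raising on overflow, rather than truncating, keeps a cut-off sentence from reaching the family chat; the caller can fall back to a shorter template.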

Anti-Patterns (Never Do)

❌ "I noticed that in your message at 9:15 AM, you mentioned..."
❌ "Based on my analysis of the conversation history..."
❌ "It appears there may be a potential scheduling conflict..."
❌ Multi-paragraph explanations

✅ "⚠️ Leo has soccer 4:30pm Thursday. Message says dentist 4pm. Which?"
✅ "🚨 Both John and Sarah claim Tuesday pickup. Who's covering?"
✅ "⏳ Someone needs to cover Leo Thursday 3pm. Who?"

API Contract: Observer ↔ Brain

1. Brain Query Endpoint

Current: https://icarus-test.hoffdesk.com/brain/query?q={question}

For Observer Integration: Extend with structured query support

GET /brain/query?q={question}&context=observer&format=json

Headers:
  X-Observer-Request: true
  X-Family-Context: miller  # staging only

Response (current):
{
  "answer": "Leo's soccer practice is Tuesdays and Thursdays at 4:30 PM...",
  "sources": [...],
  "confidence": "high"
}

Response (extended for Observer):
{
  "answer": "...",
  "sources": [...],
  "confidence": "high",
  "relevance_score": 0.769,                # ← Used for brain_relevance gate
  "temporal_entities": [
    {"type": "time", "value": "16:30", "context": "soccer practice"},
    {"type": "day", "value": "Tuesday", "context": "recurring"}
  ],
  "extracted_events": [
    {
      "summary": "Leo Soccer Practice",
      "start": "2026-05-05T16:30:00",
      "location": "Westside Park"
    }
  ]
}
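
As an illustration of the extended contract, the sketch below builds the query URL with proper encoding and reads `relevance_score` from a response body. It makes no network call. The function names are hypothetical, and the sample body is a trimmed version of the extended response shown above.

```python
import json
from urllib.parse import urlencode

BRAIN_BASE_URL = "https://icarus-test.hoffdesk.com"
OBSERVER_HEADERS = {"X-Observer-Request": "true", "X-Family-Context": "miller"}

def build_brain_query_url(question):
    """URL-encode the question; spaces, apostrophes, and '?' are not URL-safe."""
    params = {"q": question, "context": "observer", "format": "json"}
    return f"{BRAIN_BASE_URL}/brain/query?{urlencode(params)}"

def brain_relevance(response_body):
    """Return relevance_score as a float, or None if the field is absent."""
    score = json.loads(response_body).get("relevance_score")
    return float(score) if score is not None else None

url = build_brain_query_url("When is Leo's soccer practice?")
sample = '{"answer": "...", "sources": [], "confidence": "high", "relevance_score": 0.769}'
relevance = brain_relevance(sample)
use_brain = relevance is not None and relevance >= 0.5
```

A response in the current (non-extended) format has no `relevance_score`, so `brain_relevance` returns `None` and the caller proceeds without Brain context.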

2. Observer → Brain Query Patterns

The Observer queries the Brain for context before making Redline decisions:

| Observer Detected | Brain Query | Purpose |
|---|---|---|
| "Leo soccer Thursday" | "When is Leo's soccer practice?" | Verify against known schedule |
| "Mia has ballet Monday" | "What days does Mia have ballet?" | Check for conflicts |
| "pick up tomorrow 3pm" | "Who usually picks up the kids on [day]?" | Resource assignment history |
| "early dismissal Friday" | "Any events on Friday involving school?" | Temporal conflict check |

3. Brain Query Constraints

# Observer-specific query limits
OBSERVER_BRAIN_CONFIG = {
    "max_queries_per_message": 3,      # Prevent query spam
    "query_timeout_ms": 5000,          # Fail fast if Brain slow
    "cache_ttl_seconds": 60,           # Cache recent queries
    "min_relevance_threshold": 0.5,    # Gate threshold (logged as brain_relevance)
    "staging_only": True,              # Enforce staging environment
}
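
The per-message cap and the cache TTL could be enforced by a thin wrapper around the Brain client, sketched below. The class name `BrainQueryBudget`, the injected `fetch` callable, and the choice that cache hits do not count against the cap are assumptions for illustration.

```python
import time

class BrainQueryBudget:
    """Per-message query cap plus a small TTL cache (illustrative only)."""

    def __init__(self, fetch, max_queries=3, cache_ttl=60, clock=time.monotonic):
        self.fetch = fetch              # callable(question) -> response
        self.max_queries = max_queries
        self.cache_ttl = cache_ttl
        self.clock = clock
        self._cache = {}                # question -> (expires_at, response)
        self._used = {}                 # message_id -> uncached queries made

    def query(self, message_id, question):
        now = self.clock()
        cached = self._cache.get(question)
        if cached and cached[0] > now:
            return cached[1]            # cache hit: does not count against the cap
        if self._used.get(message_id, 0) >= self.max_queries:
            return None                 # cap reached: proceed without Brain context
        self._used[message_id] = self._used.get(message_id, 0) + 1
        response = self.fetch(question)
        self._cache[question] = (now + self.cache_ttl, response)
        return response

calls = []

def fake_fetch(question):
    calls.append(question)
    return {"relevance_score": 0.8}

budget = BrainQueryBudget(fake_fetch, clock=lambda: 0.0)
first = budget.query("m1", "q1")
repeat = budget.query("m1", "q1")       # served from cache
budget.query("m1", "q2")
budget.query("m1", "q3")
over_cap = budget.query("m1", "q4")     # fourth distinct query is refused
```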

Data Flow: Complete Walkthrough

Example 1: Conflict Detection (Speak)

Chat Message (Sarah → Group):

"I'll pick up Leo from school tomorrow at 3pm and take him to soccer"

Step-by-Step:

  1. Tripwire Match (Tier 1)
    - Patterns: "pick up", "tomorrow", "3pm", "soccer", "Leo"
    - tripwire_confidence: 0.85 (≥ 0.7 threshold ✓)
    - Action: Queue for Tier 2

  2. Async Processing (Tier 2)
    - Extract: {"date": "2026-05-01", "time": "15:00", "child": "leo", "activity": "soccer", "assigned_to": "sarah"}
    - extraction_confidence: 0.78 (≥ 0.6 threshold ✓)

  3. Brain Query
    - Query: "What time is Leo's soccer practice?"
    - Response: "Leo's soccer practice is Tuesdays and Thursdays at 4:30 PM at Westside Park"
    - brain_relevance: 0.82 (≥ 0.5 threshold ✓, use context)

  4. Redline Check
    - Message says: 3pm pickup, soccer implied
    - Brain says: Soccer is 4:30pm (1.5h later)
    - No direct conflict, BUT time gap is suspicious

  5. Decision: SPEAK (Rule 3 variant — clarification needed)
    "⏳ Leo pickup 3pm for soccer, but practice is 4:30pm. Different activity?"


Example 2: Tripwire Drop (No Processing)

Chat Message (John → Group):

"Got the oil changed today. Next due in July."

Step-by-Step:

  1. Tripwire Match
    - Patterns: "today", "July", "oil changed"
    - tripwire_confidence: 0.65 (< 0.7 threshold ✗)
    - Action: DROP — not coordination-related

Example 3: Multi-Message Context Resolution (Silent)

Message A (Sarah, 09:00):

"Who's picking up Leo Thursday after soccer?"

Message B (John, 09:04):

"I'll do it"

Step-by-Step:

  1. Message A Tripwire
    - tripwire_confidence: 0.88 (question about assignment)
    - Queued for Tier 2
    - Extraction: {"date": "2026-05-07", "child": "leo", "activity": "soccer", "assigned_to": "unspecified"}
    - Stored with thread_id

  2. Message B Tripwire (Context-Aware)
    - Base score: 0.45 (just "I'll do it", low confidence alone)
    - Coordination-thread boost: +0.15 (Message A asked an assignment question)
    - Short-response boost: +0.20 (five words or fewer in a coordination thread)
    - Final tripwire_confidence: 0.80 (≥ 0.7 threshold ✓)
    - Action: Queue for Tier 2 with context window

  3. Tier 2 with Context
    - Fetches Message A via thread_id
    - Attribution logic: John (sender of B) → assignment from A
    - Final extraction: {"date": "2026-05-07", "child": "leo", "activity": "soccer", "assigned_to": "john", "attribution_confidence": 0.85}

  4. Brain Query
    - Query: "When is Leo's soccer practice?"
    - brain_relevance: 0.91 (confirms Thursday 4:30pm)

  5. Redline Check
    - Assigned to: John
    - Brain context: No conflict detected
    - Decision: SILENT

  6. Store as Pending
    - Type: care_assignment
    - Summary: "John assigned to pick up Leo from soccer Thursday 4:30pm"
    - [Confirm] button available


Example 4: Resource Conflict (Speak)

Chat Message 1 (John, 09:00):

"I can cover Tuesday pickup"

Chat Message 2 (Sarah, 09:15):

"I'll get the kids Tuesday after work"

Step-by-Step:

  1. Tripwire Matches
    - Both messages match assignment patterns
    - tripwire_confidence: 0.79, 0.82
    - Both queued for Tier 2

  2. Async Processing
    - Extract John: {"date": "2026-05-05", "assigned_to": "john", "task": "pickup"}
    - extraction_confidence: 0.74
    - Extract Sarah: {"date": "2026-05-05", "assigned_to": "sarah", "task": "pickup"}
    - extraction_confidence: 0.81

  3. Brain Query
    - Query for both: "Who usually picks up kids on Tuesdays?"
    - brain_relevance: 0.76 (historical pattern: usually John)

  4. Redline Check
    - Same date
    - Same task (pickup)
    - Different assignees
    - RESOURCE CONFLICT (Rule 2)

  5. Decision: SPEAK
    "🚨 John and Sarah both claim Tuesday pickup. Who's covering?" [John] [Sarah] [Both — carpool]


Database Schema

observer_messages — Raw Chat Log

CREATE TABLE observer_messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    message_id TEXT UNIQUE NOT NULL,        -- Telegram message ID
    chat_id TEXT NOT NULL,                   -- Group chat ID
    thread_id TEXT,                          -- Links messages in same conversation
    sender TEXT NOT NULL,                    -- john | sarah | icarus
    message_text TEXT NOT NULL,
    sent_at TIMESTAMP NOT NULL,
    context_window TEXT,                     -- JSON array of prior message_ids

    -- Tier 1 Tripwire (Independent confidence)
    tripwire_confidence REAL,                -- 0.0-1.0, threshold 0.7
    tripwire_patterns TEXT,                  -- JSON array of matched patterns
    tier1_decision TEXT,                     -- DROP | QUEUE

    -- Tier 2 Processing
    processed_at TIMESTAMP,
    extracted_json TEXT,                     -- Full extraction JSON
    attribution_source TEXT,                 -- message_id this was resolved from

    -- Brain Query (Independent relevance score)
    brain_queries TEXT,                      -- JSON array of Brain queries made
    brain_context TEXT,                      -- Brain responses
    brain_relevance REAL,                    -- 0.0-1.0, threshold 0.5

    -- Redline Decision (Rule-based, no confidence score)
    redline_decision TEXT,                   -- SILENT | SPEAK
    redline_trigger_rule TEXT,               -- temporal_conflict | resource_conflict | missing_variable

    -- Outcome
    action_id TEXT,                          -- FK to pending_actions (if SILENT)
    notification_id TEXT,                    -- Telegram message ID (if SPEAK)

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_observer_thread ON observer_messages(thread_id, sent_at);
CREATE INDEX idx_observer_tripwire ON observer_messages(tripwire_confidence, tier1_decision);
CREATE INDEX idx_observer_decision ON observer_messages(redline_decision, processed_at);

observer_confidence_log — Debug Trail

CREATE TABLE observer_confidence_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    message_id TEXT NOT NULL,
    stage TEXT NOT NULL,                    -- 'tripwire' | 'extraction' | 'brain'
    confidence_type TEXT,                   -- 'regex_score' | 'llm_score' | 'relevance'
    score REAL,                             -- NULL when the stage produced no score (e.g. Brain query timeout)
    threshold REAL NOT NULL,
    passed BOOLEAN NOT NULL,
    decision TEXT NOT NULL,                 -- 'proceed' | 'drop' | 'proceed_no_context'
    logged_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_confidence_msg ON observer_confidence_log(message_id, stage);

pending_actions — Human-in-the-Loop Queue

CREATE TABLE pending_actions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    action_id TEXT UNIQUE NOT NULL,          -- UUID for this action

    -- Source
    source_type TEXT NOT NULL,               -- OBSERVER | EMAIL_MANUAL | etc
    source_id TEXT,                          -- FK to observer_messages or email_id

    -- Extracted State (what the system thinks it heard)
    action_type TEXT NOT NULL,               -- maintenance_update | event_create | 
                                             -- care_assignment | schedule_change
    extracted_json TEXT NOT NULL,            -- Full extraction
    extraction_confidence REAL NOT NULL,     -- Independent confidence (not blended)

    -- Human Review State
    status TEXT NOT NULL DEFAULT 'pending',  -- pending | confirmed | rejected | expired
    suggested_event_json TEXT,               -- What would be written to Event Graph

    -- UI/UX (Brevity constraint enforced)
    summary_text TEXT NOT NULL,              -- Human-readable summary (max 140 chars)
    confirm_payload TEXT,                    -- JSON to send on confirm

    -- Expiration
    expires_at TIMESTAMP,                    -- Auto-expire if not confirmed
    confirmed_at TIMESTAMP,
    confirmed_by TEXT,                       -- john | sarah

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_pending_status ON pending_actions(status, expires_at);
CREATE INDEX idx_pending_source ON pending_actions(source_type, source_id);
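
The State Protection constraint means nothing reaches the Event Graph until a human confirms. A minimal sketch of that flow against a reduced copy of `pending_actions` follows. The function names, the in-memory database, and the list standing in for the Event Graph are all assumptions.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pending_actions (
        action_id TEXT UNIQUE NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending',
        summary_text TEXT NOT NULL,
        confirm_payload TEXT,
        confirmed_at TIMESTAMP,
        confirmed_by TEXT
    )
""")
event_graph = []  # stand-in for the real Event Graph

def store_pending(summary_text, payload):
    """Queue an extraction for review. Nothing is written to the Event Graph here."""
    if len(summary_text) > 140:
        raise ValueError("summary_text exceeds the 140-character limit")
    action_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO pending_actions (action_id, summary_text, confirm_payload) "
        "VALUES (?, ?, ?)",
        (action_id, summary_text, json.dumps(payload)),
    )
    return action_id

def confirm(action_id, user):
    """Apply the payload only if the action is still pending. Returns True on success."""
    cur = conn.execute(
        "UPDATE pending_actions SET status = 'confirmed', "
        "confirmed_at = CURRENT_TIMESTAMP, confirmed_by = ? "
        "WHERE action_id = ? AND status = 'pending'",
        (user, action_id),
    )
    if cur.rowcount != 1:
        return False  # unknown, expired, rejected, or already confirmed
    row = conn.execute(
        "SELECT confirm_payload FROM pending_actions WHERE action_id = ?",
        (action_id,),
    ).fetchone()
    event_graph.append(json.loads(row[0]))
    return True

action_id = store_pending(
    "John assigned to pick up Leo from soccer Thursday 4:30pm",
    {"assigned_to": "john", "child": "leo", "activity": "soccer"},
)
writes_before_confirm = len(event_graph)
first_confirm = confirm(action_id, "john")
second_confirm = confirm(action_id, "sarah")
```

Guarding the update with `status = 'pending'` makes a double tap on [Confirm] harmless: the second call changes no rows and writes nothing.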

Error Handling & Fallback Paths

1. Brain Query Timeout

# If Brain doesn't respond in 5s
if brain_response.status == "timeout":
    # Log independent brain_relevance as NULL
    log_confidence(message_id, stage="brain", score=None, threshold=0.5, 
                   passed=False, decision="proceed_no_context")

    # Degrade gracefully: proceed without Brain context
    extraction["brain_context"] = None
    extraction["brain_relevance"] = None

    # Decision: evaluate redline on the extraction alone (no Brain context)
    redline_decision = evaluate_redline(extraction, has_brain_context=False)
    reason = "brain_timeout_degraded"
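
How the timeout itself is detected is not specified above. One option, sketched here with the standard library only, runs the Brain call in a worker thread and gives up after the deadline. The function name `query_brain_with_timeout` and the injected `fetch` callable are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def query_brain_with_timeout(fetch, question, timeout_s=5.0):
    """Return fetch(question), or None if it does not finish within timeout_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch, question)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return None  # caller logs proceed_no_context and continues
    finally:
        pool.shutdown(wait=False)  # do not block on a slow Brain call

def fast_brain(question):
    return {"relevance_score": 0.9}

def slow_brain(question):
    time.sleep(0.3)
    return {"relevance_score": 0.9}

fast_result = query_brain_with_timeout(fast_brain, "When is soccer?", timeout_s=1.0)
slow_result = query_brain_with_timeout(slow_brain, "When is soccer?", timeout_s=0.05)
```

If the Brain client is an HTTP library with its own timeout parameter, using that directly is simpler; the thread-based version is shown only because it works with any callable.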

2. Brain Returns Low Relevance

# If top score < 0.5
if brain_response.relevance_score < 0.5:
    # Log independent brain_relevance
    log_confidence(message_id, stage="brain", score=brain_response.relevance_score, 
                   threshold=0.5, passed=False, decision="proceed_no_context")

    # No useful context found
    extraction["brain_context"] = None

    # Decision: Depends on extraction_confidence (independent gate)
    if extraction["extraction_confidence"] > 0.8:
        redline_decision = evaluate_redline(extraction, has_brain_context=False)
    else:
        redline_decision = "SILENT"
        reason = "low_confidence_no_brain"

3. Tier 2 Worker Crash

# Celery retry with exponential backoff
@celery.task(bind=True, max_retries=3)
def process_tier2(self, message_id):
    try:
        ...  # processing logic
    except Exception as e:
        # After 3 failed retries: mark for manual review instead of retrying
        if self.request.retries >= self.max_retries:
            mark_for_manual_review(message_id, error=str(e))
            return
        # Retry in 30s, 60s, 120s (30 * 2^retries)
        raise self.retry(exc=e, countdown=30 * (2 ** self.request.retries))
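
The countdown expression `30 * (2 ** self.request.retries)` doubles the delay on each attempt. A small sketch that prints the resulting schedule, useful when tuning the base delay; `retry_countdowns` is a hypothetical helper, not part of the worker:

```python
def retry_countdowns(base_s: int = 30, max_retries: int = 3) -> list[int]:
    """Return the delay in seconds before each retry attempt.

    Attempt numbers start at 0, matching Celery's self.request.retries.
    """
    return [base_s * (2 ** attempt) for attempt in range(max_retries)]


print(retry_countdowns())  # [30, 60, 120]
```

With the defaults, a message that keeps failing is handed to manual review about three and a half minutes after its first failure.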

4. Duplicate Detection

# Prevent duplicate pending actions for same event
def check_duplicate(extraction):
    similar = db.query("""
        SELECT * FROM pending_actions 
        WHERE action_type = ? 
        AND json_extract(extracted_json, '$.dates') = ?
        AND json_extract(extracted_json, '$.child') = ?
        AND status = 'pending'
        AND created_at > datetime('now', '-1 hour')
    """, [extraction["type"], extraction["dates"], extraction["child"]])

    if similar:
        # Merge or skip
        return "duplicate_detected"
    return None
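
The query above compares `$.dates` and `$.child` by exact equality, so differences in date order, casing, or whitespace would let a duplicate through. One possible hardening, which is an assumption and not something the spec prescribes, is to compare a normalized fingerprint instead. The `dedupe_key` helper below is hypothetical:

```python
import hashlib
import json


def dedupe_key(extraction: dict) -> str:
    """Build a stable fingerprint from the fields the duplicate check compares."""
    dates = extraction.get("dates") or []
    if isinstance(dates, str):
        dates = [dates]  # tolerate a single date given as a string
    canonical = {
        "type": extraction.get("type"),
        "dates": sorted(dates),  # order-insensitive
        "child": (extraction.get("child") or "").strip().lower(),
    }
    payload = json.dumps(canonical, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two extractions with the same action type, the same set of dates, and the same child then map to the same key, which could be stored in its own column and matched with a plain equality check.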

Staging Environment Configuration

File: .env.staging.observer

# Environment
ENV=staging
DB_PATH=./data/staging.db
FAMILY_CONTEXT=./TEST_FAMILY_CONTEXT.md

# Telegram (Test Bot)
TELEGRAM_BOT_TOKEN=${TELEGRAM_TEST_BOT_TOKEN}
TELEGRAM_GROUP_CHAT_ID=${TELEGRAM_TEST_GROUP_ID}

# Brain Integration
BRAIN_BASE_URL=https://icarus-test.hoffdesk.com
BRAIN_QUERY_ENDPOINT=/brain/query
BRAIN_TIMEOUT_MS=5000
BRAIN_MIN_RELEVANCE=0.5              # Gate threshold for brain_relevance

# LLM (Gaming PC via Tailscale)
OLLAMA_HOST=http://matt-pc.tail864e81.ts.net:11434
OLLAMA_MODEL_TIER2=phi4:14b

# Tripwire
TRIPWIRE_THRESHOLD=0.7               # Independent gate threshold
EXTRACTION_THRESHOLD=0.6             # Independent gate threshold
TRIPWIRE_PATTERNS_PATH=./config/tripwire_patterns.json

# Multi-Message Context
CONTEXT_WINDOW_SIZE=3
CONTEXT_TTL_SECONDS=300
ATTRIBUTION_WINDOW_SECONDS=600

# Async Processing
REDIS_URL=redis://localhost:6379/0
CELERY_WORKERS=2

# Redline Rules
REDLINE_TEMPORAL_CONFLICT=true
REDLINE_RESOURCE_CONFLICT=true
REDLINE_MISSING_VARIABLE=true
SPEAK_ON_MISSING_CRITICAL=true

# Limits
MAX_BRAIN_QUERIES_PER_MESSAGE=3
PENDING_ACTION_TTL_HOURS=48
MAX_MESSAGES_PER_MINUTE=60

# Brevity Constraint (Daedalus UX Spec)
MAX_CONFLICT_MESSAGE_LENGTH=280
MAX_CLARIFICATION_LENGTH=200
MAX_SUMMARY_LENGTH=140
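
Several values in this file carry trailing `#` comments, so a loader has to strip them before converting types. A minimal sketch of parsing the gate thresholds, assuming a hand-rolled parser; `parse_env`, `GateConfig`, and `load_gate_config` are hypothetical names, and the project may well use an existing dotenv library instead:

```python
from dataclasses import dataclass


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, dropping blank lines and '#' comments.

    Note: this naive split would also truncate values that contain '#'.
    """
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


@dataclass(frozen=True)
class GateConfig:
    tripwire_threshold: float
    extraction_threshold: float
    brain_min_relevance: float
    brain_timeout_s: float


def load_gate_config(env: dict[str, str]) -> GateConfig:
    """Convert raw strings to typed values and validate the 0.0-1.0 range."""
    cfg = GateConfig(
        tripwire_threshold=float(env["TRIPWIRE_THRESHOLD"]),
        extraction_threshold=float(env["EXTRACTION_THRESHOLD"]),
        brain_min_relevance=float(env["BRAIN_MIN_RELEVANCE"]),
        brain_timeout_s=int(env["BRAIN_TIMEOUT_MS"]) / 1000,
    )
    for name in ("tripwire_threshold", "extraction_threshold", "brain_min_relevance"):
        value = getattr(cfg, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within 0.0-1.0, got {value}")
    return cfg
```

Failing fast on an out-of-range threshold keeps a typo in the staging file from silently opening or closing a gate.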

Miller Family Test Context

Test Data Pre-Seeded:
- Leo's soccer: Tuesdays/Thursdays 4:30 PM @ Westside Park
- Mia's ballet: Mondays/Wednesdays 4:00 PM @ Madison Dance Academy
- Mia's swim: Saturdays 10:00 AM @ YMCA
- Leo's chess: Wednesdays 3:30 PM
- Spring break: March 24-28, 2026
- Honda Civic service: Due April 15, 2026


Implementation Phases

Week 0.5: Data Collection & Labeling (NEW — Critical)

Before any implementation:

  • [ ] Record 100 real messages from Family Logistics group
  • [ ] Hand-label for coordination signals:
      • is_coordination: true/false
      • coordination_type: transport | care_coverage | schedule_change | activity_confirm | none
      • attribution_required: true/false (for multi-message context)
  • [ ] Extract noisy data examples:
      • Typos: "socer", "pratice", "balet"
      • Abbreviations: "wk" (week), "2moro", "thx"
      • Emoji: "👍", "🙋", "✅"
      • Half-sentences: "yeah 3pm", "me", "i'll do it"
      • Mixed signals: "Leo soccer thursday??" (question vs statement)
  • [ ] Tune tripwire regex against noisy data
  • [ ] Establish baseline precision/recall before code is written

Noisy Data Test Cases (TC-006 to TC-020):

| TC | Message | Expected | Notes |
|---|---|---|---|
| TC-006 | "socer tmrw 4" | Queue | Typo: "socer", abbreviation: "tmrw" |
| TC-007 | "👍" (reply to coordination) | Queue | Emoji-only in context |
| TC-008 | "yeah 3pm" | Queue | Half-sentence, needs context |
| TC-009 | "i'll do it" | Queue | Pronoun reference, needs attribution |
| TC-010 | "wk pickup" | Queue | Abbreviation: "wk" |
| TC-011 | "pratice is cancled" | Queue | Multiple typos |
| TC-012 | "thx!" | Drop | Gratitude, not coordination |
| TC-013 | "Mia has bailt monday" | Queue | Typo in activity name |
| TC-014 | "u sure?" | Depends | Question, usually drop unless context says otherwise |
| TC-015 | "4:30 work for me" | Queue | Time-only, needs context |
| TC-016 | "🙋‍♀️" | Queue | Self-assignment via emoji |
| TC-017 | "kk" | Depends | Acknowledgment, context-dependent |
| TC-018 | "Leo dentist 2mrw??" | Queue | Question mark + abbreviation |
| TC-019 | "n/m" | Drop | "Never mind": cancels previous |
| TC-020 | "omw" | Queue | "On my way": active coordination |
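
Cases such as TC-006 and TC-011 will not match a regex written for clean spellings. One option is a normalization pass before the tripwire runs, sketched here with the standard library's `difflib`. The abbreviation map, the vocabulary list, and the 0.8 cutoff are illustrative assumptions to be tuned against the Week 0.5 labeled data:

```python
import difflib

# Hypothetical lookup tables; extend from the labeled data set.
ABBREVIATIONS = {"tmrw": "tomorrow", "2moro": "tomorrow", "2mrw": "tomorrow", "wk": "week"}
ACTIVITY_VOCAB = ["soccer", "ballet", "swim", "chess", "practice", "dentist"]


def normalize_message(text: str, cutoff: float = 0.8) -> str:
    """Expand known abbreviations and repair near-miss activity names."""
    out = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
            continue
        if len(token) >= 4:  # skip short tokens to avoid spurious matches
            match = difflib.get_close_matches(token, ACTIVITY_VOCAB, n=1, cutoff=cutoff)
            if match:
                out.append(match[0])
                continue
        out.append(token)
    return " ".join(out)


print(normalize_message("socer tmrw 4"))  # soccer tomorrow 4
```

Tokens with attached punctuation (for example "2mrw??") are not handled by this sketch and would need to be stripped first. Lowering the cutoff catches more typos at the cost of more false corrections.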

Phase 8.1: Tripwire + Logger (Week 1)

  • [ ] Implement Tier 1 Tripwire (regex) with independent confidence scoring
  • [ ] Create observer_messages table with thread support
  • [ ] Create observer_confidence_log table for debugging
  • [ ] Log all messages; expect roughly 95% to be marked DROP
  • [ ] Deploy to staging, run for 3 days
  • [ ] Measure: false positive rate against Week 0.5 labeled data (target <5%)

Phase 8.2: Async Tier 2 (Week 2)

  • [ ] Implement Celery worker for Tier 2
  • [ ] Integration with Brain query endpoint (independent relevance scoring)
  • [ ] Create pending_actions table
  • [ ] Implement [Confirm] button flow
  • [ ] Deploy to staging, run for 3 days

Phase 8.3: Redline Rules + Multi-Message Context (Week 3) — P0 Priority

  • [ ] Implement temporal conflict detection (Rule 1)
  • [ ] Implement resource conflict detection (Rule 2)
  • [ ] Implement missing variable detection (Rule 3)
  • [ ] Add multi-message context window (3 messages)
  • [ ] Implement attribution logic for pronoun resolution
  • [ ] Add SPEAK pathway for conflicts
  • [ ] Enforce brevity constraints (Daedalus UX spec)
  • [ ] Deploy to staging with Miller family
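
For the brevity item, the limits come from the staging configuration (`MAX_CONFLICT_MESSAGE_LENGTH=280`, `MAX_CLARIFICATION_LENGTH=200`, `MAX_SUMMARY_LENGTH=140`). How the Daedalus UX spec wants over-long text handled is not stated here; the sketch below assumes simple truncation with an ellipsis, and `enforce_brevity` is a hypothetical helper:

```python
def enforce_brevity(text: str, limit: int) -> str:
    """Collapse whitespace and truncate to at most `limit` characters."""
    text = " ".join(text.split())
    if len(text) <= limit:
        return text
    # Reserve one character for the ellipsis marker.
    return text[: limit - 1].rstrip() + "…"
```

An alternative would be to reject over-long messages and regenerate them, which avoids cutting a sentence in the middle.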

Multi-Message Test Scenarios:

| Scenario | Messages | Expected | Confidence / Notes |
|---|---|---|---|
| TC-021 | A: "Who's getting Leo?" / B: "me" | Attribution → John | extraction_confidence ≥ 0.75 |
| TC-022 | A: "Soccer pickup Thursday?" / B: "I'll do it" / C: "thx" | Attribution + silent | C drops |
| TC-023 | A: "Mia ballet Monday" / B: "I'll cover" / C: "👍" | Attribution + silent | C is emoji acknowledgment |
| TC-024 | A: "Both kids Tuesday" / B: "I got Leo" / C: "I got Mia" | Double attribution + silent | Unless times conflict |
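
A rough sketch of the attribution step for scenarios like TC-021 and TC-022, using the 3-message context window and the 10-minute attribution window. It assumes an open request is any earlier message from another sender that ends in a question mark, which is a simplification (TC-023 would need a broader rule). `ChatMessage` and `AttributionWindow` are hypothetical names:

```python
import re
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChatMessage:
    sender: str
    text: str
    ts: float  # seconds since epoch


# Short self-assignment replies, taken from the test scenarios.
SELF_ASSIGN = re.compile(r"^\s*(me|i'll do it|i got it|i'll cover)\b", re.IGNORECASE)


class AttributionWindow:
    def __init__(self, size: int = 3, window_s: float = 600.0):
        self.recent: deque = deque(maxlen=size)  # CONTEXT_WINDOW_SIZE
        self.window_s = window_s                 # ATTRIBUTION_WINDOW_SECONDS

    def observe(self, msg: ChatMessage) -> Optional[dict]:
        """Return an attribution if msg answers a recent open request."""
        result = None
        if SELF_ASSIGN.search(msg.text):
            for prior in reversed(self.recent):  # newest first
                is_open_request = prior.text.rstrip().endswith("?")
                in_window = msg.ts - prior.ts <= self.window_s
                if prior.sender != msg.sender and in_window and is_open_request:
                    result = {"assignee": msg.sender, "request": prior.text}
                    break
        self.recent.append(msg)
        return result
```

Feeding "Who's getting Leo?" from one parent followed by "me" from the other within ten minutes yields an attribution to the second sender; a later "thx" yields nothing.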

Phase 8.4: Integration Test (Week 4)

  • [ ] Run full test scenarios (TC-001 to TC-024)
  • [ ] Measure: independent confidence gate accuracy
  • [ ] Measure: multi-message attribution accuracy
  • [ ] Measure: noisy data handling
  • [ ] Measure: family comfort metrics (see Success Metrics below)
  • [ ] Adjust thresholds based on results
  • [ ] Document learnings for production

Success Metrics

Primary Metric: Family Comfort (Non-Negotiable)

UX Acceptance Gate:

"Aundrea has never asked 'why did the bot just say that?' for 7 consecutive days"

This is the only metric that matters. Technical precision/recall are secondary to human experience.

Secondary Metrics

| Metric | Target | Measurement |
|---|---|---|
| Tripwire Precision | >95% | False positive rate on labeled data (Week 0.5) |
| Tripwire Recall | >90% | % of coordination messages caught |
| Extraction Confidence Accuracy | >85% | Correlation between extraction_confidence and human-verified correctness |
| Brain Relevance Accuracy | >80% | Correlation between brain_relevance and context usefulness |
| Multi-Message Attribution | >85% | Correct "I'll do it" → prior action resolution |
| Brain Query Latency | <3s | Average response time |
| Brain Query Success | >95% | % of queries returning valid response |
| Redline Accuracy | >90% | Correct SPEAK vs SILENT decisions (measured by human review) |
| Pending Confirmation Rate | >70% | % of silent extractions eventually confirmed |
| Chat Interruptions | <2/day | SPEAK messages per day (excluding test scenarios) |
| Aundrea Confusion Events | 0 in 7 days | Number of times primary user questions bot behavior |
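
Tripwire precision and recall can be computed directly from the Week 0.5 hand labels. A minimal sketch, assuming each message has a boolean tripwire decision (queued or not) and a boolean `is_coordination` label; the function name is hypothetical:

```python
def precision_recall(predicted: list[bool], labeled: list[bool]) -> tuple[float, float]:
    """Compute (precision, recall) for queue decisions against hand labels."""
    if len(predicted) != len(labeled):
        raise ValueError("predicted and labeled must have the same length")
    tp = sum(1 for p, l in zip(predicted, labeled) if p and l)
    fp = sum(1 for p, l in zip(predicted, labeled) if p and not l)
    fn = sum(1 for p, l in zip(predicted, labeled) if not p and l)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Precision answers "of the messages queued, how many were real coordination", while recall answers "of the real coordination messages, how many were queued".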

Anti-Metric (What NOT to Optimize)

  • Don't optimize for: Message extraction volume
  • Don't optimize for: Brain query coverage
  • Don't optimize for: Total features detected

Optimize for: Family comfort. Full stop.


Security & Privacy

Data Handling

  • Time-limited message storage: Raw chat messages are purged after 30 days
  • No PII in Brain queries: Queries are abstract ("soccer schedule" not "where is Leo")
  • Local-first: All processing on Beelink/Gaming PC, no cloud LLM for chat

Telegram Privacy

  • Privacy Mode: Disabled for test bot (BotFather → Group Privacy → Off)
  • Bot can see: All group messages
  • Bot cannot: See direct messages unless replied

Staging Isolation

  • Database: staging.db completely separate from prod.db
  • Bot: Separate test bot token
  • Family: Miller family test data only
  • No production data: Observer will NEVER see real Hoffmann family chat

Open Questions

  1. Conflict Window: How close must two events be to count as a conflict? Same day? ±2 hours? → Resolved: Same day for now; temporal overlap detection TBD
  2. Override: Can a parent "force" a silent extraction to speak? (e.g., urgent) → Open: Not for Phase 8
  3. Learning: Should Observer learn from confirmed/rejected actions? (Phase 8.5?) → Open: Post-family-adoption
  4. Multi-message Context: Should attribution expire? → Resolved: 10-minute window

Appendix A: Tripwire Pattern Reference

{
  "tripwire_patterns": {
    "temporal": [
      "\\b(tomorrow|today|next week|this week)\\b",
      "\\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \\d{1,2}\\b",
      "\\b(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\\b",
      "\\b\\d{1,2}:\\d{2}\\b",
      "\\b\\d{1,2}(\\s)?(am|pm)\\b"
    ],
    "assignment": [
      "\\b(I('ll| will)|can) (pick up|get|cover|take|bring)\\b",
      "\\b(who('s| is) (getting|picking up|covering))\\b",
      "\\b(cover( for)?|swap|switch)\\b",
      "\\b(me|me too|i'll do it|i got it)\\b"
    ],
    "pronoun_attribution": [
      "\\b(i'll|i will|i can|i'm|i am|me|me too)\\b"
    ],
    "children": [
      "\\b(Leo|Mia)\\b",
      "\\b(kids?|children)\\b",
      "\\b(son|daughter)\\b"
    ],
    "activities": [
      "\\b(soccer|ballet|swim|chess|practice|lesson|class)\\b",
      "\\b(game|match|meet|recital|performance)\\b"
    ],
    "conflict_markers": [
      "\\b(but|wait|hold on|doesn't|didn't|conflict|overlap)\\b",
      "\\b(what about|how about)\\b"
    ],
    "noise_patterns": [
      "\\b(thx|thanks|ty|kk|omw)\\b",
      "(👍|✅|🙋)"
    ]
  }
}
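
A sketch of how these patterns might be applied, using a subset of the categories above and case-insensitive matching (an assumption, since the patterns mix "I" and "i'll"). It only reports which categories fired; how `tripwire_confidence` is derived from the matches is not specified in this appendix, so no score is computed:

```python
import re

# Subset of the patterns above, written as Python raw strings.
TRIPWIRE_PATTERNS = {
    "temporal": [r"\b(tomorrow|today|next week|this week)\b", r"\b\d{1,2}:\d{2}\b"],
    "assignment": [r"\b(I('ll| will)|can) (pick up|get|cover|take|bring)\b"],
    "children": [r"\b(Leo|Mia)\b"],
    "activities": [r"\b(soccer|ballet|swim|chess|practice|lesson|class)\b"],
}

# Compile once at startup; Tier 1 must stay cheap per message.
COMPILED = {
    category: [re.compile(p, re.IGNORECASE) for p in patterns]
    for category, patterns in TRIPWIRE_PATTERNS.items()
}


def matched_categories(text: str) -> set[str]:
    """Return the pattern categories with at least one match in text."""
    return {
        category
        for category, patterns in COMPILED.items()
        if any(p.search(text) for p in patterns)
    }
```

When the patterns are loaded from the JSON file instead, `json.loads` turns each `\\b` into the `\b` that `re.compile` expects.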

Appendix B: Mermaid Diagram (Simplified)

flowchart TD
    A[Family Chat Message] --> B{Tier 1 Tripwire}
    B -->|tripwire_confidence < 0.7| C[Drop Silently]
    B -->|tripwire_confidence >= 0.7| D[Queue for Tier 2]

    D --> E[Async Tier 2 Worker]
    E --> F[Extract Coordination State]
    F -->|extraction_confidence < 0.6| C
    F -->|extraction_confidence >= 0.6| G[Query Brain for Context]

    G --> H{brain_relevance >= 0.5?}
    H -->|Yes| I[Redline with Brain Context]
    H -->|No| J[Redline Extraction-Only]

    I --> K{Redline Decision}
    J --> K

    K -->|No Conflict| L[Store as Pending]
    K -->|Conflict Detected| M[Speak in Chat]

    L --> N[[Confirm Button]]
    N -->|Confirmed| O[Write to Event Graph]
    N -->|Rejected| P[Discard]

    M --> Q[[Resolve Button]]
    Q --> R[Update Event Graph]
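
The first half of the flowchart can be read as three sequential checks, each against its own threshold. A minimal sketch of that routing using the staging thresholds; the function name and the returned labels are hypothetical:

```python
from typing import Optional

TRIPWIRE_THRESHOLD = 0.7
EXTRACTION_THRESHOLD = 0.6
BRAIN_MIN_RELEVANCE = 0.5


def route(
    tripwire_confidence: float,
    extraction_confidence: float,
    brain_relevance: Optional[float],
) -> str:
    """Apply each gate independently; scores are never multiplied or blended."""
    if tripwire_confidence < TRIPWIRE_THRESHOLD:
        return "drop"
    if extraction_confidence < EXTRACTION_THRESHOLD:
        return "drop"
    # A Brain timeout is represented as None and falls through to extraction-only.
    if brain_relevance is not None and brain_relevance >= BRAIN_MIN_RELEVANCE:
        return "redline_with_brain_context"
    return "redline_extraction_only"
```

A message scoring 0.75 at every stage passes all three gates, even though the product of the three scores (about 0.42) would fail any of the thresholds if the scores were blended.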

The conversation is the interface. The brain provides memory. The observer knows when to speak. The family never asks "why did the bot just say that?"