# Silent Observer — Brain Intelligence Integration Spec

**Status:** Draft v1.1 (Director Revisions)
**Date:** 2026-04-30
**Author:** Wadsworth (Chief of Staff)
**Owner:** Socrates (Backend Architecture)
**Scope:** Phase 8 Integration Layer

---

## Executive Summary

This document specifies the integration architecture between the **Silent Observer** (Phase 8 — ambient chat monitoring) and the **Brain Intelligence** (Phase 6/7 — RAG-based knowledge retrieval). Together, they enable zero-UI household coordination where:

- The Observer listens to family chat without interrupting
- The Brain provides memory context for decisions
- Icarus speaks only when there's a real conflict or missing critical variable

**Non-Negotiable Constraints (Director-Level):**

1. **State Protection (No Auto-Writes):** All extractions require Human-in-the-Loop [Confirm]
2. **Asynchronous Processing:** Tier 1 releases chat thread immediately; Tier 2 is async
3. **The 'Redline' Speak Rule:** Silence is default; speak only on temporal/resource conflicts
4. **Staging Environment Only:** Connects only to staging.db with Miller Family context

---

## Confidence Field Unification

**CRITICAL:** Confidence scores are NEVER multiplied or blended. Each stage has independent gates with clear thresholds.
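
As a rough illustration of what "independent gates" could look like in practice, the sketch below checks each score against its own threshold, in order, and records one log entry per stage that was reached. The thresholds (0.7, 0.6, 0.5) are the ones this spec uses; `GateResult` and `run_gates` are hypothetical names chosen for the example, not part of the Observer codebase, and a `None` Brain score is assumed to mean the Brain query produced no result.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Thresholds as stated in this spec; the constant names are assumptions.
TRIPWIRE_THRESHOLD = 0.7
EXTRACTION_THRESHOLD = 0.6
BRAIN_THRESHOLD = 0.5


@dataclass
class GateResult:
    """One log entry per stage (hypothetical shape, mirrors the log schema)."""
    stage: str                # 'tripwire' | 'extraction' | 'brain'
    score: Optional[float]    # None when the stage produced no score
    threshold: float
    passed: bool
    decision: str             # 'proceed' | 'drop' | 'proceed_no_context'


def run_gates(
    tripwire: float,
    extraction: float,
    brain_relevance: Optional[float],
) -> Tuple[bool, bool, List[GateResult]]:
    """Return (reached_redline, use_brain_context, per-stage log)."""
    log: List[GateResult] = []

    # Stage 1: tripwire gate. A failure here drops the message outright.
    ok = tripwire >= TRIPWIRE_THRESHOLD
    log.append(GateResult("tripwire", tripwire, TRIPWIRE_THRESHOLD, ok,
                          "proceed" if ok else "drop"))
    if not ok:
        return False, False, log

    # Stage 2: extraction gate. Judged on its own score only.
    ok = extraction >= EXTRACTION_THRESHOLD
    log.append(GateResult("extraction", extraction, EXTRACTION_THRESHOLD, ok,
                          "proceed" if ok else "drop"))
    if not ok:
        return False, False, log

    # Stage 3: Brain relevance is optional. A low or missing score never
    # drops the message; it only decides whether Brain context is used.
    use_brain = brain_relevance is not None and brain_relevance >= BRAIN_THRESHOLD
    log.append(GateResult("brain", brain_relevance, BRAIN_THRESHOLD, use_brain,
                          "proceed" if use_brain else "proceed_no_context"))
    return True, use_brain, log
```

Because no score is ever combined with another, a rejected message always has exactly one failing entry at the end of its log, which is the property the rule above is meant to protect.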
### Stage Gates (Sequential, Not Multiplicative) ``` ┌─────────────────────────────────────────────────────────────────────────────┐ │ CONFIDENCE GATE ARCHITECTURE │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ Stage 1: Tripwire Confidence │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ • Calculated by: Regex pattern matcher (Tier 1) │ │ • Threshold: ≥0.7 to proceed to extraction │ │ • Logged field: tripwire_confidence (independent, 0.0-1.0) │ │ • If <0.7: DROP — message not queued for Tier 2 │ │ │ │ ↓ tripwire_confidence ≥ 0.7 │ │ │ │ Stage 2: Extraction Confidence │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ • Calculated by: LLM extraction model (Tier 2) │ │ • Threshold: ≥0.6 for usable extraction │ │ • Logged field: extraction_confidence (independent, 0.0-1.0) │ │ • If <0.6: DROP — low-quality extraction, don't query Brain │ │ │ │ ↓ extraction_confidence ≥ 0.6 │ │ │ │ Stage 3: Brain Relevance Score │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ • Calculated by: Brain RAG retriever │ │ • Threshold: ≥0.5 to use Brain context in redline decision │ │ • Logged field: brain_relevance (independent, 0.0-1.0) │ │ • If <0.5: Proceed without Brain context (extraction-only redline) │ │ │ │ ↓ brain_relevance ≥ 0.5 (optional) │ │ │ │ Stage 4: Redline Decision │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ • Calculated by: Rule-based conflict engine (no ML) │ │ • Output: SILENT | SPEAK │ │ • Logged field: redline_decision, redline_trigger_rule │ │ • Does NOT use confidence scores — uses extracted entities only │ │ │ └─────────────────────────────────────────────────────────────────────────────┘ ``` ### Confidence Logging Schema ```sql -- Each confidence is logged independently for debugging CREATE TABLE observer_confidence_log ( message_id TEXT NOT NULL, stage TEXT NOT NULL, -- 'tripwire' | 'extraction' | 'brain' confidence_type TEXT, -- 'regex_score' | 'llm_score' | 'relevance' score REAL NOT NULL, threshold REAL NOT NULL, passed BOOLEAN NOT NULL, decision TEXT NOT 
NULL, -- 'proceed' | 'drop' | 'proceed_no_context' logged_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ); ```

### Anti-Pattern: NEVER Do This

```python
# WRONG — blending scores creates debugging nightmare
final_confidence = tripwire_score * extraction_score * brain_score
if final_confidence > 0.5:  # What failed? Who knows.
    ...

# CORRECT — each gate independent, clear failure point
if tripwire_score < 0.7:
    log("TRIPWIRE_REJECT", score=tripwire_score, threshold=0.7)
    return DROP

if extraction_score < 0.6:
    log("EXTRACTION_REJECT", score=extraction_score, threshold=0.6)
    return DROP

# Brain relevance is optional — proceed with or without
# (a None score, e.g. after a Brain timeout, must not be compared to a float)
use_brain = brain_relevance is not None and brain_relevance >= 0.5
```

---

## Multi-Message Context Architecture

**Status:** P0 Priority — Moved to Phase 8.3 (was deferred)

**Why:** 60% of real coordination happens across multiple messages. Single-message extraction misses critical context.

### The Problem

```
Message A (09:00): "Who's picking up Leo Thursday after soccer?"
Message B (09:15): "I'll do it"
                    ↑
                    Who is "I"? What activity? What time?

Without Message A, Message B is unresolvable.
```

### Context Window Specification

```python
MULTI_MESSAGE_CONFIG = {
    "context_window_size": 3,        # Last 3 messages in thread
    "context_ttl_seconds": 300,      # Messages expire after 5 minutes
    "attribution_window": 600,       # Look back 10 min for matching context
    "max_participants": 4,           # Family group size limit
}
```

### Attribution Logic

```python
import re

# Word-boundary match, so "me" does not fire inside "time" or "home"
PRONOUN_RE = re.compile(r"\b(i'll|i will|i can|i'm|i am|me)\b", re.IGNORECASE)
UNASSIGNED = (None, "unspecified")


def resolve_attribution(current_message, context_window):
    """
    Match "I'll do it" to the action in prior messages.

    Returns: the extraction for current_message, with assigned_to resolved
    to the sender when a matching prior message is found
    """
    # Step 1: Extract from current message
    current = extract(current_message)

    if current.get("assigned_to") not in UNASSIGNED:
        # Already has assignment (e.g., "John will pick up")
        return current

    # Step 2: Look for pronoun references in current message
    if not PRONOUN_RE.search(current_message.text):
        # Not an assignment message
        return current

    # Step 3: Search context window for matching coordination
    for prior in reversed(context_window):
        prior_extracted = prior.get("extraction", {})

        # Match criteria:
        # 1. Prior has missing assignment (who's/who is/can someone)
        # 2. Same temporal scope (date overlap)
        # 3. Same activity or child mentioned
        # Assumption: the helpers below treat a field that is missing from the
        # current message as compatible, since a bare "I'll do it" carries no
        # date, activity, or child of its own.
        if (
            prior_extracted.get("assigned_to") == "unspecified"
            and dates_overlap(current.get("dates"), prior_extracted.get("dates"))
            and (
                activities_match(current.get("activity"), prior_extracted.get("activity"))
                or children_match(current.get("child"), prior_extracted.get("child"))
            )
        ):
            # Attribution found
            current["assigned_to"] = current_message.sender  # "I" = sender of Message B
            current["attributed_to_message_id"] = prior["message_id"]
            current["attribution_confidence"] = 0.85
            return current

    # No match found — extraction incomplete
    current["assigned_to"] = "unspecified"
    current["attribution_confidence"] = 0.3
    return current
```

### Context-Aware Tripwire ```python # Tripwire now operates on message thread, not just single message def tripwire_with_context(message, context_window): # Score current message alone base_score = pattern_match_score(message.text) # Boost score if context suggests coordination thread if is_coordination_thread(context_window): # Prior messages contained questions about assignments base_score = min(1.0, base_score + 0.15) # Boost if current message is short response in context thread if ( len(message.text.split()) <= 5 and # "I'll do it", "yeah me", "sure" is_coordination_thread(context_window) ): base_score =
min(1.0, base_score + 0.20) return base_score ``` ### Database Schema Updates ```sql -- Track message threading for multi-message context ALTER TABLE observer_messages ADD COLUMN thread_id TEXT; ALTER TABLE observer_messages ADD COLUMN context_window TEXT; -- JSON array of prior message_ids ALTER TABLE observer_messages ADD COLUMN attribution_source TEXT; -- message_id this was resolved from -- Index for thread lookups CREATE INDEX idx_observer_thread ON observer_messages(thread_id, sent_at); ``` --- ## Architecture Overview ``` ┌─────────────────────────────────────────────────────────────────────────────┐ │ FAMILY CHAT (Telegram Group) │ │ Members: John, Sarah, Icarus (bot) │ └─────────────────────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────────────┐ │ TIER 1: TRIPWIRE (Python/Regex) │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ • Latency: <10ms │ │ • Resource: CPU only (Beelink) │ │ • Pattern match: dates, times, coordination keywords │ │ • Context-aware: Scans last 3 messages for thread continuity │ │ • Immediate release of chat thread │ │ │ │ Pattern Categories: │ │ ├── Temporal: "tomorrow", "June 4th", "next week", "Tuesday 3pm" │ │ ├── Assignment: "I'll pick up", "can you cover", "who's getting" │ │ ├── Pronoun Resolution: "I'll do it", "me too", "yeah" (needs context) │ │ ├── Children: "Leo", "Mia", "kids", "the children" │ │ ├── Activities: "soccer", "ballet", "swim", "chess", "practice" │ │ └── Conflict markers: "but", "wait", "doesn't", "conflict", "overlap" │ └─────────────────────────────────────────────────────────────────────────────┘ │ ┌─────────────────┴─────────────────┐ ▼ ▼ NO MATCH (≥95%) MATCH (≤5%) ┌─────────────────────┐ ┌─────────────────────┐ │ Drop Silently │ │ Queue for Tier 2 │ │ Log to observer_ │ │ (Redis/Queue) │ │ messages table │ │ Return 200 OK to │ │ (no processing) │ │ Telegram │ └─────────────────────┘ └─────────────────────┘ │ ▼ 
┌─────────────────────────────────────────────────────────────────────────────┐ │ TIER 2: ASYNC PROCESSING (8B LLM + Brain Query) │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ • Worker: Celery/Background task (Beelink) │ │ • LLM: phi4:14b or llama3.1:8b via Gaming PC (Tailscale) │ │ • Latency: 1-3s (acceptable — async) │ │ • Brain Query: HTTPS to icarus-test.hoffdesk.com/brain/query │ │ • Context: Fetches last 3 messages for multi-message resolution │ └─────────────────────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────────────┐ │ COORDINATION STATE EXTRACTOR │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ │ │ Input: Chat message + Context window (last 3 messages) + Brain context │ │ Output: Structured extraction with independent confidence scores │ │ │ │ { │ │ "is_coordination": true | false, │ │ "coordination_type": "transport" | "care_coverage" | │ │ "schedule_change" | "activity_confirm" | null, │ │ "dates": ["2026-06-04", "2026-06-05"], │ │ "times": ["15:30", "16:00"], │ │ "assigned_to": ["john" | "sarah" | "unspecified"], │ │ "child": ["leo" | "mia" | "both" | null], │ │ "activity": "soccer" | "ballet" | "swim" | "chess" | null, │ │ "location": "Westside Park" | "Madison Dance Academy" | null, │ │ "action_required": "confirm" | "resolve_conflict" | "notify" | null, │ │ "extracted_entities": [...], │ │ "attribution": { │ │ "source_message_id": "...", │ │ "attribution_confidence": 0.85 │ │ }, │ │ "brain_query_context": {...} | null, │ │ "confidence_scores": { │ │ "tripwire_confidence": 0.85, │ │ "extraction_confidence": 0.72, │ │ "brain_relevance": 0.68 │ │ } │ │ } │ └─────────────────────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────────────┐ │ REDLINE DECISION ENGINE │ │ ━━━━━━━━━━━━━━━━━━━━━━━━ │ │ │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ RULE 1: 
Temporal Conflict │ │ │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ │ │ Condition: Extracted date/time overlaps with existing Event Graph │ │ │ │ entry for same child + overlapping times │ │ │ │ Confidence gate: extraction_confidence ≥ 0.6 (already verified) │ │ │ │ │ │ │ │ Action: SPEAK — "⚠️ Leo has soccer at 4:30pm Thursday, but this │ │ │ │ message mentions a dentist appointment at 4:00pm. │ │ │ │ Which is correct?" │ │ │ └─────────────────────────────────────────────────────────────────────┘ │ │ │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ RULE 2: Resource Conflict │ │ │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ │ │ Condition: Both parents assigned to conflicting tasks at same time│ │ │ │ Confidence gate: extraction_confidence ≥ 0.6 (already verified) │ │ │ │ │ │ │ │ Action: SPEAK — "🚨 John and Sarah both said they're covering │ │ │ │ Tuesday 3pm pickup. Who's getting the kids?" │ │ │ └─────────────────────────────────────────────────────────────────────┘ │ │ │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ RULE 3: Missing Critical Variable │ │ │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ │ │ Condition: Coordination detected but 'assigned_to' or 'child' │ │ │ │ is null AND activity is time-sensitive (<24h) │ │ │ │ Confidence gate: extraction_confidence ≥ 0.6 (already verified) │ │ │ │ │ │ │ │ Action: SPEAK — "⏳ I see 'someone' needs to cover Thursday 3pm │ │ │ │ for Leo's early dismissal. Who's handling pickup?" 
│ │ │ └─────────────────────────────────────────────────────────────────────┘ │ │ │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ RULE 4: All Other Cases │ │ │ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │ │ │ Condition: No conflict, no missing critical variable │ │ │ │ │ │ │ │ Action: SILENT — Store extraction as 'pending' with [Confirm] │ │ │ │ button (no chat message) │ │ │ └─────────────────────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────────────────────┘ │ ┌─────────────────┴─────────────────┐ ▼ ▼ ┌─────────┐ ┌─────────────┐ │ SILENT │ │ SPEAK │ └────┬────┘ └──────┬──────┘ │ │ ▼ ▼ ┌──────────────────────────┐ ┌─────────────────────────────────────┐ │ STORE AS PENDING │ │ POST TO GROUP CHAT │ │ ━━━━━━━━━━━━━━━━━━━━━━━━ │ │ ━━━━━━━━━━━━━━━━━━━━━━━ │ │ │ │ │ │ Table: pending_actions │ │ Message includes: │ │ Status: 'unconfirmed' │ │ • Clear conflict description │ │ UI: Telegram inline │ │ • Suggested resolution │ │ [Confirm] button │ │ • [Resolve] inline button │ │ │ │ │ │ Human can review via: │ │ Example: │ │ • /pending command │ │ "⚠️ Conflict detected: │ │ • Dashboard │ │ Leo has soccer 4:30pm Tuesday, │ │ │ │ but message mentions chess club │ │ [Confirm] → │ │ at 4:00pm. Both?" │ │ writes to Event Graph │ │ │ └──────────────────────────┘ └─────────────────────────────────────┘ ``` --- ## Brevity Constraint (Cross-Reference Daedalus UX Specs) **Source:** Daedalus UX Design System — Conversational Agents v2.1 ### Maximum Message Lengths | Message Type | Max Length | Rationale | |--------------|------------|-----------| | Conflict alert | 280 chars | Fits in single Telegram bubble | | Clarification request | 200 chars | Quick to read, easy to answer | | Confirmation summary | 140 chars | Twitter-length, scannable | ### Brevity Patterns ```python BREVITY_TEMPLATES = { "temporal_conflict": "⚠️ {child} has {activity} at {time}, but message says {conflict}. 
Which is correct?", "resource_conflict": "🚨 Both {parent1} and {parent2} claim {task} on {day}. Who's covering?", "missing_assignment": "⏳ Someone needs to cover {child}'s {activity} on {day}. Who?", "missing_child": "⏳ {activity} mentioned for {day} — which child?", } ``` ### Anti-Patterns (Never Do) ``` ❌ "I noticed that in your message at 9:15 AM, you mentioned..." ❌ "Based on my analysis of the conversation history..." ❌ "It appears there may be a potential scheduling conflict..." ❌ Multi-paragraph explanations ✅ "⚠️ Leo has soccer 4:30pm Thursday. Message says dentist 4pm. Which?" ✅ "🚨 Both John and Sarah claim Tuesday pickup. Who's covering?" ✅ "⏳ Someone needs to cover Leo Thursday 3pm. Who?" ``` --- ## API Contract: Observer ↔ Brain ### 1. Brain Query Endpoint **Current:** `https://icarus-test.hoffdesk.com/brain/query?q={question}` **For Observer Integration:** Extend with structured query support ```http GET /brain/query?q={question}&context=observer&format=json Headers: X-Observer-Request: true X-Family-Context: miller # staging only Response (current): { "answer": "Leo's soccer practice is Tuesdays and Thursdays at 4:30 PM...", "sources": [...], "confidence": "high" } Response (extended for Observer): { "answer": "...", "sources": [...], "confidence": "high", "relevance_score": 0.769, # ← Used for brain_relevance gate "temporal_entities": [ {"type": "time", "value": "16:30", "context": "soccer practice"}, {"type": "day", "value": "Tuesday", "context": "recurring"} ], "extracted_events": [ { "summary": "Leo Soccer Practice", "start": "2026-05-05T16:30:00", "location": "Westside Park" } ] } ``` ### 2. Observer → Brain Query Patterns The Observer queries the Brain for context before making Redline decisions: | Observer Detected | Brain Query | Purpose | |-------------------|-------------|---------| | "Leo soccer Thursday" | "When is Leo's soccer practice?" | Verify against known schedule | | "Mia has ballet Monday" | "What days does Mia have ballet?" 
| Check for conflicts | | "pick up tomorrow 3pm" | "Who usually picks up the kids on [day]?" | Resource assignment history | | "early dismissal Friday" | "Any events on Friday involving school?" | Temporal conflict check | ### 3. Brain Query Constraints ```python # Observer-specific query limits OBSERVER_BRAIN_CONFIG = { "max_queries_per_message": 3, # Prevent query spam "query_timeout_ms": 5000, # Fail fast if Brain slow "cache_ttl_seconds": 60, # Cache recent queries "min_relevance_threshold": 0.5, # Gate threshold (logged as brain_relevance) "staging_only": True, # Enforce staging environment } ``` --- ## Data Flow: Complete Walkthrough ### Example 1: Conflict Detection (Speak) **Chat Message (Sarah → Group):** > "I'll pick up Leo from school tomorrow at 3pm and take him to soccer" **Step-by-Step:** 1. **Tripwire Match** (Tier 1) - Patterns: "pick up", "tomorrow", "3pm", "soccer", "Leo" - tripwire_confidence: 0.85 (≥ 0.7 threshold ✓) - Action: Queue for Tier 2 2. **Async Processing** (Tier 2) - Extract: `{"date": "2026-05-01", "time": "15:00", "child": "leo", "activity": "soccer", "assigned_to": "sarah"}` - extraction_confidence: 0.78 (≥ 0.6 threshold ✓) 3. **Brain Query** - Query: "What time is Leo's soccer practice?" - Response: "Leo's soccer practice is Tuesdays and Thursdays at 4:30 PM at Westside Park" - brain_relevance: 0.82 (≥ 0.5 threshold ✓, use context) 4. **Redline Check** - Message says: 3pm pickup, soccer implied - Brain says: Soccer is 4:30pm (1.5h later) - No direct conflict, BUT time gap is suspicious 5. **Decision: SPEAK** (Rule 3 variant — clarification needed) ``` "⏳ Leo pickup 3pm for soccer, but practice is 4:30pm. Different activity?" ``` --- ### Example 2: Silent Storage (No Conflict) **Chat Message (John → Group):** > "Got the oil changed today. Next due in July." **Step-by-Step:** 1. 
**Tripwire Match**
   - Patterns: "today", "July", "oil changed"
   - tripwire_confidence: 0.65 (< 0.7 threshold ✗)
   - Action: DROP — not coordination-related

---

### Example 3: Multi-Message Context Resolution (Silent)

**Message A (Sarah, 09:00):**
> "Who's picking up Leo Thursday after soccer?"

**Message B (John, 09:15):**
> "I'll do it"

**Step-by-Step:**

1. **Message A Tripwire**
   - tripwire_confidence: 0.88 (question about assignment)
   - Queued for Tier 2
   - Extraction: `{"date": "2026-05-07", "child": "leo", "activity": "soccer", "assigned_to": "unspecified"}`
   - Stored with thread_id

2. **Message B Tripwire (Context-Aware)**
   - Base score: 0.45 (just "I'll do it" — low confidence alone)
   - Context boost: +0.20 (short response in coordination thread)
   - Final tripwire_confidence: 0.65 (< 0.7 threshold...)
   - BUT: Attribution pattern detected
   - Override: Queue for Tier 2 with context window

3. **Tier 2 with Context**
   - Fetches Message A via thread_id
   - Attribution logic: John (sender of B) → assignment from A
   - Final extraction: `{"date": "2026-05-07", "child": "leo", "activity": "soccer", "assigned_to": "john", "attribution_confidence": 0.85}`

4. **Brain Query**
   - Query: "When is Leo's soccer practice?"
   - brain_relevance: 0.91 (confirms Thursday 4:30pm)

5. **Redline Check**
   - Assigned to: John
   - Brain context: No conflict detected
   - Decision: SILENT

6. **Store as Pending**
   - Type: `care_assignment`
   - Summary: "John assigned to pick up Leo from soccer Thursday 4:30pm"
   - [Confirm] button available

---

### Example 4: Resource Conflict (Speak)

**Chat Message 1 (John, 09:00):**
> "I can cover Tuesday pickup"

**Chat Message 2 (Sarah, 09:15):**
> "I'll get the kids Tuesday after work"

**Step-by-Step:**

1. **Tripwire Matches**
   - Both messages match assignment patterns
   - tripwire_confidence: 0.79, 0.82
   - Both queued for Tier 2

2.
**Async Processing**
   - Extract John: `{"date": "2026-05-05", "assigned_to": "john", "task": "pickup"}`
   - extraction_confidence: 0.74
   - Extract Sarah: `{"date": "2026-05-05", "assigned_to": "sarah", "task": "pickup"}`
   - extraction_confidence: 0.81

3. **Brain Query**
   - Query for both: "Who usually picks up kids on Tuesdays?"
   - brain_relevance: 0.76 (historical pattern: usually John)

4. **Redline Check**
   - Same date
   - Same task (pickup)
   - Different assignees
   - **RESOURCE CONFLICT** (Rule 2)

5. **Decision: SPEAK**
   ```
   "🚨 John and Sarah both claim Tuesday pickup. Who's covering?"
   [John] [Sarah] [Both — carpool]
   ```

---

## Database Schema

### `observer_messages` — Raw Chat Log

```sql
CREATE TABLE observer_messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    message_id TEXT UNIQUE NOT NULL,        -- Telegram message ID
    chat_id TEXT NOT NULL,                  -- Group chat ID
    thread_id TEXT,                         -- Links messages in same conversation
    sender TEXT NOT NULL,                   -- john | sarah | icarus
    message_text TEXT NOT NULL,
    sent_at TIMESTAMP NOT NULL,
    context_window TEXT,                    -- JSON array of prior message_ids

    -- Tier 1 Tripwire (Independent confidence)
    tripwire_confidence REAL,               -- 0.0-1.0, threshold 0.7
    tripwire_patterns TEXT,                 -- JSON array of matched patterns
    tier1_decision TEXT,                    -- DROP | QUEUE

    -- Tier 2 Processing
    processed_at TIMESTAMP,
    extracted_json TEXT,                    -- Full extraction JSON
    attribution_source TEXT,                -- message_id this was resolved from

    -- Brain Query (Independent relevance score)
    brain_queries TEXT,                     -- JSON array of Brain queries made
    brain_context TEXT,                     -- Brain responses
    brain_relevance REAL,                   -- 0.0-1.0, threshold 0.5

    -- Redline Decision (Rule-based, no confidence score)
    redline_decision TEXT,                  -- SILENT | SPEAK
    redline_trigger_rule TEXT,              -- temporal_conflict | resource_conflict | missing_variable

    -- Outcome
    action_id TEXT,                         -- FK to pending_actions (if SILENT)
    notification_id TEXT,                   -- Telegram message ID (if SPEAK)

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_observer_thread ON observer_messages(thread_id, sent_at);
CREATE INDEX idx_observer_tripwire ON observer_messages(tripwire_confidence, tier1_decision);
CREATE INDEX idx_observer_decision ON observer_messages(redline_decision, processed_at);
```

### `observer_confidence_log` — Debug Trail

```sql
CREATE TABLE observer_confidence_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    message_id TEXT NOT NULL,
    stage TEXT NOT NULL,            -- 'tripwire' | 'extraction' | 'brain'
    confidence_type TEXT,           -- 'regex_score' | 'llm_score' | 'relevance'
    score REAL,                     -- NULL when the stage produced no score (e.g., Brain timeout)
    threshold REAL NOT NULL,
    passed BOOLEAN NOT NULL,
    decision TEXT NOT NULL,         -- 'proceed' | 'drop' | 'proceed_no_context'
    logged_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_confidence_msg ON observer_confidence_log(message_id, stage);
```

### `pending_actions` — Human-in-the-Loop Queue

```sql
CREATE TABLE pending_actions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    action_id TEXT UNIQUE NOT NULL,         -- UUID for this action

    -- Source
    source_type TEXT NOT NULL,              -- OBSERVER | EMAIL_MANUAL | etc
    source_id TEXT,                         -- FK to observer_messages or email_id

    -- Extracted State (what the system thinks it heard)
    action_type TEXT NOT NULL,              -- maintenance_update | event_create |
                                            -- care_assignment | schedule_change
    extracted_json TEXT NOT NULL,           -- Full extraction
    extraction_confidence REAL NOT NULL,    -- Independent confidence (not blended)

    -- Human Review State
    status TEXT NOT NULL DEFAULT 'pending', -- pending | confirmed | rejected | expired
    suggested_event_json TEXT,              -- What would be written to Event Graph

    -- UI/UX (Brevity constraint enforced)
    summary_text TEXT NOT NULL,             -- Human-readable summary (max 140 chars)
    confirm_payload TEXT,                   -- JSON to send on confirm

    -- Expiration
    expires_at TIMESTAMP,                   -- Auto-expire if not confirmed
    confirmed_at TIMESTAMP,
    confirmed_by TEXT,                      -- john | sarah

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_pending_status ON pending_actions(status, expires_at);
CREATE INDEX idx_pending_source ON pending_actions(source_type, source_id);
```

---

## Error Handling & Fallback Paths

### 1. Brain Query Timeout

```python
# If Brain doesn't respond in 5s
if brain_response.status == "timeout":
    # Log independent brain_relevance as NULL
    log_confidence(message_id, stage="brain", score=None, threshold=0.5,
                   passed=False, decision="proceed_no_context")

    # Degrade gracefully: proceed without Brain context
    extraction["brain_context"] = None
    extraction["brain_relevance"] = None

    # Decision: Evaluate redline with extraction-only (lower threshold)
    redline_decision = evaluate_redline(extraction, has_brain_context=False)
    reason = "brain_timeout_degraded"
```

### 2. Brain Returns Low Relevance

```python
# If top score < 0.5
if brain_response.relevance_score < 0.5:
    # Log independent brain_relevance
    log_confidence(message_id, stage="brain", score=brain_response.relevance_score,
                   threshold=0.5, passed=False, decision="proceed_no_context")

    # No useful context found
    extraction["brain_context"] = None

    # Decision: Depends on extraction_confidence (independent gate)
    if extraction["extraction_confidence"] > 0.8:
        redline_decision = evaluate_redline(extraction, has_brain_context=False)
    else:
        redline_decision = "SILENT"
        reason = "low_confidence_no_brain"
```

### 3. Tier 2 Worker Crash

```python
# Celery retry with exponential backoff
@celery.task(bind=True, max_retries=3)
def process_tier2(self, message_id):
    try:
        ...  # processing logic
    except Exception as e:
        # After 3 failed retries: mark for manual review instead of retrying.
        # This check must come before self.retry(), which raises.
        if self.request.retries >= self.max_retries:
            mark_for_manual_review(message_id, error=str(e))
            return
        # Retry in 30s, 60s, 120s (delay doubles on each attempt)
        raise self.retry(exc=e, countdown=30 * (2 ** self.request.retries))
```

### 4. Duplicate Detection

```python
import json

# Prevent duplicate pending actions for same event
def check_duplicate(extraction):
    similar = db.query("""
        SELECT * FROM pending_actions
        WHERE action_type = ?
        AND json_extract(extracted_json, '$.dates') = ?
        AND json_extract(extracted_json, '$.child') = ?
        AND status = 'pending'
        AND created_at > datetime('now', '-1 hour')
    """, [
        extraction["type"],
        # json_extract returns an array as minified JSON text, so bind the
        # dates list in that form rather than as a Python list
        json.dumps(extraction["dates"], separators=(",", ":")),
        extraction["child"],
    ])

    if similar:
        # Merge or skip
        return "duplicate_detected"
    return None
```

---

## Staging Environment Configuration

### File: `.env.staging.observer`

```bash
# Environment
ENV=staging
DB_PATH=./data/staging.db
FAMILY_CONTEXT=./TEST_FAMILY_CONTEXT.md

# Telegram (Test Bot)
TELEGRAM_BOT_TOKEN=${TELEGRAM_TEST_BOT_TOKEN}
TELEGRAM_GROUP_CHAT_ID=${TELEGRAM_TEST_GROUP_ID}

# Brain Integration
BRAIN_BASE_URL=https://icarus-test.hoffdesk.com
BRAIN_QUERY_ENDPOINT=/brain/query
BRAIN_TIMEOUT_MS=5000
BRAIN_MIN_RELEVANCE=0.5  # Gate threshold for brain_relevance

# LLM (Gaming PC via Tailscale)
OLLAMA_HOST=http://matt-pc.tail864e81.ts.net:11434
OLLAMA_MODEL_TIER2=phi4:14b

# Tripwire
TRIPWIRE_THRESHOLD=0.7  # Independent gate threshold
EXTRACTION_THRESHOLD=0.6  # Independent gate threshold
TRIPWIRE_PATTERNS_PATH=./config/tripwire_patterns.json

# Multi-Message Context
CONTEXT_WINDOW_SIZE=3
CONTEXT_TTL_SECONDS=300
ATTRIBUTION_WINDOW_SECONDS=600

# Async Processing
REDIS_URL=redis://localhost:6379/0
CELERY_WORKERS=2

# Redline Rules
REDLINE_TEMPORAL_CONFLICT=true
REDLINE_RESOURCE_CONFLICT=true
REDLINE_MISSING_VARIABLE=true
SPEAK_ON_MISSING_CRITICAL=true

# Limits
MAX_BRAIN_QUERIES_PER_MESSAGE=3
PENDING_ACTION_TTL_HOURS=48
MAX_MESSAGES_PER_MINUTE=60

# Brevity Constraint (Daedalus UX Spec)
MAX_CONFLICT_MESSAGE_LENGTH=280
MAX_CLARIFICATION_LENGTH=200
MAX_SUMMARY_LENGTH=140
```

### Miller Family Test Context

**Test Data Pre-Seeded:**

- Leo's soccer: Tuesdays/Thursdays 4:30 PM @ Westside Park
- Mia's ballet: Mondays/Wednesdays 4:00 PM @ Madison Dance Academy
- Mia's swim: Saturdays 10:00 AM @ YMCA
- Leo's chess: Wednesdays 3:30 PM
- Spring break: March 24-28, 2026
- Honda Civic service: Due April 15, 2026

---

## Implementation Phases

### Week 0.5: Data Collection & Labeling (NEW — Critical)

**Before any implementation:**

- [ ] Record 100 real messages from Family Logistics
group
- [ ] Hand-label for coordination signals:
  - `is_coordination`: true/false
  - `coordination_type`: transport | care_coverage | schedule_change | activity_confirm | none
  - `attribution_required`: true/false (for multi-message context)
- [ ] Extract noisy data examples:
  - Typos: "socer", "pratice", "balet"
  - Abbreviations: "wk" (week), "2moro", "thx"
  - Emoji: "👍", "🙋", "✅"
  - Half-sentences: "yeah 3pm", "me", "i'll do it"
  - Mixed signals: "Leo soccer thursday??" (question vs statement)
- [ ] Tune tripwire regex against noisy data
- [ ] Establish baseline precision/recall before code is written

**Noisy Data Test Cases (TC-006 to TC-020):**

| TC | Message | Expected | Notes |
|----|---------|----------|-------|
| TC-006 | "socer tmrw 4" | Queue | Typo: "socer", abbreviation: "tmrw" |
| TC-007 | "👍" (reply to coordination) | Queue | Emoji-only in context |
| TC-008 | "yeah 3pm" | Queue | Half-sentence, needs context |
| TC-009 | "i'll do it" | Queue | Pronoun reference, needs attribution |
| TC-010 | "wk pickup" | Queue | Abbreviation: "wk" |
| TC-011 | "pratice is cancled" | Queue | Multiple typos |
| TC-012 | "thx!" | Drop | Gratitude, not coordination |
| TC-013 | "Mia has bailt monday" | Queue | Typo in activity name |
| TC-014 | "u sure?" | Depends | Question, usually drop unless context says otherwise |
| TC-015 | "4:30 work for me" | Queue | Time-only, needs context |
| TC-016 | "🙋‍♀️" | Queue | Self-assignment via emoji |
| TC-017 | "kk" | Depends | Acknowledgment, context-dependent |
| TC-018 | "Leo dentist 2mrw??" | Queue | Question mark + abbreviation |
| TC-019 | "n/m" | Drop | "Never mind" — cancels previous |
| TC-020 | "omw" | Queue | "On my way" — active coordination |

---

### Phase 8.1: Tripwire + Logger (Week 1)

- [ ] Implement Tier 1 Tripwire (regex) with independent confidence scoring
- [ ] Create `observer_messages` table with thread support
- [ ] Create `observer_confidence_log` table for debugging
- [ ] Log all messages, mark 95% as DROP
- [ ] Deploy to staging, run for 3 days
- [ ] Measure: false positive rate against Week 0.5 labeled data (target <5%)

---

### Phase 8.2: Async Tier 2 (Week 2)

- [ ] Implement Celery worker for Tier 2
- [ ] Integration with Brain query endpoint (independent relevance scoring)
- [ ] Create `pending_actions` table
- [ ] Implement [Confirm] button flow
- [ ] Deploy to staging, run for 3 days

---

### Phase 8.3: Redline Rules + Multi-Message Context (Week 3) — P0 Priority

- [ ] Implement temporal conflict detection (Rule 1)
- [ ] Implement resource conflict detection (Rule 2)
- [ ] Implement missing variable detection (Rule 3)
- [ ] Add multi-message context window (3 messages)
- [ ] Implement attribution logic for pronoun resolution
- [ ] Add SPEAK pathway for conflicts
- [ ] Enforce brevity constraints (Daedalus UX spec)
- [ ] Deploy to staging with Miller family

**Multi-Message Test Scenarios:**

| Scenario | Messages | Expected | Confidence |
|----------|----------|----------|------------|
| TC-021 | A: "Who's getting Leo?" / B: "me" | Attribution → John | extraction_confidence ≥ 0.75 |
| TC-022 | A: "Soccer pickup Thursday?" / B: "I'll do it" / C: "thx" | Attribution + silent | C drops |
| TC-023 | A: "Mia ballet Monday" / B: "I'll cover" / C: "👍" | Attribution + silent | C is emoji acknowledgment |
| TC-024 | A: "Both kids Tuesday" / B: "I got Leo" / C: "I got Mia" | Double attribution + silent | Unless times conflict |

---

### Phase 8.4: Integration Test (Week 4)

- [ ] Run full test scenarios (TC-001 to TC-024)
- [ ] Measure: independent confidence gate accuracy
- [ ] Measure: multi-message attribution accuracy
- [ ] Measure: noisy data handling
- [ ] Measure: family comfort metrics (see Success Metrics below)
- [ ] Adjust thresholds based on results
- [ ] Document learnings for production

---

## Success Metrics

### Primary Metric: Family Comfort (Non-Negotiable)

**UX Acceptance Gate:**
> "Aundrea has never asked 'why did the bot just say that?' for 7 consecutive days"

This is the only metric that matters. Technical precision/recall are secondary to human experience.

### Secondary Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| Tripwire Precision | >95% | False positive rate on labeled data (Week 0.5) |
| Tripwire Recall | >90% | % of coordination messages caught |
| Extraction Confidence Accuracy | >85% | Correlation between extraction_confidence and human-verified correctness |
| Brain Relevance Accuracy | >80% | Correlation between brain_relevance and context usefulness |
| Multi-Message Attribution | >85% | Correct "I'll do it" → prior action resolution |
| Brain Query Latency | <3s | Average response time |
| Brain Query Success | >95% | % of queries returning valid response |
| Redline Accuracy | >90% | Correct SPEAK vs SILENT decisions (measured by human review) |
| Pending Confirmation Rate | >70% | % of silent extractions eventually confirmed |
| Chat Interruptions | <2/day | SPEAK messages per day (excluding test scenarios) |
| Aundrea Confusion Events | 0 in 7 days | Number of times primary user questions bot behavior |

### Anti-Metric (What
NOT to Optimize) - **Don't optimize for:** Message extraction volume - **Don't optimize for:** Brain query coverage - **Don't optimize for:** Total features detected **Optimize for:** Family comfort. Full stop. --- ## Security & Privacy ### Data Handling - **No message storage beyond extraction:** Raw chat messages purged after 30 days - **No PII in Brain queries:** Queries are abstract ("soccer schedule" not "where is Leo") - **Local-first:** All processing on Beelink/Gaming PC, no cloud LLM for chat ### Telegram Privacy - **Privacy Mode:** Disabled for test bot (BotFather → Group Privacy → Off) - **Bot can see:** All group messages - **Bot cannot:** See direct messages unless replied ### Staging Isolation - **Database:** staging.db completely separate from prod.db - **Bot:** Separate test bot token - **Family:** Miller family test data only - **No production data:** Observer will NEVER see real Hoffmann family chat --- ## Open Questions 1. **Conflict Window:** How close is "conflict"? Same day? ±2 hours? → **Resolved:** Same day for now; temporal overlap detection TBD 2. **Override:** Can a parent "force" a silent extraction to speak? (e.g., urgent) → **Open:** Not for Phase 8 3. **Learning:** Should Observer learn from confirmed/rejected actions? (Phase 8.5?) → **Open:** Post-family-adoption 4. **Multi-message Context:** Should attribution expire? 
→ **Resolved:** 10-minute window --- ## Appendix A: Tripwire Pattern Reference ```json { "tripwire_patterns": { "temporal": [ "\\b(tomorrow|today|next week|this week)\\b", "\\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \\d{1,2}\\b", "\\b(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\\b", "\\b\\d{1,2}:\\d{2}\\b", "\\b\\d{1,2}(\\s)?(am|pm)\\b" ], "assignment": [ "\\b(I('ll| will)|can) (pick up|get|cover|take|bring)\\b", "\\b(who('s| is) (getting|picking up|covering))\\b", "\\b(cover( for)?|swap|switch)\\b", "\\b(me|me too|i'll do it|i got it)\\b" ], "pronoun_attribution": [ "\\b(i'll|i will|i can|i'm|i am|me|me too)\\b" ], "children": [ "\\b(Leo|Mia)\\b", "\\b(kids?|children)\\b", "\\b(son|daughter)\\b" ], "activities": [ "\\b(soccer|ballet|swim|chess|practice|lesson|class)\\b", "\\b(game|match|meet|recital|performance)\\b" ], "conflict_markers": [ "\\b(but|wait|hold on|doesn't|didn't|conflict|overlap)\\b", "\\b(what about|how about)\\b" ], "noise_patterns": [ "\\b(thx|thanks|ty|👍|✅|🙋|kk|omw)\\b" ] } } ``` --- ## Appendix B: Mermaid Diagram (Simplified) ```mermaid flowchart TD A[Family Chat Message] --> B{Tier 1 Tripwire} B -->|tripwire_confidence < 0.7| C[Drop Silently] B -->|tripwire_confidence >= 0.7| D[Queue for Tier 2] D --> E[Async Tier 2 Worker] E --> F[Extract Coordination State] F -->|extraction_confidence < 0.6| C F -->|extraction_confidence >= 0.6| G[Query Brain for Context] G --> H{brain_relevance >= 0.5?} H -->|Yes| I[Redline with Brain Context] H -->|No| J[Redline Extraction-Only] I --> K{Redline Decision} J --> K K -->|No Conflict| L[Store as Pending] K -->|Conflict Detected| M[Speak in Chat] L --> N[[Confirm Button]] N -->|Confirmed| O[Write to Event Graph] N -->|Rejected| P[Discard] M --> Q[[Resolve Button]] Q --> R[Update Event Graph] ``` --- _The conversation is the interface. The brain provides memory. The observer knows when to speak. The family never asks "why did the bot just say that?"_
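
**Illustrative sketch (not part of the spec):** The gate sequence in the Appendix B flowchart can be expressed as a small routing function. This is a minimal sketch that assumes all three scores are already available as plain floats; in the described pipeline, later scores would only be computed after earlier gates pass. The names `route`, `GateResult`, and the outcome strings are hypothetical and chosen for illustration only. The thresholds (0.7, 0.6, 0.5) come from the flowchart.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds as stated in the Appendix B flowchart.
TRIPWIRE_MIN = 0.7
EXTRACTION_MIN = 0.6
BRAIN_MIN = 0.5


@dataclass(frozen=True)
class GateResult:
    """Hypothetical result type: which path the message takes and why."""
    outcome: str               # "DROP", "REDLINE_WITH_BRAIN", or "REDLINE_EXTRACTION_ONLY"
    stopped_at: Optional[str]  # name of the gate that dropped the message, if any


def route(tripwire: float, extraction: float, brain: float) -> GateResult:
    """Apply each gate in order. Scores are compared to their own
    threshold only; they are never multiplied or blended."""
    if tripwire < TRIPWIRE_MIN:
        return GateResult("DROP", "tripwire")
    if extraction < EXTRACTION_MIN:
        return GateResult("DROP", "extraction")
    # A low Brain relevance does not drop the message; it only
    # removes Brain context from the redline decision.
    if brain >= BRAIN_MIN:
        return GateResult("REDLINE_WITH_BRAIN", None)
    return GateResult("REDLINE_EXTRACTION_ONLY", None)


if __name__ == "__main__":
    print(route(0.9, 0.8, 0.2))
    print(route(0.65, 0.99, 0.99))
```

A message scoring exactly 0.7 / 0.6 / 0.5 passes every gate here, whereas a blended product of those scores would be 0.21. That difference is the practical reason for keeping the gates independent.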