Silent Observer — Brain Intelligence Integration Spec
Status: Draft v1.1 (Director Revisions)
Date: 2026-04-30
Author: Wadsworth (Chief of Staff)
Owner: Socrates (Backend Architecture)
Scope: Phase 8 Integration Layer
Executive Summary
This document specifies the integration architecture between the Silent Observer (Phase 8 — ambient chat monitoring) and the Brain Intelligence (Phase 6/7 — RAG-based knowledge retrieval). Together, they enable zero-UI household coordination where:
- The Observer listens to family chat without interrupting
- The Brain provides memory context for decisions
- Icarus speaks only when there's a real conflict or missing critical variable
Non-Negotiable Constraints (Director-Level):
1. State Protection (No Auto-Writes): All extractions require Human-in-the-Loop [Confirm]
2. Asynchronous Processing: Tier 1 releases chat thread immediately; Tier 2 is async
3. The 'Redline' Speak Rule: Silence is default; speak only on temporal/resource conflicts or a missing critical variable
4. Staging Environment Only: Connects only to staging.db with Miller Family context
Confidence Field Unification
CRITICAL: Confidence scores are NEVER multiplied or blended. Each stage has independent gates with clear thresholds.
Stage Gates (Sequential, Not Multiplicative)
┌─────────────────────────────────────────────────────────────────────────────┐
│ CONFIDENCE GATE ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Stage 1: Tripwire Confidence │
│ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ • Calculated by: Regex pattern matcher (Tier 1) │
│ • Threshold: ≥0.7 to proceed to extraction │
│ • Logged field: tripwire_confidence (independent, 0.0-1.0) │
│ • If <0.7: DROP — message not queued for Tier 2 │
│ │
│ ↓ tripwire_confidence ≥ 0.7 │
│ │
│ Stage 2: Extraction Confidence │
│ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ • Calculated by: LLM extraction model (Tier 2) │
│ • Threshold: ≥0.6 for usable extraction │
│ • Logged field: extraction_confidence (independent, 0.0-1.0) │
│ • If <0.6: DROP — low-quality extraction, don't query Brain │
│ │
│ ↓ extraction_confidence ≥ 0.6 │
│ │
│ Stage 3: Brain Relevance Score │
│ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ • Calculated by: Brain RAG retriever │
│ • Threshold: ≥0.5 to use Brain context in redline decision │
│ • Logged field: brain_relevance (independent, 0.0-1.0) │
│ • If <0.5: Proceed without Brain context (extraction-only redline) │
│ │
│ ↓ brain_relevance ≥ 0.5 (optional) │
│ │
│ Stage 4: Redline Decision │
│ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ • Calculated by: Rule-based conflict engine (no ML) │
│ • Output: SILENT | SPEAK │
│ • Logged field: redline_decision, redline_trigger_rule │
│ • Does NOT use confidence scores — uses extracted entities only │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Confidence Logging Schema
-- Each confidence is logged independently for debugging
CREATE TABLE observer_confidence_log (
message_id TEXT NOT NULL,
stage TEXT NOT NULL, -- 'tripwire' | 'extraction' | 'brain'
confidence_type TEXT, -- 'regex_score' | 'llm_score' | 'relevance'
score REAL NOT NULL,
threshold REAL NOT NULL,
passed BOOLEAN NOT NULL,
decision TEXT NOT NULL, -- 'proceed' | 'drop' | 'proceed_no_context'
logged_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
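A minimal sketch of how one gate evaluation could be written to this table, assuming SQLite (the spec names staging.db) and Python's built-in sqlite3 module. The helper name `log_gate`, its signature, and the in-memory database are illustrative assumptions, not part of the spec.

```python
import sqlite3

SCHEMA = """
CREATE TABLE observer_confidence_log (
    message_id TEXT NOT NULL,
    stage TEXT NOT NULL,
    confidence_type TEXT,
    score REAL NOT NULL,
    threshold REAL NOT NULL,
    passed BOOLEAN NOT NULL,
    decision TEXT NOT NULL,
    logged_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"""

def log_gate(conn, message_id, stage, confidence_type, score, threshold,
             fail_decision="drop"):
    """Record one gate evaluation and return True if the gate passed."""
    passed = score >= threshold
    decision = "proceed" if passed else fail_decision
    conn.execute(
        "INSERT INTO observer_confidence_log "
        "(message_id, stage, confidence_type, score, threshold, passed, decision) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (message_id, stage, confidence_type, score, threshold, int(passed), decision),
    )
    return passed

conn = sqlite3.connect(":memory:")  # in-memory DB for illustration only
conn.executescript(SCHEMA)
log_gate(conn, "msg-1", "tripwire", "regex_score", 0.85, 0.7)
log_gate(conn, "msg-1", "extraction", "llm_score", 0.72, 0.6)
# A failed Brain gate does not drop the message; it proceeds without context.
log_gate(conn, "msg-1", "brain", "relevance", 0.41, 0.5,
         fail_decision="proceed_no_context")
rows = conn.execute(
    "SELECT stage, passed, decision FROM observer_confidence_log ORDER BY rowid"
).fetchall()
```

Each row records its own score and threshold, so a dropped message can be traced to the exact gate that rejected it.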
Anti-Pattern: NEVER Do This
# WRONG: blending scores creates a debugging nightmare
final_confidence = tripwire_score * extraction_score * brain_score
if final_confidence > 0.5:  # What failed? Who knows.
    ...
# CORRECT: each gate is independent, with a clear failure point
if tripwire_score < 0.7:
    log("TRIPWIRE_REJECT", score=tripwire_score, threshold=0.7)
    return DROP
if extraction_score < 0.6:
    log("EXTRACTION_REJECT", score=extraction_score, threshold=0.6)
    return DROP
# Brain relevance is optional: proceed with or without it
use_brain = brain_relevance >= 0.5
Multi-Message Context Architecture
Status: P0 Priority — Moved to Phase 8.3 (was deferred)
Why: 60% of real coordination happens across multiple messages. Single-message extraction misses critical context.
The Problem
Message A (09:00): "Who's picking up Leo Thursday after soccer?"
Message B (09:15): "I'll do it"
↑
Who is "I"? What activity? What time?
Without Message A, Message B is unresolvable.
Context Window Specification
MULTI_MESSAGE_CONFIG = {
    "context_window_size": 3,     # Last 3 messages in thread
    "context_ttl_seconds": 300,   # Messages expire after 5 minutes
    "attribution_window": 600,    # Look back 10 min for matching context
    "max_participants": 4,        # Family group size limit
}
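A minimal sketch of how the window could be assembled from these settings. The function name `build_context_window`, the message dict shape (`message_id`, `sent_at`), and the choice to measure the TTL against the current time are assumptions; the spec does not define how the TTL and the attribution window interact.

```python
from datetime import datetime, timedelta

MULTI_MESSAGE_CONFIG = {
    "context_window_size": 3,
    "context_ttl_seconds": 300,
}

def build_context_window(thread_messages, now, config=MULTI_MESSAGE_CONFIG):
    """Return the most recent unexpired messages, oldest first."""
    cutoff = now - timedelta(seconds=config["context_ttl_seconds"])
    fresh = [m for m in thread_messages if m["sent_at"] >= cutoff]
    fresh.sort(key=lambda m: m["sent_at"])
    # Keep only the newest N messages
    return fresh[-config["context_window_size"]:]

now = datetime(2026, 4, 30, 9, 10)
thread = [
    {"message_id": "m1", "sent_at": datetime(2026, 4, 30, 9, 0)},  # older than TTL
    {"message_id": "m2", "sent_at": datetime(2026, 4, 30, 9, 6)},
    {"message_id": "m3", "sent_at": datetime(2026, 4, 30, 9, 7)},
    {"message_id": "m4", "sent_at": datetime(2026, 4, 30, 9, 8)},
    {"message_id": "m5", "sent_at": datetime(2026, 4, 30, 9, 9)},
]
window = build_context_window(thread, now)
```

Returning the window oldest-first means a caller that iterates it in reverse visits the newest message first.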
Attribution Logic
import re

# Self-reference phrases that signal the sender is volunteering
SELF_REFERENCE = re.compile(r"\b(i'll|i will|i can|i'm|i am|me)\b", re.IGNORECASE)

def resolve_attribution(current_message, context_window):
    """
    Match "I'll do it" to the action in prior messages.
    Returns: the extraction dict, with assigned_to resolved when a match is found
    """
    # Step 1: Extract from current message
    current = extract(current_message)
    if current.get("assigned_to") not in (None, "unspecified"):
        # Already has assignment (e.g., "John will pick up")
        return current
    # Step 2: Look for self-references in current message.
    # Word-boundary matching avoids false hits such as "me" inside "time".
    if not SELF_REFERENCE.search(current_message.text):
        # Not an assignment message
        return current
    # Step 3: Search context window (newest first) for matching coordination
    for prior in reversed(context_window):
        prior_extracted = prior.get("extraction", {})
        # Match criteria:
        # 1. Prior has missing assignment (who's/who is/can someone)
        # 2. Same temporal scope (date overlap)
        # 3. Same activity or child mentioned
        if (
            prior_extracted.get("assigned_to") == "unspecified" and
            dates_overlap(current.get("dates"), prior_extracted.get("dates")) and
            (activities_match(current.get("activity"), prior_extracted.get("activity")) or
             children_match(current.get("child"), prior_extracted.get("child")))
        ):
            # Attribution found
            current["assigned_to"] = current_message.sender  # "I" = sender of Message B
            current["attributed_to_message_id"] = prior["message_id"]
            current["attribution_confidence"] = 0.85
            return current
    # No match found: extraction incomplete
    current["assigned_to"] = "unspecified"
    current["attribution_confidence"] = 0.3
    return current
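The attribution logic calls `dates_overlap`, `activities_match`, and `children_match` without defining them. The sketch below is one possible reading, not the spec's definition. It assumes that a short reply such as "I'll do it" carries no dates, activity, or child of its own and inherits them from the prior message, so a missing value on the current side counts as a match. The private helper `_same_or_inherited` is hypothetical.

```python
def dates_overlap(current_dates, prior_dates):
    """True if the two date lists share a date.

    Assumption: when either side has no dates, the reply is treated as
    referring to the prior message, so the check passes.
    """
    if not current_dates or not prior_dates:
        return True
    return bool(set(current_dates) & set(prior_dates))

def _same_or_inherited(current_value, prior_value):
    """A missing current value is treated as inherited from the prior message."""
    if prior_value is None:
        return False  # nothing to match against
    if current_value is None:
        return True   # inherited
    return str(current_value).lower() == str(prior_value).lower()

def activities_match(current_activity, prior_activity):
    return _same_or_inherited(current_activity, prior_activity)

def children_match(current_child, prior_child):
    return _same_or_inherited(current_child, prior_child)
```

Under these assumptions, "me" sent after "Who's getting Leo?" matches on the child even though the reply names neither a child nor a date.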
Context-Aware Tripwire
# Tripwire now operates on message thread, not just single message
def tripwire_with_context(message, context_window):
    # Score current message alone
    base_score = pattern_match_score(message.text)
    # Boost score if context suggests coordination thread
    if is_coordination_thread(context_window):
        # Prior messages contained questions about assignments
        base_score = min(1.0, base_score + 0.15)
    # Boost if current message is short response in context thread
    if (
        len(message.text.split()) <= 5 and  # "I'll do it", "yeah me", "sure"
        is_coordination_thread(context_window)
    ):
        base_score = min(1.0, base_score + 0.20)
    return base_score
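`is_coordination_thread` is not defined in this spec. A minimal sketch, assuming each context-window entry carries the `extraction` dict used by the attribution logic and that a thread counts as a coordination thread while any prior message is an unassigned coordination request:

```python
def is_coordination_thread(context_window):
    """True if any prior message is an open (unassigned) coordination request."""
    for prior in context_window:
        extraction = prior.get("extraction") or {}
        if (extraction.get("is_coordination")
                and extraction.get("assigned_to") == "unspecified"):
            return True
    return False
```

Messages that were dropped at Tier 1 have no extraction and therefore never mark a thread as coordination under this reading.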
Database Schema Updates
-- Track message threading for multi-message context
ALTER TABLE observer_messages ADD COLUMN thread_id TEXT;
ALTER TABLE observer_messages ADD COLUMN context_window TEXT; -- JSON array of prior message_ids
ALTER TABLE observer_messages ADD COLUMN attribution_source TEXT; -- message_id this was resolved from
-- Index for thread lookups
CREATE INDEX idx_observer_thread ON observer_messages(thread_id, sent_at);
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ FAMILY CHAT (Telegram Group) │
│ Members: John, Sarah, Icarus (bot) │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ TIER 1: TRIPWIRE (Python/Regex) │
│ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ • Latency: <10ms │
│ • Resource: CPU only (Beelink) │
│ • Pattern match: dates, times, coordination keywords │
│ • Context-aware: Scans last 3 messages for thread continuity │
│ • Immediate release of chat thread │
│ │
│ Pattern Categories: │
│ ├── Temporal: "tomorrow", "June 4th", "next week", "Tuesday 3pm" │
│ ├── Assignment: "I'll pick up", "can you cover", "who's getting" │
│ ├── Pronoun Resolution: "I'll do it", "me too", "yeah" (needs context) │
│ ├── Children: "Leo", "Mia", "kids", "the children" │
│ ├── Activities: "soccer", "ballet", "swim", "chess", "practice" │
│ └── Conflict markers: "but", "wait", "doesn't", "conflict", "overlap" │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌─────────────────┴─────────────────┐
▼ ▼
NO MATCH (≥95%) MATCH (≤5%)
┌─────────────────────┐ ┌─────────────────────┐
│ Drop Silently │ │ Queue for Tier 2 │
│ Log to observer_ │ │ (Redis/Queue) │
│ messages table │ │ Return 200 OK to │
│ (no processing) │ │ Telegram │
└─────────────────────┘ └─────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ TIER 2: ASYNC PROCESSING (8B LLM + Brain Query) │
│ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ • Worker: Celery/Background task (Beelink) │
│ • LLM: phi4:14b or llama3.1:8b via Gaming PC (Tailscale) │
│ • Latency: 1-3s (acceptable — async) │
│ • Brain Query: HTTPS to icarus-test.hoffdesk.com/brain/query │
│ • Context: Fetches last 3 messages for multi-message resolution │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ COORDINATION STATE EXTRACTOR │
│ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ │
│ Input: Chat message + Context window (last 3 messages) + Brain context │
│ Output: Structured extraction with independent confidence scores │
│ │
│ { │
│ "is_coordination": true | false, │
│ "coordination_type": "transport" | "care_coverage" | │
│ "schedule_change" | "activity_confirm" | null, │
│ "dates": ["2026-06-04", "2026-06-05"], │
│ "times": ["15:30", "16:00"], │
│ "assigned_to": ["john" | "sarah" | "unspecified"], │
│ "child": ["leo" | "mia" | "both" | null], │
│ "activity": "soccer" | "ballet" | "swim" | "chess" | null, │
│ "location": "Westside Park" | "Madison Dance Academy" | null, │
│ "action_required": "confirm" | "resolve_conflict" | "notify" | null, │
│ "extracted_entities": [...], │
│ "attribution": { │
│ "source_message_id": "...", │
│ "attribution_confidence": 0.85 │
│ }, │
│ "brain_query_context": {...} | null, │
│ "confidence_scores": { │
│ "tripwire_confidence": 0.85, │
│ "extraction_confidence": 0.72, │
│ "brain_relevance": 0.68 │
│ } │
│ } │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ REDLINE DECISION ENGINE │
│ ━━━━━━━━━━━━━━━━━━━━━━━━ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ RULE 1: Temporal Conflict │ │
│ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │
│ │ Condition: Extracted date/time overlaps with existing Event Graph │ │
│ │ entry for same child + overlapping times │ │
│ │ Confidence gate: extraction_confidence ≥ 0.6 (already verified) │ │
│ │ │ │
│ │ Action: SPEAK — "⚠️ Leo has soccer at 4:30pm Thursday, but this │ │
│ │ message mentions a dentist appointment at 4:00pm. │ │
│ │ Which is correct?" │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ RULE 2: Resource Conflict │ │
│ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │
│ │ Condition: Both parents assigned to conflicting tasks at same time│ │
│ │ Confidence gate: extraction_confidence ≥ 0.6 (already verified) │ │
│ │ │ │
│ │ Action: SPEAK — "🚨 John and Sarah both said they're covering │ │
│ │ Tuesday 3pm pickup. Who's getting the kids?" │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ RULE 3: Missing Critical Variable │ │
│ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │
│ │ Condition: Coordination detected but 'assigned_to' or 'child' │ │
│ │ is null AND activity is time-sensitive (<24h) │ │
│ │ Confidence gate: extraction_confidence ≥ 0.6 (already verified) │ │
│ │ │ │
│ │ Action: SPEAK — "⏳ I see 'someone' needs to cover Thursday 3pm │ │
│ │ for Leo's early dismissal. Who's handling pickup?" │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ RULE 4: All Other Cases │ │
│ │ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │ │
│ │ Condition: No conflict, no missing critical variable │ │
│ │ │ │
│ │ Action: SILENT — Store extraction as 'pending' with [Confirm] │ │
│ │ button (no chat message) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌─────────────────┴─────────────────┐
▼ ▼
┌─────────┐ ┌─────────────┐
│ SILENT │ │ SPEAK │
└────┬────┘ └──────┬──────┘
│ │
▼ ▼
┌──────────────────────────┐ ┌─────────────────────────────────────┐
│ STORE AS PENDING │ │ POST TO GROUP CHAT │
│ ━━━━━━━━━━━━━━━━━━━━━━━━ │ │ ━━━━━━━━━━━━━━━━━━━━━━━ │
│ │ │ │
│ Table: pending_actions │ │ Message includes: │
│ Status: 'unconfirmed' │ │ • Clear conflict description │
│ UI: Telegram inline │ │ • Suggested resolution │
│ [Confirm] button │ │ • [Resolve] inline button │
│ │ │ │
│ Human can review via: │ │ Example: │
│ • /pending command │ │ "⚠️ Conflict detected: │
│ • Dashboard │ │ Leo has soccer 4:30pm Tuesday, │
│ │ │ but message mentions chess club │
│ [Confirm] → │ │ at 4:00pm. Both?" │
│ writes to Event Graph │ │ │
└──────────────────────────┘ └─────────────────────────────────────┘
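The four rules can be sketched as a pure function over extracted entities. This is an illustration under stated assumptions, not the engine itself: the function name `decide_redline`, the `start`, `duration`, `date`, and `task` fields, the one-hour default duration, and the shape of `known_events` and `other_claims` are all hypothetical.

```python
from datetime import datetime, timedelta

def decide_redline(extraction, known_events, other_claims, now):
    """Return (decision, trigger_rule) using extracted entities only."""
    start = extraction.get("start")  # datetime or None
    assignee = extraction.get("assigned_to")

    # Rule 1: same child, overlapping time window
    if start is not None:
        end = start + extraction.get("duration", timedelta(hours=1))
        for event in known_events:
            if (event["child"] == extraction.get("child")
                    and start < event["end"] and event["start"] < end):
                return "SPEAK", "temporal_conflict"

    # Rule 2: someone else already claimed the same task on the same date
    if assignee not in (None, "unspecified"):
        for claim in other_claims:
            if (claim["date"] == extraction.get("date")
                    and claim["task"] == extraction.get("task")
                    and claim["assigned_to"] != assignee):
                return "SPEAK", "resource_conflict"

    # Rule 3: missing critical variable on a time-sensitive item (<24h away)
    missing = assignee in (None, "unspecified") or extraction.get("child") is None
    if missing and start is not None:
        if timedelta(0) <= start - now < timedelta(hours=24):
            return "SPEAK", "missing_variable"

    # Rule 4: everything else stays silent and is stored as pending
    return "SILENT", None
```

No confidence score appears in the function; the gates have already run by the time it is called.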
Brevity Constraint (Cross-Reference Daedalus UX Specs)
Source: Daedalus UX Design System — Conversational Agents v2.1
Maximum Message Lengths
| Message Type | Max Length | Rationale |
|---|---|---|
| Conflict alert | 280 chars | Fits in single Telegram bubble |
| Clarification request | 200 chars | Quick to read, easy to answer |
| Confirmation summary | 140 chars | Twitter-length, scannable |
Brevity Patterns
BREVITY_TEMPLATES = {
"temporal_conflict": "⚠️ {child} has {activity} at {time}, but message says {conflict}. Which is correct?",
"resource_conflict": "🚨 Both {parent1} and {parent2} claim {task} on {day}. Who's covering?",
"missing_assignment": "⏳ Someone needs to cover {child}'s {activity} on {day}. Who?",
"missing_child": "⏳ {activity} mentioned for {day} — which child?",
}
Anti-Patterns (Never Do)
❌ "I noticed that in your message at 9:15 AM, you mentioned..."
❌ "Based on my analysis of the conversation history..."
❌ "It appears there may be a potential scheduling conflict..."
❌ Multi-paragraph explanations
✅ "⚠️ Leo has soccer 4:30pm Thursday. Message says dentist 4pm. Which?"
✅ "🚨 Both John and Sarah claim Tuesday pickup. Who's covering?"
✅ "⏳ Someone needs to cover Leo Thursday 3pm. Who?"
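A minimal sketch of enforcing the length limits when rendering a template. The mapping from template name to message type (`TEMPLATE_KIND`), the function name `render_message`, and truncation with an ellipsis are assumptions. Length is counted in Python code points, which may differ from how Telegram counts emoji.

```python
MAX_LENGTHS = {"conflict": 280, "clarification": 200, "summary": 140}

BREVITY_TEMPLATES = {
    "temporal_conflict": "⚠️ {child} has {activity} at {time}, but message says {conflict}. Which is correct?",
    "resource_conflict": "🚨 Both {parent1} and {parent2} claim {task} on {day}. Who's covering?",
    "missing_assignment": "⏳ Someone needs to cover {child}'s {activity} on {day}. Who?",
    "missing_child": "⏳ {activity} mentioned for {day} — which child?",
}

# Assumed mapping from template to the message types in the table above
TEMPLATE_KIND = {
    "temporal_conflict": "conflict",
    "resource_conflict": "conflict",
    "missing_assignment": "clarification",
    "missing_child": "clarification",
}

def render_message(template_name, **fields):
    """Fill a template and cut it to the limit for its message type."""
    text = BREVITY_TEMPLATES[template_name].format(**fields)
    limit = MAX_LENGTHS[TEMPLATE_KIND[template_name]]
    if len(text) > limit:
        # Truncate with an ellipsis rather than send an over-long alert
        text = text[: limit - 1].rstrip() + "…"
    return text

msg = render_message("resource_conflict", parent1="John", parent2="Sarah",
                     task="pickup", day="Tuesday")
```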
API Contract: Observer ↔ Brain
1. Brain Query Endpoint
Current: https://icarus-test.hoffdesk.com/brain/query?q={question}
For Observer Integration: Extend with structured query support
GET /brain/query?q={question}&context=observer&format=json
Headers:
  X-Observer-Request: true
  X-Family-Context: miller  # staging only
Response (current):
{
  "answer": "Leo's soccer practice is Tuesdays and Thursdays at 4:30 PM...",
  "sources": [...],
  "confidence": "high"
}
Response (extended for Observer):
{
  "answer": "...",
  "sources": [...],
  "confidence": "high",
  "relevance_score": 0.769,  # ← Used for brain_relevance gate
  "temporal_entities": [
    {"type": "time", "value": "16:30", "context": "soccer practice"},
    {"type": "day", "value": "Tuesday", "context": "recurring"}
  ],
  "extracted_events": [
    {
      "summary": "Leo Soccer Practice",
      "start": "2026-05-05T16:30:00",
      "location": "Westside Park"
    }
  ]
}
2. Observer → Brain Query Patterns
The Observer queries the Brain for context before making Redline decisions:
| Observer Detected | Brain Query | Purpose |
|---|---|---|
| "Leo soccer Thursday" | "When is Leo's soccer practice?" | Verify against known schedule |
| "Mia has ballet Monday" | "What days does Mia have ballet?" | Check for conflicts |
| "pick up tomorrow 3pm" | "Who usually picks up the kids on [day]?" | Resource assignment history |
| "early dismissal Friday" | "Any events on Friday involving school?" | Temporal conflict check |
3. Brain Query Constraints
# Observer-specific query limits
OBSERVER_BRAIN_CONFIG = {
    "max_queries_per_message": 3,    # Prevent query spam
    "query_timeout_ms": 5000,        # Fail fast if Brain slow
    "cache_ttl_seconds": 60,         # Cache recent queries
    "min_relevance_threshold": 0.5,  # Gate threshold (logged as brain_relevance)
    "staging_only": True,            # Enforce staging environment
}
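A minimal sketch of how the query cap, cache TTL, and relevance gate could be applied on the Observer side. The `BrainClient` class, its injected `fetch` callable, and the injected clock are hypothetical; the timeout and the actual HTTP call are deliberately left out.

```python
import time

OBSERVER_BRAIN_CONFIG = {
    "max_queries_per_message": 3,
    "cache_ttl_seconds": 60,
    "min_relevance_threshold": 0.5,
}

class BrainClient:
    def __init__(self, fetch, config=OBSERVER_BRAIN_CONFIG, clock=time.monotonic):
        self._fetch = fetch    # callable(question) -> response dict
        self._config = config
        self._clock = clock
        self._cache = {}       # question -> (expires_at, response)

    def query(self, question):
        """Return a cached response if still fresh, otherwise fetch and cache."""
        now = self._clock()
        hit = self._cache.get(question)
        if hit and hit[0] > now:
            return hit[1]
        response = self._fetch(question)
        self._cache[question] = (now + self._config["cache_ttl_seconds"], response)
        return response

    def context_for(self, questions):
        """Ask at most max_queries_per_message questions; keep answers that pass the gate."""
        usable = []
        for question in questions[: self._config["max_queries_per_message"]]:
            response = self.query(question)
            score = response.get("relevance_score")
            if score is not None and score >= self._config["min_relevance_threshold"]:
                usable.append(response)
        return usable
```

Injecting `fetch` keeps the cap and cache logic testable without a running Brain endpoint.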
Data Flow: Complete Walkthrough
Example 1: Conflict Detection (Speak)
Chat Message (Sarah → Group):
"I'll pick up Leo from school tomorrow at 3pm and take him to soccer"
Step-by-Step:
1. Tripwire Match (Tier 1)
   - Patterns: "pick up", "tomorrow", "3pm", "soccer", "Leo"
   - tripwire_confidence: 0.85 (≥ 0.7 threshold ✓)
   - Action: Queue for Tier 2
2. Async Processing (Tier 2)
   - Extract: {"date": "2026-05-01", "time": "15:00", "child": "leo", "activity": "soccer", "assigned_to": "sarah"}
   - extraction_confidence: 0.78 (≥ 0.6 threshold ✓)
3. Brain Query
   - Query: "What time is Leo's soccer practice?"
   - Response: "Leo's soccer practice is Tuesdays and Thursdays at 4:30 PM at Westside Park"
   - brain_relevance: 0.82 (≥ 0.5 threshold ✓, use context)
4. Redline Check
   - Message says: 3pm pickup, soccer implied
   - Brain says: Soccer is 4:30pm (1.5h later)
   - No direct conflict, BUT time gap is suspicious
5. Decision: SPEAK (Rule 3 variant, clarification needed)
   "⏳ Leo pickup 3pm for soccer, but practice is 4:30pm. Different activity?"
Example 2: Tripwire Drop (Not Coordination)
Chat Message (John → Group):
"Got the oil changed today. Next due in July."
Step-by-Step:
1. Tripwire Match
   - Patterns: "today", "July", "oil changed"
   - tripwire_confidence: 0.65 (< 0.7 threshold ✗)
   - Action: DROP (not coordination-related)
Example 3: Multi-Message Context Resolution (Silent)
Message A (Sarah, 09:00):
"Who's picking up Leo Thursday after soccer?"
Message B (John, 09:15):
"I'll do it"
Step-by-Step:
1. Message A Tripwire
   - tripwire_confidence: 0.88 (question about assignment)
   - Queued for Tier 2
   - Extraction: {"date": "2026-05-07", "child": "leo", "activity": "soccer", "assigned_to": "unspecified"}
   - Stored with thread_id
2. Message B Tripwire (Context-Aware)
   - Base score: 0.45 (just "I'll do it", low confidence alone)
   - Context boost: +0.20 (short response in coordination thread)
   - Final tripwire_confidence: 0.65 (< 0.7 threshold...)
   - BUT: Attribution pattern detected
   - Override: Queue for Tier 2 with context window
3. Tier 2 with Context
   - Fetches Message A via thread_id
   - Attribution logic: John (sender of B) → assignment from A
   - Final extraction: {"date": "2026-05-07", "child": "leo", "activity": "soccer", "assigned_to": "john", "attribution_confidence": 0.85}
4. Brain Query
   - Query: "When is Leo's soccer practice?"
   - brain_relevance: 0.91 (confirms Thursday 4:30pm)
5. Redline Check
   - Assigned to: John
   - Brain context: No conflict detected
   - Decision: SILENT
6. Store as Pending
   - Type: care_assignment
   - Summary: "John assigned to pick up Leo from soccer Thursday 4:30pm"
   - [Confirm] button available
Example 4: Resource Conflict (Speak)
Chat Message 1 (John, 09:00):
"I can cover Tuesday pickup"
Chat Message 2 (Sarah, 09:15):
"I'll get the kids Tuesday after work"
Step-by-Step:
1. Tripwire Matches
   - Both messages match assignment patterns
   - tripwire_confidence: 0.79, 0.82
   - Both queued for Tier 2
2. Async Processing
   - Extract John: {"date": "2026-05-05", "assigned_to": "john", "task": "pickup"}
     - extraction_confidence: 0.74
   - Extract Sarah: {"date": "2026-05-05", "assigned_to": "sarah", "task": "pickup"}
     - extraction_confidence: 0.81
3. Brain Query
   - Query for both: "Who usually picks up kids on Tuesdays?"
   - brain_relevance: 0.76 (historical pattern: usually John)
4. Redline Check
   - Same date
   - Same task (pickup)
   - Different assignees
   - RESOURCE CONFLICT (Rule 2)
5. Decision: SPEAK
   "🚨 John and Sarah both claim Tuesday pickup. Who's covering?" [John] [Sarah] [Both (carpool)]
Database Schema
observer_messages — Raw Chat Log
CREATE TABLE observer_messages (
id INTEGER PRIMARY KEY AUTOINCREMENT,
message_id TEXT UNIQUE NOT NULL, -- Telegram message ID
chat_id TEXT NOT NULL, -- Group chat ID
thread_id TEXT, -- Links messages in same conversation
sender TEXT NOT NULL, -- john | sarah | icarus
message_text TEXT NOT NULL,
sent_at TIMESTAMP NOT NULL,
context_window TEXT, -- JSON array of prior message_ids
-- Tier 1 Tripwire (Independent confidence)
tripwire_confidence REAL, -- 0.0-1.0, threshold 0.7
tripwire_patterns TEXT, -- JSON array of matched patterns
tier1_decision TEXT, -- DROP | QUEUE
-- Tier 2 Processing
processed_at TIMESTAMP,
extracted_json TEXT, -- Full extraction JSON
attribution_source TEXT, -- message_id this was resolved from
-- Brain Query (Independent relevance score)
brain_queries TEXT, -- JSON array of Brain queries made
brain_context TEXT, -- Brain responses
brain_relevance REAL, -- 0.0-1.0, threshold 0.5
-- Redline Decision (Rule-based, no confidence score)
redline_decision TEXT, -- SILENT | SPEAK
redline_trigger_rule TEXT, -- temporal_conflict | resource_conflict | missing_variable
-- Outcome
action_id TEXT, -- FK to pending_actions (if SILENT)
notification_id TEXT, -- Telegram message ID (if SPEAK)
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_observer_thread ON observer_messages(thread_id, sent_at);
CREATE INDEX idx_observer_tripwire ON observer_messages(tripwire_confidence, tier1_decision);
CREATE INDEX idx_observer_decision ON observer_messages(redline_decision, processed_at);
observer_confidence_log — Debug Trail
CREATE TABLE observer_confidence_log (
id INTEGER PRIMARY KEY AUTOINCREMENT,
message_id TEXT NOT NULL,
stage TEXT NOT NULL, -- 'tripwire' | 'extraction' | 'brain'
confidence_type TEXT, -- 'regex_score' | 'llm_score' | 'relevance'
score REAL NOT NULL,
threshold REAL NOT NULL,
passed BOOLEAN NOT NULL,
decision TEXT NOT NULL, -- 'proceed' | 'drop' | 'proceed_no_context'
logged_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_confidence_msg ON observer_confidence_log(message_id, stage);
pending_actions — Human-in-the-Loop Queue
CREATE TABLE pending_actions (
id INTEGER PRIMARY KEY AUTOINCREMENT,
action_id TEXT UNIQUE NOT NULL, -- UUID for this action
-- Source
source_type TEXT NOT NULL, -- OBSERVER | EMAIL_MANUAL | etc
source_id TEXT, -- FK to observer_messages or email_id
-- Extracted State (what the system thinks it heard)
action_type TEXT NOT NULL, -- maintenance_update | event_create |
-- care_assignment | schedule_change
extracted_json TEXT NOT NULL, -- Full extraction
extraction_confidence REAL NOT NULL, -- Independent confidence (not blended)
-- Human Review State
status TEXT NOT NULL DEFAULT 'pending', -- pending | confirmed | rejected | expired
suggested_event_json TEXT, -- What would be written to Event Graph
-- UI/UX (Brevity constraint enforced)
summary_text TEXT NOT NULL, -- Human-readable summary (max 140 chars)
confirm_payload TEXT, -- JSON to send on confirm
-- Expiration
expires_at TIMESTAMP, -- Auto-expire if not confirmed
confirmed_at TIMESTAMP,
confirmed_by TEXT, -- john | sarah
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_pending_status ON pending_actions(status, expires_at);
CREATE INDEX idx_pending_source ON pending_actions(source_type, source_id);
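A minimal sketch of the [Confirm] path, assuming SQLite and the `pending_actions` columns above. The function name `confirm_action` and the injected `write_event` callback (standing in for the Event Graph write) are hypothetical. The point illustrated is State Protection: nothing is written unless a pending, unexpired row is confirmed by a person.

```python
import sqlite3

def confirm_action(conn, action_id, confirmed_by, write_event):
    """Confirm a pending action; only then hand the event to write_event."""
    cur = conn.execute(
        "UPDATE pending_actions "
        "SET status = 'confirmed', confirmed_at = CURRENT_TIMESTAMP, confirmed_by = ? "
        "WHERE action_id = ? AND status = 'pending' "
        "AND (expires_at IS NULL OR expires_at > CURRENT_TIMESTAMP)",
        (confirmed_by, action_id),
    )
    if cur.rowcount != 1:
        # Unknown, already handled, or expired: nothing is written
        return False
    row = conn.execute(
        "SELECT suggested_event_json FROM pending_actions WHERE action_id = ?",
        (action_id,),
    ).fetchone()
    write_event(row[0])
    conn.commit()
    return True
```

The status check lives inside the UPDATE, so a second tap on [Confirm] matches zero rows and cannot write the event twice.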
Error Handling & Fallback Paths
1. Brain Query Timeout
# If Brain doesn't respond in 5s
if brain_response.status == "timeout":
    # Log independent brain_relevance as NULL
    log_confidence(message_id, stage="brain", score=None, threshold=0.5,
                   passed=False, decision="proceed_no_context")
    # Degrade gracefully: proceed without Brain context
    extraction["brain_context"] = None
    extraction["brain_relevance"] = None
    # Decision: Evaluate redline with extraction-only (lower threshold)
    redline_decision = evaluate_redline(extraction, has_brain_context=False)
    reason = "brain_timeout_degraded"
2. Brain Returns Low Relevance
# If top score < 0.5
if brain_response.relevance_score < 0.5:
    # Log independent brain_relevance
    log_confidence(message_id, stage="brain", score=brain_response.relevance_score,
                   threshold=0.5, passed=False, decision="proceed_no_context")
    # No useful context found
    extraction["brain_context"] = None
    # Decision: Depends on extraction_confidence (independent gate)
    if extraction["extraction_confidence"] > 0.8:
        redline_decision = evaluate_redline(extraction, has_brain_context=False)
    else:
        redline_decision = "SILENT"
        reason = "low_confidence_no_brain"
3. Tier 2 Worker Crash
# Celery retry with exponential backoff
@celery.task(bind=True, max_retries=3)
def process_tier2(self, message_id):
    try:
        ...  # processing logic
    except Exception as e:
        # After 3 failed retries: mark for manual review instead of retrying again.
        # This check must run before self.retry(), which raises.
        if self.request.retries >= self.max_retries:
            mark_for_manual_review(message_id, error=str(e))
            return
        # Retry in 30s, 60s, 120s (30 * 2^retries)
        raise self.retry(exc=e, countdown=30 * (2 ** self.request.retries))
4. Duplicate Detection
import json

# Prevent duplicate pending actions for same event
def check_duplicate(extraction):
    # json_extract returns arrays as minified JSON text, so serialize the
    # dates list the same way before comparing
    dates_json = json.dumps(extraction["dates"], separators=(",", ":"))
    similar = db.query("""
        SELECT * FROM pending_actions
        WHERE action_type = ?
          AND json_extract(extracted_json, '$.dates') = ?
          AND json_extract(extracted_json, '$.child') = ?
          AND status = 'pending'
          AND created_at > datetime('now', '-1 hour')
    """, [extraction["type"], dates_json, extraction["child"]])
    if similar:
        # Merge or skip
        return "duplicate_detected"
    return None
Staging Environment Configuration
File: .env.staging.observer
# Environment
ENV=staging
DB_PATH=./data/staging.db
FAMILY_CONTEXT=./TEST_FAMILY_CONTEXT.md
# Telegram (Test Bot)
TELEGRAM_BOT_TOKEN=${TELEGRAM_TEST_BOT_TOKEN}
TELEGRAM_GROUP_CHAT_ID=${TELEGRAM_TEST_GROUP_ID}
# Brain Integration
BRAIN_BASE_URL=https://icarus-test.hoffdesk.com
BRAIN_QUERY_ENDPOINT=/brain/query
BRAIN_TIMEOUT_MS=5000
BRAIN_MIN_RELEVANCE=0.5 # Gate threshold for brain_relevance
# LLM (Gaming PC via Tailscale)
OLLAMA_HOST=http://matt-pc.tail864e81.ts.net:11434
OLLAMA_MODEL_TIER2=phi4:14b
# Tripwire
TRIPWIRE_THRESHOLD=0.7 # Independent gate threshold
EXTRACTION_THRESHOLD=0.6 # Independent gate threshold
TRIPWIRE_PATTERNS_PATH=./config/tripwire_patterns.json
# Multi-Message Context
CONTEXT_WINDOW_SIZE=3
CONTEXT_TTL_SECONDS=300
ATTRIBUTION_WINDOW_SECONDS=600
# Async Processing
REDIS_URL=redis://localhost:6379/0
CELERY_WORKERS=2
# Redline Rules
REDLINE_TEMPORAL_CONFLICT=true
REDLINE_RESOURCE_CONFLICT=true
REDLINE_MISSING_VARIABLE=true
SPEAK_ON_MISSING_CRITICAL=true
# Limits
MAX_BRAIN_QUERIES_PER_MESSAGE=3
PENDING_ACTION_TTL_HOURS=48
MAX_MESSAGES_PER_MINUTE=60
# Brevity Constraint (Daedalus UX Spec)
MAX_CONFLICT_MESSAGE_LENGTH=280
MAX_CLARIFICATION_LENGTH=200
MAX_SUMMARY_LENGTH=140
Miller Family Test Context
Test Data Pre-Seeded:
- Leo's soccer: Tuesdays/Thursdays 4:30 PM @ Westside Park
- Mia's ballet: Mondays/Wednesdays 4:00 PM @ Madison Dance Academy
- Mia's swim: Saturdays 10:00 AM @ YMCA
- Leo's chess: Wednesdays 3:30 PM
- Spring break: March 24-28, 2026
- Honda Civic service: Due April 15, 2026
Implementation Phases
Week 0.5: Data Collection & Labeling (NEW — Critical)
Before any implementation:
- [ ] Record 100 real messages from Family Logistics group
- [ ] Hand-label for coordination signals:
  - is_coordination: true/false
  - coordination_type: transport | care_coverage | schedule_change | activity_confirm | none
  - attribution_required: true/false (for multi-message context)
- [ ] Extract noisy data examples:
  - Typos: "socer", "pratice", "balet"
  - Abbreviations: "wk" (week), "2moro", "thx"
  - Emoji: "👍", "🙋", "✅"
  - Half-sentences: "yeah 3pm", "me", "i'll do it"
  - Mixed signals: "Leo soccer thursday??" (question vs statement)
- [ ] Tune tripwire regex against noisy data
- [ ] Establish baseline precision/recall before code is written
Noisy Data Test Cases (TC-006 to TC-020):
| TC | Message | Expected | Notes |
|---|---|---|---|
| TC-006 | "socer tmrw 4" | Queue | Typo: "socer", abbreviation: "tmrw" |
| TC-007 | "👍" (reply to coordination) | Queue | Emoji-only in context |
| TC-008 | "yeah 3pm" | Queue | Half-sentence, needs context |
| TC-009 | "i'll do it" | Queue | Pronoun reference, needs attribution |
| TC-010 | "wk pickup" | Queue | Abbreviation: "wk" |
| TC-011 | "pratice is cancled" | Queue | Multiple typos |
| TC-012 | "thx!" | Drop | Gratitude, not coordination |
| TC-013 | "Mia has bailt monday" | Queue | Typo in activity name |
| TC-014 | "u sure?" | Depends | Question, usually drop unless context says otherwise |
| TC-015 | "4:30 work for me" | Queue | Time-only, needs context |
| TC-016 | "🙋‍♀️" | Queue | Self-assignment via emoji |
| TC-017 | "kk" | Depends | Acknowledgment, context-dependent |
| TC-018 | "Leo dentist 2mrw??" | Queue | Question mark + abbreviation |
| TC-019 | "n/m" | Drop | "Never mind" — cancels previous |
| TC-020 | "omw" | Queue | "On my way" — active coordination |
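Regex alone will not catch typos such as "socer". One option, sketched below, is to normalize tokens before the tripwire runs, using an abbreviation table plus fuzzy matching from Python's standard difflib module. The vocabulary, the abbreviation table, the 0.8 cutoff, and the function name `normalize` are illustrative assumptions to be tuned against the Week 0.5 labeled data.

```python
import difflib
import re

ABBREVIATIONS = {
    "tmrw": "tomorrow", "2moro": "tomorrow", "2mrw": "tomorrow",
    "wk": "week", "thx": "thanks",
}
VOCABULARY = ["soccer", "ballet", "swim", "chess", "practice",
              "pickup", "tomorrow", "cancelled"]

def normalize(text, cutoff=0.8):
    """Lowercase, expand known abbreviations, and fix near-miss spellings."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    out = []
    for token in tokens:
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
            continue
        # Only fuzzy-match longer tokens; short words produce false corrections
        if len(token) >= 4:
            close = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=cutoff)
            if close:
                out.append(close[0])
                continue
        out.append(token)
    return " ".join(out)
```

Emoji-only messages (TC-007, TC-016) yield no tokens here and would need separate handling.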
Phase 8.1: Tripwire + Logger (Week 1)
- [ ] Implement Tier 1 Tripwire (regex) with independent confidence scoring
- [ ] Create observer_messages table with thread support
- [ ] Create observer_confidence_log table for debugging
- [ ] Log all messages, mark 95% as DROP
- [ ] Deploy to staging, run for 3 days
- [ ] Measure: false positive rate against Week 0.5 labeled data (target <5%)
Phase 8.2: Async Tier 2 (Week 2)
- [ ] Implement Celery worker for Tier 2
- [ ] Integration with Brain query endpoint (independent relevance scoring)
- [ ] Create pending_actions table
- [ ] Implement [Confirm] button flow
- [ ] Deploy to staging, run for 3 days
Phase 8.3: Redline Rules + Multi-Message Context (Week 3) — P0 Priority
- [ ] Implement temporal conflict detection (Rule 1)
- [ ] Implement resource conflict detection (Rule 2)
- [ ] Implement missing variable detection (Rule 3)
- [ ] Add multi-message context window (3 messages)
- [ ] Implement attribution logic for pronoun resolution
- [ ] Add SPEAK pathway for conflicts
- [ ] Enforce brevity constraints (Daedalus UX spec)
- [ ] Deploy to staging with Miller family
Multi-Message Test Scenarios:
| Scenario | Messages | Expected | Confidence / Notes |
|---|---|---|---|
| TC-021 | A: "Who's getting Leo?" / B: "me" | Attribution → John | extraction_confidence ≥ 0.75 |
| TC-022 | A: "Soccer pickup Thursday?" / B: "I'll do it" / C: "thx" | Attribution + silent | C drops |
| TC-023 | A: "Mia ballet Monday" / B: "I'll cover" / C: "👍" | Attribution + silent | C is emoji acknowledgment |
| TC-024 | A: "Both kids Tuesday" / B: "I got Leo" / C: "I got Mia" | Double attribution + silent | Unless times conflict |
Phase 8.4: Integration Test (Week 4)
- [ ] Run full test scenarios (TC-001 to TC-024)
- [ ] Measure: independent confidence gate accuracy
- [ ] Measure: multi-message attribution accuracy
- [ ] Measure: noisy data handling
- [ ] Measure: family comfort metrics (see Success Metrics below)
- [ ] Adjust thresholds based on results
- [ ] Document learnings for production
Success Metrics
Primary Metric: Family Comfort (Non-Negotiable)
UX Acceptance Gate:
"Aundrea has never asked 'why did the bot just say that?' for 7 consecutive days"
This is the only metric that matters. Technical precision/recall are secondary to human experience.
Secondary Metrics
| Metric | Target | Measurement |
|---|---|---|
| Tripwire Precision | >95% | % of queued messages that are true coordination, on labeled data (Week 0.5) |
| Tripwire Recall | >90% | % of coordination messages caught |
| Extraction Confidence Accuracy | >85% | Correlation between extraction_confidence and human-verified correctness |
| Brain Relevance Accuracy | >80% | Correlation between brain_relevance and context usefulness |
| Multi-Message Attribution | >85% | Correct "I'll do it" → prior action resolution |
| Brain Query Latency | <3s | Average response time |
| Brain Query Success | >95% | % of queries returning valid response |
| Redline Accuracy | >90% | Correct SPEAK vs SILENT decisions (measured by human review) |
| Pending Confirmation Rate | >70% | % of silent extractions eventually confirmed |
| Chat Interruptions | <2/day | SPEAK messages per day (excluding test scenarios) |
| Aundrea Confusion Events | 0 in 7 days | Number of times primary user questions bot behavior |
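For the tripwire rows, precision and recall can be computed directly from the Week 0.5 hand labels. A minimal sketch, assuming parallel lists of booleans where a label of True means "coordination" and a prediction of True means "queued for Tier 2"; the function name is illustrative.

```python
def precision_recall(labels, predictions):
    """Return (precision, recall) for parallel lists of booleans."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

labels = [True, True, True, False, False]
predictions = [True, True, False, True, False]
precision, recall = precision_recall(labels, predictions)
```

Here two of three queued messages are real coordination and two of three coordination messages were caught, so both values are 2/3.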
Anti-Metric (What NOT to Optimize)
- Don't optimize for: Message extraction volume
- Don't optimize for: Brain query coverage
- Don't optimize for: Total features detected
Optimize for: Family comfort. Full stop.
Security & Privacy
Data Handling
- Bounded raw-message storage: Raw chat messages are kept only to support extraction and are purged after 30 days
- No PII in Brain queries: Queries are abstract ("soccer schedule" not "where is Leo")
- Local-first: All processing on Beelink/Gaming PC, no cloud LLM for chat
Telegram Privacy
- Privacy Mode: Disabled for test bot (BotFather → Group Privacy → Off)
- Bot can see: All group messages
- Bot cannot: See direct messages unless replied
Staging Isolation
- Database: staging.db completely separate from prod.db
- Bot: Separate test bot token
- Family: Miller family test data only
- No production data: Observer will NEVER see real Hoffmann family chat
Open Questions
- Conflict Window: How close is "conflict"? Same day? ±2 hours? → Resolved: Same day for now; temporal overlap detection TBD
- Override: Can a parent "force" a silent extraction to speak? (e.g., urgent) → Open: Not for Phase 8
- Learning: Should Observer learn from confirmed/rejected actions? (Phase 8.5?) → Open: Post-family-adoption
- Multi-message Context: Should attribution expire? → Resolved: 10-minute window
Appendix A: Tripwire Pattern Reference
{
"tripwire_patterns": {
"temporal": [
"\\b(tomorrow|today|next week|this week)\\b",
"\\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \\d{1,2}\\b",
"\\b(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\\b",
"\\b\\d{1,2}:\\d{2}\\b",
"\\b\\d{1,2}(\\s)?(am|pm)\\b"
],
"assignment": [
"\\b(I('ll| will)|can) (pick up|get|cover|take|bring)\\b",
"\\b(who('s| is) (getting|picking up|covering))\\b",
"\\b(cover( for)?|swap|switch)\\b",
"\\b(me|me too|i'll do it|i got it)\\b"
],
"pronoun_attribution": [
"\\b(i'll|i will|i can|i'm|i am|me|me too)\\b"
],
"children": [
"\\b(Leo|Mia)\\b",
"\\b(kids?|children)\\b",
"\\b(son|daughter)\\b"
],
"activities": [
"\\b(soccer|ballet|swim|chess|practice|lesson|class)\\b",
"\\b(game|match|meet|recital|performance)\\b"
],
"conflict_markers": [
"\\b(but|wait|hold on|doesn't|didn't|conflict|overlap)\\b",
"\\b(what about|how about)\\b"
],
"noise_patterns": [
"\\b(thx|thanks|ty|kk|omw)\\b",
"(👍|✅|🙋)"
]
}
}
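A minimal sketch of applying these patterns, using a subset of the categories above. It assumes case-insensitive matching (the patterns mix "Leo" with lowercase "i'll") and returns only the matched categories; how categories map to a tripwire_confidence value is not defined here and is left out. The function name `matched_categories` is illustrative.

```python
import re

TRIPWIRE_PATTERNS = {
    "temporal": [
        r"\b(tomorrow|today|next week|this week)\b",
        r"\b\d{1,2}(\s)?(am|pm)\b",
    ],
    "assignment": [
        r"\b(I('ll| will)|can) (pick up|get|cover|take|bring)\b",
    ],
    "children": [r"\b(Leo|Mia)\b"],
    "activities": [r"\b(soccer|ballet|swim|chess|practice|lesson|class)\b"],
}

# Compile once at startup; Tier 1 has a <10ms budget per message
COMPILED = {
    category: [re.compile(p, re.IGNORECASE) for p in patterns]
    for category, patterns in TRIPWIRE_PATTERNS.items()
}

def matched_categories(text):
    """Return the sorted list of pattern categories that match the text."""
    return sorted(
        category
        for category, patterns in COMPILED.items()
        if any(p.search(text) for p in patterns)
    )
```

Note that word boundaries (`\b`) do not match around a standalone emoji, so emoji patterns need to be written without them.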
Appendix B: Mermaid Diagram (Simplified)
flowchart TD
A[Family Chat Message] --> B{Tier 1 Tripwire}
B -->|tripwire_confidence < 0.7| C[Drop Silently]
B -->|tripwire_confidence >= 0.7| D[Queue for Tier 2]
D --> E[Async Tier 2 Worker]
E --> F[Extract Coordination State]
F -->|extraction_confidence < 0.6| C
F -->|extraction_confidence >= 0.6| G[Query Brain for Context]
G --> H{brain_relevance >= 0.5?}
H -->|Yes| I[Redline with Brain Context]
H -->|No| J[Redline Extraction-Only]
I --> K{Redline Decision}
J --> K
K -->|No Conflict| L[Store as Pending]
K -->|Conflict Detected| M[Speak in Chat]
L --> N[[Confirm Button]]
N -->|Confirmed| O[Write to Event Graph]
N -->|Rejected| P[Discard]
M --> Q[[Resolve Button]]
Q --> R[Update Event Graph]
The conversation is the interface. The brain provides memory. The observer knows when to speak. The family never asks "why did the bot just say that?"