# Thoth Evaluation: Batch Results (2026-05-09)

**Next Demo Day:** Saturday 2026-05-16

---

## Summary After First Batch (33 messages)

| Metric | Result |
|--------|--------|
| High confidence (≥0.7) | 27/33 (82%) |
| Casually suppressed (correct) | 5/33 |
| Issues found | 0 structural, 2-3 directional |
| has_pet bug | ✅ FIXED (no longer misclassifies groceries) |

## What Works Well

- **Calendar/scheduling:** Perfect extraction on all 6 test messages. Correctly identifies people, events, times, locations.
- **School/activities:** Solid entity extraction. Picks up participants and event types accurately.
- **Casual chat suppression:** 5/5 correctly filtered out (low confidence).
- **Errand type detection:** Now correctly uses `responsible_for` instead of `has_pet` for tasks/objects.

## Remaining Issues

1. **Relation direction bugs (2-3 cases):** "Sully has soccer practice with Coach Mike" → `Sully--[coaches]-->Coach Mike` (reversed: Coach Mike is the coach). The LLM doesn't consistently respect the prompt's direction rules.
2. **"Mom spouse_of Matt":** For "Matt's mom coming over" → inferred `Mom--[spouse_of]-->Matt`. Missing context that "Mom" likely means Matt's mother, not his spouse.
3. **Missing injury semantics:** "Sully twisted his ankle" → extracted as `attends soccer practice` but doesn't flag injury signal. The entity-relation model is great for knowledge graphs, less useful for triggering case creation.

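
One possible guard against the relation direction bugs described above is a post-extraction check that swaps subject and object when only the object carries the role title the relation implies. The sketch below is illustrative only: `Triple`, `SUBJECT_TITLE_HINTS`, and `fix_direction` are hypothetical names, not part of Thoth or Icarus, and the title-prefix heuristic is an assumption that would only catch cases like "Coach Mike".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """Hypothetical (subject, relation, object) triple; not Thoth's actual output type."""
    subject: str
    relation: str
    obj: str


# Assumed rule table: for an asymmetric relation, the title prefix
# we expect the *subject* to carry (e.g. the coach is the one who "coaches").
SUBJECT_TITLE_HINTS = {"coaches": "coach"}


def fix_direction(t: Triple) -> Triple:
    """Swap subject and object when only the object matches the expected title."""
    hint = SUBJECT_TITLE_HINTS.get(t.relation)
    if hint is None:
        return t  # no rule for this relation; leave it untouched
    subject_matches = t.subject.lower().startswith(hint)
    object_matches = t.obj.lower().startswith(hint)
    if object_matches and not subject_matches:
        return Triple(subject=t.obj, relation=t.relation, obj=t.subject)
    return t


reversed_triple = Triple("Sully", "coaches", "Coach Mike")
print(fix_direction(reversed_triple))
```

A rule table like this only covers relations where a title appears in the entity name, so it would complement prompt-side fixes such as few-shot examples rather than replace them.
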
## Next Steps

- [ ] Let shadow run for 1 week alongside production (2026-05-09 → 2026-05-16)
- [ ] Review accumulated comparison logs before demo day
- [ ] Fix direction bugs (consider adding few-shot examples to prompt with correct directions)
- [ ] Add injury-specific relation types for RTSport use case (optional for general Icarus)

## How to Run Comparison Mode

```bash
cd ~/icarus && python3 -c "
from experiments.thoth.adapter import ThothExtractor
from icarus.extractor import extract as prod_extract

extractor = ThothExtractor()

# Single message test
result = extractor.extract_raw('Sully has soccer practice at 5pm')
print(result)
"
```

## How to Enable Shadow Mode

```bash
ICARUS_EXPERIMENT=true ICARUS_EXPERIMENT_NAME=thoth python3 -m icarus.daemon
```

This runs Thoth alongside production. Results are logged to `experiments/thoth/eval_results/`. Production is never modified.
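
To make the shadow-mode guarantee concrete, here is a minimal sketch of the general pattern: the production extractor's result is always returned, while the experimental extractor runs on the side and its output (or its error) is appended to a log. This is not the actual `icarus.daemon` implementation; `run_with_shadow`, the JSON Lines record layout, and the log file name are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Any, Callable


def run_with_shadow(
    message: str,
    prod_extract: Callable[[str], Any],
    shadow_extract: Callable[[str], Any],
    log_path: Path,
) -> Any:
    """Return the production result; log the shadow result without affecting it."""
    prod_result = prod_extract(message)

    # A failure in the experiment must never reach the production path.
    try:
        shadow_result, shadow_error = shadow_extract(message), None
    except Exception as exc:
        shadow_result, shadow_error = None, repr(exc)

    record = {
        "message": message,
        "prod": prod_result,
        "shadow": shadow_result,
        "shadow_error": shadow_error,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Append one JSON object per line; default=str covers non-JSON types.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, default=str) + "\n")

    return prod_result
```

With this shape, a caller could pass the production `extract` function and `ThothExtractor().extract_raw` as the two callables, assuming both accept a single message string.
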