Thoth Evaluation — Batch Results (2026-05-09)
Next Demo Day: Saturday 2026-05-16
Summary After First Batch (33 messages)
| Metric | Result |
|---|---|
| High confidence (≥0.7) | 27/33 (82%) |
| Casual chat suppressed (correct) | 5/33 |
| Issues found | 0 structural, 2-3 directional |
| has_pet bug | ✅ FIXED (no longer misclassifies groceries) |
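
The high-confidence rate in the table can be recomputed from per-message confidence scores. The following is a minimal sketch, assuming each result exposes a numeric `confidence`; the `Result` class and `summarize` function are hypothetical names for illustration, not part of Thoth or Icarus.

```python
from dataclasses import dataclass

HIGH_CONFIDENCE = 0.7  # threshold used in the summary table


@dataclass
class Result:
    # Hypothetical container; the real extractor output may look different.
    message: str
    confidence: float


def summarize(results, threshold=HIGH_CONFIDENCE):
    """Return (high_count, total, percent) for results at or above the threshold."""
    total = len(results)
    high = sum(1 for r in results if r.confidence >= threshold)
    percent = round(100 * high / total) if total else 0
    return high, total, percent


if __name__ == "__main__":
    # Synthetic scores for illustration only, not the actual batch data.
    batch = [Result("m", 0.9)] * 27 + [Result("m", 0.2)] * 6
    high, total, percent = summarize(batch)
    print(f"High confidence: {high}/{total} ({percent}%)")
```

Keeping the threshold as a named constant makes it easy to see how the rate shifts if the cutoff changes.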
What Works Well
- Calendar/scheduling: Perfect extraction on all 6 test messages. Correctly identifies people, events, times, locations.
- School/activities: Solid entity extraction. Picks up participants and event types accurately.
- Casual chat suppression: 5/5 correctly filtered out (low confidence).
- Errand type detection: Now correctly uses `responsible_for` instead of `has_pet` for tasks/objects.
Remaining Issues
- Relation direction bugs (2-3 cases): "Sully has soccer practice with Coach Mike" → `Sully--[coaches]-->Coach Mike` (reversed; Coach Mike is the coach). The LLM doesn't consistently respect the prompt's direction rules.
- "Mom spouse_of Matt": For "Matt's mom coming over" → inferred `Mom--[spouse_of]-->Matt`. Missing context that "Mom" likely means Matt's mother, not his spouse.
- Missing injury semantics: "Sully twisted his ankle" → extracted as `attends soccer practice` but doesn't flag an injury signal. The entity-relation model works well for knowledge graphs but is less useful for triggering case creation.
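
One way to catch reversed relations such as the `coaches` case is a post-extraction check that swaps subject and object when only the object holds the role the relation requires. This is a sketch under stated assumptions: the `Triple` type, the `SUBJECT_ROLE` table, and the `roles` lookup are hypothetical and do not reflect Thoth's actual data model.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# Hypothetical table: the role the SUBJECT of each relation must hold.
SUBJECT_ROLE = {"coaches": "coach"}


def fix_direction(triple, roles):
    """Swap subject and object if only the object holds the required role.

    `roles` maps an entity name to a set of known roles (assumed to come
    from existing knowledge about the household, not from the LLM).
    """
    required = SUBJECT_ROLE.get(triple.relation)
    if required is None:
        return triple  # no direction rule for this relation
    subject_ok = required in roles.get(triple.subject, set())
    object_ok = required in roles.get(triple.obj, set())
    if object_ok and not subject_ok:
        return Triple(triple.obj, triple.relation, triple.subject)
    return triple


if __name__ == "__main__":
    known_roles = {"Coach Mike": {"coach"}}
    bad = Triple("Sully", "coaches", "Coach Mike")
    print(fix_direction(bad, known_roles))
```

A check like this only helps where roles are already known; it leaves the triple untouched when neither or both entities hold the role.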
Next Steps
- [ ] Let shadow mode run for 1 week alongside production (2026-05-09 → 2026-05-16)
- [ ] Review accumulated comparison logs before demo day
- [ ] Fix direction bugs (consider adding few-shot examples to prompt with correct directions)
- [ ] Add injury-specific relation types for RTSport use case (optional for general Icarus)
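
The few-shot idea from the list above could look roughly like this. It is a sketch only: the prompt wording, the `build_prompt` helper, and the `has_injury` relation name are assumptions for illustration, and the real Thoth prompt may be structured differently.

```python
# Each pair is (message, expected relation in subject--[relation]-->object form).
FEW_SHOT = [
    # Corrected direction for the case reported under Remaining Issues.
    ("Sully has soccer practice with Coach Mike", "Coach Mike--[coaches]-->Sully"),
    # `has_injury` is a hypothetical injury-specific relation type.
    ("Sully twisted his ankle", "Sully--[has_injury]-->ankle"),
]


def build_prompt(message, examples=FEW_SHOT):
    """Build an extraction prompt with direction examples ahead of the message."""
    lines = [
        "Extract relations as subject--[relation]-->object.",
        "The subject is the entity that performs or holds the relation.",
        "",
        "Examples:",
    ]
    for text, relation in examples:
        lines.append(f'Message: "{text}"')
        lines.append(f"Relation: {relation}")
    lines.append("")
    lines.append(f'Message: "{message}"')
    lines.append("Relation:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("Ava has piano lessons with Ms. Lee"))
```

Keeping the examples in a plain list makes it straightforward to add one entry per direction bug found in the comparison logs.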
How to Run Comparison Mode
cd ~/icarus && python3 -c "
from experiments.thoth.adapter import ThothExtractor
from icarus.extractor import extract as prod_extract
extractor = ThothExtractor()
# Single message test
result = extractor.extract_raw('Sully has soccer practice at 5pm')
print(result)
"
How to Enable Shadow Mode
ICARUS_EXPERIMENT=true ICARUS_EXPERIMENT_NAME=thoth python3 -m icarus.daemon
This runs Thoth alongside production. Results are logged to experiments/thoth/eval_results/. Production is never modified.
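
Conceptually, a shadow run calls both extractors, logs the pair, and returns only the production result. The sketch below illustrates that pattern; `shadow_extract`, its parameters, and the JSON Lines log format are assumptions, not the actual `icarus.daemon` implementation.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("shadow")


def shadow_extract(message, prod_extract, experiment_extract, log_path):
    """Return the production result; log the experiment's result beside it.

    Both extractors are passed in as callables taking a message string
    (an assumed signature). Experiment failures are recorded, never raised.
    """
    prod_result = prod_extract(message)
    record = {"message": message, "production": prod_result}
    try:
        record["experiment"] = experiment_extract(message)
    except Exception as exc:  # the experiment must not break production
        record["experiment_error"] = repr(exc)
        logger.warning("experiment failed: %s", exc)

    # Append one JSON object per line so logs can be reviewed incrementally.
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, default=str) + "\n")
    return prod_result
```

The caller always receives `prod_result`, so a failing or slow-to-mature experiment cannot change what production returns; only the log grows.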