
Thoth Evaluation — Batch Results (2026-05-09)

Next Demo Day: Saturday 2026-05-16


Summary After First Batch (33 messages)

| Metric | Result |
| --- | --- |
| High confidence (≥0.7) | 27/33 (82%) |
| Casual chat suppressed (correct) | 5/33 |
| Issues found | 0 structural, 2-3 directional |
| has_pet bug | ✅ FIXED (no longer misclassifies groceries) |
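
The counts above can be reproduced from per-message scores with a small tally. This is a minimal sketch, not the evaluation harness itself: the `Scored` record, its field names, and the choice to treat anything below 0.7 as "suppressed" are assumptions made for illustration, and the confidence values in the usage example are placeholders chosen only to match the reported counts.

```python
from dataclasses import dataclass

HIGH_CONFIDENCE = 0.7  # threshold used in the summary (>= 0.7 counts as high)


@dataclass
class Scored:
    """Hypothetical per-message record; field names are assumptions."""
    text: str
    confidence: float  # extractor confidence in [0, 1]
    is_casual: bool    # hand label: True if the message is casual chat


def summarize(batch):
    """Tally high-confidence messages and correctly suppressed casual chat."""
    total = len(batch)
    high = sum(1 for m in batch if m.confidence >= HIGH_CONFIDENCE)
    # Assumption: a casual message counts as suppressed when it scores below the threshold.
    suppressed = sum(
        1 for m in batch if m.is_casual and m.confidence < HIGH_CONFIDENCE
    )
    return {
        "total": total,
        "high": high,
        "suppressed_correct": suppressed,
        "high_pct": round(100 * high / total) if total else 0,
    }


# Placeholder batch shaped like the first run: 27 high, 5 casual, 1 other.
batch = (
    [Scored("event", 0.9, False)] * 27
    + [Scored("chat", 0.2, True)] * 5
    + [Scored("unclear", 0.5, False)]
)
print(summarize(batch))
```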

What Works Well

  • Calendar/scheduling: Perfect extraction on all 6 test messages. Correctly identifies people, events, times, locations.
  • School/activities: Solid entity extraction. Picks up participants and event types accurately.
  • Casual chat suppression: 5/5 correctly filtered out (low confidence).
  • Errand type detection: Now correctly uses responsible_for instead of has_pet for tasks/objects.

Remaining Issues

  1. Relation direction bugs (2-3 cases): "Sully has soccer practice with Coach Mike" → Sully--[coaches]-->Coach Mike. The edge is reversed: Coach Mike is the coach, so it should read Coach Mike--[coaches]-->Sully. The LLM does not consistently follow the prompt's direction rules.

  2. Spurious spouse_of relation: "Matt's mom coming over" → Mom--[spouse_of]-->Matt. The extractor misses that the possessive "Matt's mom" most likely refers to Matt's mother, not his spouse.

  3. Missing injury semantics: "Sully twisted his ankle" → extracted as attends soccer practice, with no injury signal flagged. The entity-relation model works well for knowledge graphs but is less useful for triggering case creation.
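
One way to catch reversed edges like the one in issue 1 without relying on the prompt is a post-extraction check. The sketch below is illustrative only: the `Triple` type, the `SOURCE_ROLE` rule table, and the `roles` lookup are hypothetical and not part of Thoth or Icarus. It flips an edge only when the target holds the role the relation requires of its source and the source does not.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Hypothetical edge shape: source --[relation]--> target."""
    source: str
    relation: str
    target: str


# Hypothetical rule table: relation -> role the SOURCE entity must hold.
SOURCE_ROLE = {"coaches": "coach"}


def fix_direction(triple, roles):
    """Return the triple, flipped if only the target holds the required role.

    `roles` maps an entity name to the set of roles known for it.
    """
    required = SOURCE_ROLE.get(triple.relation)
    if required is None:
        return triple  # no direction rule for this relation
    source_ok = required in roles.get(triple.source, set())
    target_ok = required in roles.get(triple.target, set())
    if target_ok and not source_ok:
        return Triple(triple.target, triple.relation, triple.source)
    return triple  # already correct, or too ambiguous to flip safely


roles = {"Coach Mike": {"coach"}}
print(fix_direction(Triple("Sully", "coaches", "Coach Mike"), roles))
```

Leaving ambiguous edges untouched keeps the check conservative: it can only correct cases where the role information is unambiguous.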

Next Steps

  • [ ] Let shadow mode run for 1 week alongside production (2026-05-09 → 2026-05-16)
  • [ ] Review accumulated comparison logs before demo day
  • [ ] Fix direction bugs (consider adding few-shot examples to prompt with correct directions)
  • [ ] Add injury-specific relation types for RTSport use case (optional for general Icarus)
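
For the few-shot item, the examples could be kept as data and rendered into the prompt, so that each observed direction bug becomes a regression example. This is a rough sketch under assumptions: the prompt layout is invented for illustration, and `parent_of` is a hypothetical relation name that may not exist in the current schema.

```python
# Each entry: (message, (source, relation, target)) with the CORRECT direction.
FEW_SHOT = [
    ("Sully has soccer practice with Coach Mike",
     ("Coach Mike", "coaches", "Sully")),
    # `parent_of` is a hypothetical relation name, used only for illustration.
    ("Matt's mom coming over",
     ("Mom", "parent_of", "Matt")),
]


def render_few_shot(examples):
    """Render examples as a plain-text block to append to the extraction prompt."""
    lines = ["Examples of correct relation direction (source--[relation]-->target):"]
    for message, (source, relation, target) in examples:
        lines.append(f'Message: "{message}"')
        lines.append(f"Relation: {source}--[{relation}]-->{target}")
    return "\n".join(lines)


print(render_few_shot(FEW_SHOT))
```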

How to Run Comparison Mode

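cd ~/icarus && python3 -c "
from experiments.thoth.adapter import ThothExtractor
from icarus.extractor import extract as prod_extract

message = 'Sully has soccer practice at 5pm'

# Thoth (experimental) extraction for a single message
extractor = ThothExtractor()
print('thoth:', extractor.extract_raw(message))

# Production extraction for comparison.
# Assumption: prod_extract accepts the message text as its only argument.
print('prod: ', prod_extract(message))
"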

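Once both extractors have produced output for the same message, the two results can be diffed to surface disagreements, including the reversed-edge pattern noted under Remaining Issues. The sketch below assumes each result has already been reduced to `(source, relation, target)` tuples. That shape is an assumption, not the documented output of `extract_raw` or the production extractor.

```python
def compare(thoth_triples, prod_triples):
    """Diff two collections of (source, relation, target) tuples."""
    thoth, prod = set(thoth_triples), set(prod_triples)
    thoth_only = thoth - prod
    return {
        "shared": sorted(thoth & prod),
        "thoth_only": sorted(thoth_only),
        "prod_only": sorted(prod - thoth),
        # Thoth edges whose mirror image exists in production: likely direction bugs.
        "reversed": sorted(
            t for t in thoth_only if (t[2], t[1], t[0]) in prod
        ),
    }


# Illustrative inputs only; these are not logged results.
report = compare(
    [("Sully", "coaches", "Coach Mike"), ("Sully", "attends", "soccer practice")],
    [("Coach Mike", "coaches", "Sully"), ("Sully", "attends", "soccer practice")],
)
print(report["reversed"])
```
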
How to Enable Shadow Mode

ICARUS_EXPERIMENT=true ICARUS_EXPERIMENT_NAME=thoth python3 -m icarus.daemon

This runs Thoth alongside production. Results are logged to experiments/thoth/eval_results/. Production is never modified.
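
The shadow pattern described here can be sketched as follows. This is not the actual `icarus.daemon` code: the `handle` function, its callable parameters, and the log record shape are assumptions for illustration. Only the two environment variable names come from the command above. The point is that the production result is computed first and returned unchanged, while any failure in the experiment is logged and swallowed.

```python
import os


def handle(message, prod_extract, shadow_extract, log, env=os.environ):
    """Return the production result; optionally run the experiment on the side.

    prod_extract / shadow_extract: callables taking the message text (assumed).
    log: callable that receives one dict per shadow run (assumed).
    """
    result = prod_extract(message)  # production path, never altered below

    if env.get("ICARUS_EXPERIMENT", "").lower() == "true":
        record = {
            "experiment": env.get("ICARUS_EXPERIMENT_NAME"),
            "message": message,
            "prod": result,
        }
        try:
            record["shadow"] = shadow_extract(message)
        except Exception as exc:  # an experiment failure must not reach production
            record["error"] = repr(exc)
        log(record)

    return result
```

Passing `env` explicitly makes the gate easy to exercise in a test without touching the real environment.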