Session: 2026-04-17 16:56:02 UTC
- Session Key: agent:main:telegram:direct:8386527252
- Session ID: 6c7725da-ac66-45c4-8a97-1dbf3821c260
- Source: telegram
Conversation Summary
assistant: Wait, eval count is 307 but the response is tiny? The thinking tokens are still being counted even with /no_think. Let me try a real test:
assistant: OK, it works, but 9.2s vs 0.8s is a huge gap. The /no_think tag works but qwen3:8b is still doing internal processing. Let me run a proper comparison with /no_think and higher token limits:
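A comparison like this runs against Ollama's `/api/generate` endpoint, which reports `eval_count` and `eval_duration` (in nanoseconds) on each response. A minimal sketch of the throughput math; the payload values are illustrative, not the actual benchmark script:

```python
def tokens_per_second(resp: dict) -> float:
    # Ollama reports eval_duration in nanoseconds; convert to seconds.
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Example /api/generate payload for a /no_think run (qwen3 reads the
# tag from the prompt itself); values here are assumptions.
payload = {
    "model": "qwen3:8b",
    "prompt": "/no_think Extract the appointment as JSON: ...",
    "stream": False,
    "options": {"num_predict": 2048, "temperature": 0},
}

# Sample response fields in the shape Ollama returns them (numbers illustrative):
sample = {"eval_count": 307, "eval_duration": 2_500_000_000}
print(round(tokens_per_second(sample), 1))  # 307 tokens over 2.5s -> 122.8
```

The useful property is that `eval_count` counts every generated token, including hidden thinking tokens, which is exactly why a tiny visible response can still show a large eval count.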
assistant: The verdict is clear:
| Metric | qwen2.5-coder:7b | qwen3:8b |
|---|---|---|
| Latency | 0.5–1.4s | 2.8–18s |
| Speed | 138–153 tok/s | 120 tok/s |
| JSON validity | 4/4 ✅ | 2/4 ❌ |
| Token efficiency | 50–157 tokens | 314–2048 tokens |
qwen3:8b is worse on every metric for our use case. The /no_think tag isn't fully disabling the thinking: qwen3 is burning 500–2000 tokens on internal reasoning even when told not to, then hitting the token limit before completing the JSON output. It's 3–13x slower and fails half the tests.
The problem: qwen3 is designed as a reasoning model first. Even in non-thinking mode, it's doing substantial internal work. For our family assistant tasks (structured JSON extraction, intent parsing), we need fast, reliable, concise output, which is exactly what qwen2.5-coder:7b is optimized for.
Recommendation: Keep qwen2.5-coder:7b as the default parser. qwen3:8b isn't an upgrade for this workload β it's a downgrade. The qwen3 architecture shines on complex reasoning tasks, but our pipeline needs speed and reliability on structured extraction, which is qwen2.5-coder's exact strength.
Want me to clean up the qwen3:8b pull (keep it for occasional reasoning tasks), or remove it to free the 5GB of disk space?
user: System (untrusted): [2026-04-17 16:26:57 UTC] Exec completed (delta-me, code 255) :: ssh: connect to host 100.104.147.116 port 22: Connection timed out
System (untrusted): [2026-04-17 16:28:34 UTC] Exec completed (fresh-ha, code 0) :: {"status":"success"}
System (untrusted): [2026-04-17 16:29:51 UTC] Exec completed (good-dai, code 0) :: qwen2.5-coder:7b vs qwen3:8b ----------------------------------------------------------------- appointment_extract ✅ 0.8s 139/s | ✅ 4.5s 121/s; intent_parse ✅ 0.5s 149/s | ✅ 3.6s 120/s; re…
System (untrusted): [2026-04-17 16:33:54 UTC] Exec completed (faint-fo, code 0) :: 2.8s | qwen2.5 recurring | 1.1s | 17.9s | qwen2.5 conflict | 1.4s | 18.0s | qwen2.5 ================================================================= JSON VALIDITY ===============…
Current time: Friday, April 17th, 2026 - 4:34 PM (UTC) / 2026-04-17 16:34 UTC
assistant: (Async completions already handled inline β no new info to surface.)
user: Conversation info (untrusted metadata):
{
"chat_id": "telegram:8386527252",
"message_id": "2878",
"sender_id": "8386527252",
"sender": "Matt",
"timestamp": "Fri 2026-04-17 16:37 UTC"
}
Sender (untrusted metadata):
{
"label": "Matt (8386527252)",
"id": "8386527252",
"name": "Matt",
"username": "Hoffmann_Matthew"
}
So the only other goal was to evaluate deepseek models locally for some reasoning tasks. Do you want to dive into that model research for our needs and see if that's a better option than the qwen model for "occasional local thinking"?
assistant: Let me check what deepseek options exist on Ollama and what fits our VRAM:
assistant: Interesting: the default deepseek-r1 tag (latest) is now a Qwen3-8B distill, 5.2GB. Same base model as our qwen3:8b but fine-tuned on R1 reasoning data. Let me also check deepseek-r1:14b vs our current phi4:14b:
assistant: Good: here's the lineup that fits our 12GB 3080 Ti:
| Model | Disk | Architecture | Notes |
|---|---|---|---|
| deepseek-r1:8b (latest) | 5.2GB | Qwen3-8B + R1 reasoning distill | Same base as qwen3:8b but R1-tuned |
| deepseek-r1:7b | 4.7GB | Qwen2.5-7B + R1 distill | Older distill, Qwen2.5 base |
| deepseek-r1:14b | 9.0GB | Qwen2-14B + R1 distill | Q4_K_M, same size slot as phi4:14b |
The key insight: deepseek-r1:8b is the updated R1-0528 distill on Qwen3; it literally replaced the old Llama-based 8B. And deepseek-r1:14b is the Qwen2-14B distill, which is what we originally had on the roadmap.
Let me pull both the 8b and 14b and benchmark them against our current lineup. The 8b goes head-to-head with qwen3:8b (same base, R1-tuned vs vanilla). The 14b competes with phi4:14b for the "occasional local thinking" slot.
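A per-task wall-clock cap for this kind of benchmark (the harness here uses a 120s limit) can be sketched with a thread-pool future. This is a hypothetical helper, not the actual harness code:

```python
import concurrent.futures as cf
import time

def run_with_timeout(task, limit_s=120.0):
    """Run one benchmark task under a wall-clock cap.

    Returns (True, result) on completion, (False, "timeout") otherwise.
    NOTE: shutdown(wait=False) abandons the worker thread on timeout,
    so a stuck generation keeps running in the background until it
    finishes on its own -- fine for a sketch, not for production.
    """
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(task)
    try:
        return True, future.result(timeout=limit_s)
    except cf.TimeoutError:
        return False, "timeout"
    finally:
        pool.shutdown(wait=False)

print(run_with_timeout(lambda: "ok", limit_s=1.0))                    # (True, 'ok')
print(run_with_timeout(lambda: time.sleep(0.3) or "late", limit_s=0.05))  # (False, 'timeout')
```

A subprocess-based runner would let you kill the stuck worker outright; the thread version is just the shortest way to show the cap.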
assistant: Both pulled. Now let me run benchmarks comparing deepseek-r1:8b and deepseek-r1:14b against our current reasoning fallback (phi4:14b) on tasks that actually need reasoning:
assistant: The benchmark got killed: deepseek-r1:14b is timing out on conflict_resolve and scheduling_puzzle. It's slower than expected and hitting the 120s limit. Let me get partial results and summarize:
assistant: Benchmark got killed before writing results. But I have enough data from the live output. Here's the breakdown:
| Model | conflict_resolve | rejection_intent | newsletter_dedup | scheduling_puzzle |
|---|---|---|---|---|
| phi4:14b | ✅ 🧠 54.5s | ✅ 🧠 43.5s | ✅ 🧠 14.0s | ✅ 🧠 71.8s |
| deepseek-r1:8b | ❌ timeout | ✅ 🧠 7.4s | ✅ 🧠 3.5s | ❌ timeout |
| deepseek-r1:14b | ❌ timeout (120s+) | ✅ 🧠 37.3s | ❌ 40.6s | ❌ killed |
Observations:
- deepseek-r1:8b: fast when it works (3.5–7.4s), but hits the token limit (2048 tokens) on complex tasks. The reasoning is there, but the output gets truncated, so the JSON is invalid.
- deepseek-r1:14b: even slower than phi4:14b (40s+ on simple tasks) and times out on complex ones. Not viable for real-time use.
- phi4:14b: slow (14–72s) but completes everything reliably, with reasoning.
The problem: Both deepseek-r1 models output extensive chain-of-thought wrapped in `<think>...</think>` tags, then the actual answer. Even with num_predict: 2048, they're burning tokens on internal reasoning and leaving nothing for the structured JSON output.
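The truncation failure mode shows up directly in the parsing layer. A sketch of a tolerant extractor (hypothetical helper, assuming deepseek-r1's `<think>...</think>` output format):

```python
import json
import re

def extract_json(raw: str):
    # Strip the <think>...</think> chain-of-thought block, then parse the
    # first {...} object that remains. A truncated response (the reasoning
    # ate the whole token budget) yields None instead of valid JSON.
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

# Completed response: reasoning stripped, JSON recovered.
print(extract_json('<think>weigh the options...</think>{"intent": "reschedule"}'))
# Truncated response: num_predict ran out mid-reasoning, no JSON ever emitted.
print(extract_json('<think>still reasoning when the budget ran ou'))  # None
```

This is why "fast when it works" and "JSON invalid" coexist in the table above: the model isn't producing malformed JSON, it's producing no JSON at all.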
Verdict: Keep phi4:14b as the local reasoning fallback. It's the only one that reliably completes complex reasoning tasks to valid JSON. The deepseek-r1 models are impressive for chat/open-ended reasoning, but unsuited for our structured extraction pipeline without a much larger token budget (8192+).
Next steps? Remove deepseek-r1:8b and :14b to free 14GB disk, or keep r1:8b for occasional manual use (chat-style, not pipelined)?
user: System (untrusted): [2026-04-17 16:39:00 UTC] Exec completed (tender-t, code 0) :: {"status":"success"}
System (untrusted): [2026-04-17 16:41:04 UTC] Exec completed (tidal-ro, code 0) :: {"status":"success"}
System (untrusted): [2026-04-17 16:54:19 UTC] Exec failed (kind-for, signal SIGTERM) :: =================================== deepseek-r1:14b ============================================================ Warmup deepseek-r1:14b... OK conflict_resolve ERROR: HTTPConnectio…
Conversation info (untrusted metadata):
{
"chat_id": "telegram:8386527252",
"message_id": "2885",
"sender_id": "8386527252",
"sender": "Matt",
"timestamp": "Fri 2026-04-17 16:52 UTC"
}
Sender (untrusted metadata):
{
"label": "Matt (8386527252)",
"id": "8386527252",
"name": "Matt",
"username": "Hoffmann_Matthew"
}
Who wins?
assistant: phi4:14b, and it's not close.
| Metric | phi4:14b | r1:8b | r1:14b |
|---|---|---|---|
| Task completion | 4/4 | 2/4 | 1/4 |
| JSON valid | 4/4 | 2/4 | 1/4 |
| Shows reasoning | 4/4 | 2/4 | 1/4 |
| Speed | 14–72s | 3.5–7s (when it works) | 37s+ / timeout |
| Reliability | ✅ Always completes | ❌ Truncates on complex tasks | ❌ Truncates + timeouts |
The deepseek-r1 models burn their token budget on `<think>` reasoning before any structured output appears.
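The verdict tallies above come from counting per-task outcomes. A sketch of the scoring step (the result tuples and their shape are assumptions, not the harness's actual log format):

```python
def scoreboard(results: dict) -> dict:
    # results: model -> list of (completed, json_valid, shows_reasoning)
    # flags per task, one entry per benchmark task.
    return {
        model: {
            "completed": sum(r[0] for r in runs),
            "json_valid": sum(r[1] for r in runs),
            "reasoning": sum(r[2] for r in runs),
            "total": len(runs),
        }
        for model, runs in results.items()
    }

# Illustrative flags matching the verdict table: phi4 sweeps, r1:8b
# completes two of four tasks.
runs = {
    "phi4:14b": [(1, 1, 1)] * 4,
    "deepseek-r1:8b": [(0, 0, 0), (1, 1, 1), (1, 1, 1), (0, 0, 0)],
}
board = scoreboard(runs)
print(board["phi4:14b"]["completed"], "/", board["phi4:14b"]["total"])  # 4 / 4
```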