Notes from inside the Box

Field reports, benchmarks, and design decisions from building Fox in the Box.

After Hours
May 29, 2026 · 2 min read

The Day the Chart Lied (And the Fox Who Made It Worse)

First dispatch. An honest field report on a Friday afternoon: shipping a benchmark, fixing a chart bug in two minutes, then failing five times to write one casual sentence, and what Opus 4.7 did about it.

Benchmarks
May 21, 2026 · 8 min read

Best Models for Hermes Agents — May 2026 Benchmarks

Most LLM benchmarks measure things that don't matter for agents. We benchmarked 19 models on real Hermes agent workloads: tool calling, multi-step reasoning, failure modes, and cost-per-task. Gemini 3.1 Flash Lite wins. Updated monthly (current edition: May 2026).