Best Models for Hermes Agents — May 2026 Benchmarks

Most LLM benchmarks measure things that do not matter to people running agents — knowledge trivia, vibes, or, worse, MMLU averages from 2023. We benchmarked 19 of OpenRouter’s most-used models on real Hermes workloads: tool calling, multi-step reasoning, failure modes, and cost-per-task. This benchmark runs monthly — bookmark this URL to see which models win as the landscape shifts.

Fox in the Box is a self-hosted, privacy-first AI agent. Every time a user types something, we route to a model that has to read instructions, call tools, sometimes spawn subagents, reason a couple steps ahead, and bail out gracefully when asked to do something unsafe. That is the workload we benchmarked.

What we tested

We pulled the OpenRouter usage leaderboard on May 21, 2026 and took the top 20 models by tokens. One of them — OpenRouter’s stealth model “Owl Alpha” — is only available through a single “stealth” provider that our account’s data-policy guardrails refused, so we benchmarked the remaining 19.

Each model ran the same 25-task suite, five categories, with the same prompts:

5 simple-QA tasks — factual recall, no tools
7 multi-step tool tasks — actual function calls via the OpenRouter tools API, with an agentic loop where the harness runs the tool and the model has to reason on the result
5 delegation tasks — planning and decomposing work across hypothetical workers/subagents
5 reasoning tasks — multi-step logic and trade-off analysis
3 failure-mode tasks — does the model refuse or clarify when it should, instead of confidently doing the wrong thing

Scoring is a composite: 50% pass rate, 20% latency (lower is better, normalised across the field), 15% token efficiency, 15% throughput. The whole run cost about $2.70 across all 19 models and finished in roughly 15 minutes because we ran everything in parallel.

The leaderboard

#	Model	Pass %	Avg latency	Cost (run)	Tokens / task	Score
1	Gemini 3.1 Flash Lite	100%	3.62s	$0.02	702	84.8
2	Gemini 2.5 Flash	100%	5.66s	$0.05	1,021	80.4
3	Gemini 3 Flash Preview	92%	4.18s	$0.03	681	64.6
4	Gemini 2.5 Flash Lite	92%	4.33s	$0.01	1,046	61.0
5	Claude Sonnet 4.6	96%	14.37s	$0.34	1,556	59.5
6	DeepSeek V3.2	92%	14.91s	$0.01	1,043	54.1
7	Claude Opus 4.6	92%	13.76s	$0.51	1,453	50.9
8	DeepSeek V4 Pro	96%	29.44s	$0.03	1,482	50.3
9	GLM 5.1	96%	28.41s	$0.11	1,710	48.8
10	GPT-5.5	88%	12.25s	$0.48	938	46.9
11	DeepSeek V4 Flash	92%	24.39s	$0.01	1,328	45.1
12	Kimi K2.6	92%	24.16s	$0.11	1,540	43.2
13	Claude Opus 4.7	88%	13.21s	$0.52	1,622	39.7
14	Gemini 3.1 Pro Preview	88%	15.75s	$0.40	1,584	38.4
15	MiniMax M2.7	92%	33.97s	$0.04	1,710	35.1
16	gpt-oss-120b	84%	7.73s	$0.01	1,533	34.1
17	Step 3.5 Flash	88%	22.18s	$0.01	1,808	32.0
18	Tencent Hy3 Preview	88%	23.05s	$0.01	2,248	27.2
19	Nemotron 3 Super 120B	80%	24.23s	$0.01	1,021	18.2

“Cost (run)” is the total spend for that model across all 25 tasks. Multiply by ~1,500 to estimate per million tasks at the same workload mix.

Composite score ranking

Score vs Price trade-off

The winner: Gemini 3.1 Flash Lite

Gemini 3.1 Flash Lite swept the field. 100% pass rate across every category. Average latency 3.6 seconds. Two cents to run the entire suite. It is the only model that nailed all five failure-mode tasks while also being the fastest — by a lot — and one of the cheapest. The composite score gap to second place was almost five points.

Gemini 2.5 Flash took second with the same 100% pass rate but at 1.5× the latency and ~3× the cost. The two Flash siblings together make a strong default-router story: route fast/cheap traffic to 3.1 Flash Lite, reserve 2.5 Flash for cases where its slightly meatier token budget matters.

What surprised us

The biggest models lost

Claude Opus 4.7 — Anthropic’s flagship — landed 13th. Gemini 3.1 Pro Preview, Google’s flagship, finished 14th. GPT-5.5 came 10th. On these workloads, the cost and latency of the big models do not buy you accuracy: their pass rates are 88%, while small models hit 92–100%. If the task is “call a tool, reason about the result, do not be confidently wrong” then frontier reasoning capacity is mostly wasted bandwidth.

Failure-mode handling is where most models fall apart

Five models scored 0% or 33% on the refusal/clarification tests: Gemini 3.1 Pro Preview, gpt-oss-120b, GPT-5.5, DeepSeek V4 Flash, and Gemini 3 Flash Preview. They all happily answered ambiguous or unsafe prompts instead of asking for clarification. For an agent that takes tool actions, that is the single most expensive failure mode — the model commits to the wrong path and then executes it. This is the category that knocked Gemini 3.1 Pro Preview down to 14th place despite a 100% pass rate on every other category.

DeepSeek V4 Flash is absurdly cheap

$0.006 to run all 25 tasks. 92% pass rate. Per-task economics: about $0.00025 per passed task — roughly 3× cheaper than Gemini 3.1 Flash Lite, the overall winner. It only loses on speed (24s avg vs 3.6s) and on failure-mode handling. For a pure batch workload where you don’t care about latency, this is the best dollar-per-pass model in the field.

Tencent Hy3, the #1 model by tokens on OpenRouter, finished 18th

The most-used model on OpenRouter is Tencent Hy3 Preview, primarily because its prices are essentially free. On our suite it was slow, verbose (2,248 tokens per task — the most of any model), and only managed 40% on reasoning. Cheap traffic is not the same as good traffic.

Open-weight gets you most of the way

DeepSeek V4 Pro, DeepSeek V4 Flash, GLM 5.1, Kimi K2.6, and gpt-oss-120b together cover the bulk of the cheap, capable open-weight market. All hit 84–96% pass rates. If you can stomach 1–3× the latency of Gemini Flash, you can replace most agent traffic with self-hosted weights and a handful of cents.

Open-weight latency vs. Gemini 3.1 Flash Lite reference

How we’ll use this

These numbers inform our next router update, but we are not shipping changes based on benchmarks alone. We are now running the top performers against real production usage — watching how the scores hold up against actual user sessions before we move anything in the suggested stack.

For privacy-sensitive workloads, the open-weight winners (DeepSeek V4 Pro, GLM 5.1, gpt-oss-120b) all run on Fox’s self-hosted backend. You get 90%+ pass rates with your data never leaving your hardware.

Caveats

25 tasks is a sample, not a definitive ranking. Treat the scores within ±5 points as roughly tied.
Composite weights are our weights for our workload. If your workflow is read-only Q&A you would weight pass rate higher and latency lower.
Pass-rate evaluators are heuristic. They check for ground-truth substrings, tool-call structure, delegation vocabulary, reasoning markers — not subjective answer quality. A model can game some heuristics, but across 25 tasks the noise mostly washes out.
We ran each model once per task. Variance per call is real. We will re-run monthly as new models ship.

Reproduce it

The full harness — task definitions, scoring code, and raw outputs — lives on GitHub at https://github.com/fox-in-the-box-ai/hermes-best-models. Clone the repo, install the requirements, point it at your OpenRouter API key, and you are running the same 25-task suite against any model slug. The evaluation script is self-contained: 25 task definitions, deterministic heuristic scorers, parallel execution across all models. If you want to add your own tasks or swap in a different provider, the config file takes five minutes to edit.

We will keep adding models and tasks monthly and publishing updates here.

Try it

Fox in the Box wraps Hermes Agent in a ready-to-run container — web UI included, no terminal needed once it’s up. Tailscale is built in, so you can reach your Fox from any device without exposing ports. Everything runs in an isolated Docker container: your system stays untouched, your data stays on your machine. Drop in an API key from your preferred provider (or run fully local with Ollama) and you’re up in minutes. Pick Windows, macOS, or Linux — one command and the Fox is running. Deploy now.