If you found this from the HLE leaderboard, here’s the headline:
1,119 correct out of 2,158 on the canonical text-only set. Judged with CAIS’s official o3-mini-2025-01-31 methodology, bit-identical to the published evaluation script. Single workstation. Per-question cost: ~$1.60.
No GPU cluster. No proprietary fine-tuning. The agent uses Anthropic’s Claude Opus 4.7 through their public APIs, wrapped in a governance framework I’ve spent the last year developing called FF-STACK. That governance work is the foundation; the multi-agent orchestration on top is what turned it into a competitive benchmark score.
This post is the HLE writeup. It covers the architecture, where the score came from, and where the next version goes. The methodology paper has the deeper technical detail.
All of this is possible because of a research program that started long before HLE: a year of cross-architecture behavioral research that produced the governance framework as a byproduct, along with several other things I’ve been building in parallel. If you want that origin story, plus the broader research behind the agent, I’ve put it in a companion post.
Why HLE
My research was always meant to be exploratory and broader than performance alone. It is primarily AI behavioral research: how models reason, where they fail, how they respond to structure, and whether those patterns hold across architectures. But I also wanted to test the work in the same arena where frontier systems are usually compared: standardized benchmarks and agentic evaluation setups. So I built a local agent and tested it on HLE, one of the least-saturated and most discriminating public benchmarks for frontier LLM reasoning.
Most of the underlying research was conducted at the inference layer in standard chat interfaces, including cross-architecture stack testing between Claude, Gemini, and GPT. Using the research base I already had, I built and tuned a local agent in about six weeks, on a limited budget, and got the results above.
I chose HLE because it still has meaningful headroom. Many older benchmarks no longer separate frontier models well: MMLU now sits at 93% accuracy, GPQA at 94%, and HellaSwag and HumanEval are pinned near ceiling. HLE was designed in response to that problem, with expert-level questions across broad academic domains and much lower frontier-model accuracy. That makes it useful for testing whether a method produces real lift rather than noise. On HLE, the strongest vanilla models score in the high 30s to low 40s, while agentic systems cluster around the 50s. That is the kind of gap where an inference-layer architecture can actually move the needle.
The caveat is that HLE is still a benchmark. It grades absolute correctness against a frozen set of questions. It doesn’t measure self-correction, doesn’t penalize hallucinated compliance, and decays the moment a frontier model trains on the public set. The companion post goes deeper on these limitations.
That is also why I have been building Crucible, a custom evaluation framework that emerged as another byproduct of the research. The saturation problem felt structural, not incidental. Crucible is designed to test the behaviors that standard benchmarks often miss: how models handle ambiguity, revise under pressure, weigh conflicting evidence, and recover from failure. A useful surprise was that Crucible’s empirical lifts on my own evaluation questions tracked the HLE lifts within a few points across architectures, which gave my testing methodology more signal when the numbers came in.
To ground these claims: HLE’s own paper frames it as a response to benchmark saturation, noting that LLMs now exceed 90% accuracy on benchmarks such as MMLU, which limits their usefulness for measuring frontier capability. Scale’s HLE leaderboard describes MMLU and GPQA as formerly frontier benchmarks whose saturation makes them weaker signals of current progress.
Finally, the submission itself is to the HuggingFace Zoom AI leaderboard for Agents with Tools, the HLE track for tool-enabled agentic systems. That is the right peer group for FF-STACK v8.
The Pipeline
Each HLE question runs through a four-stage pipeline. The shape matters because the lift comes from how the stages compose, not from any single component:
Sentinel (Haiku domain classifier)
│
▼
Forge (pre-research, vanilla-API knowledge primer)
│
▼
Solver(s): 2-3 legs per question, including cross-architectural pairing on most domains
│
▼
Arbiter (Opus 4.7 + full governance + tools)
│
▼
Final answer + provenance trace
Sentinel is a lightweight classifier (Haiku-class model) that picks the question’s HLE domain. Cheap and fast.
Forge is a pre-research stage that fires domain-specific lookups against free APIs (Wikipedia, PubChem, UniProt, NCBI) before the heavy reasoning starts. Runs vanilla. No governance overhead. Its job is to seed the solvers with accurate context so they don’t burn rounds discovering what’s already documented.
The solver loop is where governance lives. Each solver leg uses Claude Opus 4.7 (in one of three effort modes: high, xhigh, or high with adaptive thinking) wrapped in the FF-STACK governance prompt and given access to 18 tools. The toolset combines standard tools (Brave web_search, web_fetch, a Python compute sandbox with a hard kill, Wolfram Alpha, Wikipedia) with some custom ones. Vanilla GPT-5.4 reasoning=high is also used to bring in different training data, though GPT hasn’t been tuned for the stack yet.
For most domains, two solver legs run in parallel, frequently using cross-architectural pairing (one FF Opus, one vanilla GPT-5.4 reasoning=high). On harder domains, a third leg joins. On agreement at the pair stage, the pipeline commits. On disagreement, all reasoning chains flow into the arbiter.
The arbiter is always FF Opus 4.7 with full governance, full tools, and the Forge context. It reads the disagreeing chains, runs independent verification, and commits a final answer.
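The commit rule described above (agree at the pair stage and commit, otherwise send every chain to the arbiter) can be sketched in a few lines. This is a minimal illustration of the box-level flow only, not the FF-STACK orchestration code. The names `LegResult`, `normalize`, `all_agree`, and `resolve` are hypothetical, the arbiter is passed in as a plain callable, and the string-normalizing comparison is a crude stand-in for whatever semantic comparison the real pipeline performs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LegResult:
    leg: str        # hypothetical label for a solver leg, e.g. "ff-opus"
    answer: str     # the answer that leg committed to
    reasoning: str  # the reasoning chain, forwarded to the arbiter on disagreement


def normalize(answer: str) -> str:
    """Stand-in for semantic comparison: ignore case and extra whitespace."""
    return " ".join(answer.lower().split())


def all_agree(results: List[LegResult]) -> bool:
    """True when every leg's normalized answer is the same."""
    return len({normalize(r.answer) for r in results}) == 1


def resolve(
    results: List[LegResult],
    arbiter: Callable[[List[LegResult]], str],
) -> Dict[str, object]:
    """Commit on agreement; otherwise hand all chains to the arbiter."""
    if not results:
        raise ValueError("need at least one solver leg")
    legs = [r.leg for r in results]
    if all_agree(results):
        return {"answer": results[0].answer, "decided_by": "agreement", "legs": legs}
    return {"answer": arbiter(results), "decided_by": "arbiter", "legs": legs}


# Usage with two made-up legs and a placeholder arbiter.
pair = [
    LegResult("ff-opus", "42", "chain A"),
    LegResult("vanilla-gpt", " 42 ", "chain B"),
]
print(resolve(pair, arbiter=lambda chains: chains[0].answer)["decided_by"])
```

The returned dictionary keeps the list of legs alongside the answer, which is one simple way to carry a provenance trace through to the final output.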
The per-domain pair selection, third-leg triggers, self-review pass conditions, broken-leg detection logic, refusal-bypass priority order, and governance prompt content are not specified in this document. The box-level architecture above is the reproducible surface; the orchestration logic and governance content are the proprietary substance.
What is specified openly: the models. The pipeline uses Opus 4.7 in three effort modes (high, xhigh, high-with-adaptive-thinking), all governed, plus vanilla GPT-5.4 reasoning=high as the cross-architectural leg. Haiku 4.5 handles classification and semantic comparison. OpenAI’s o4-mini powers the thinking-as-tool callable. CAIS’s o3-mini does the final judging.
The Result Surface
Headline: 1,119 / 2,158 = 51.85% on canonical o3-mini judging.
Per-domain
| Domain | N | Score |
|---|---|---|
| Math | 956 | 60.4% |
| Chemistry | 110 | 55.5% |
| Humanities | 263 | 53.2% |
| Other | 72 | 47.2% |
| Computer Science | 285 | 43.9% |
| Physics | 207 | 42.5% |
| Biology | 196 | 36.7% |
| Engineering | 64 | 34.4% |
Math carries the score: 956 questions at 60% deliver roughly half the total correct answers. Biology and Engineering at 34-37% are the structural ceiling; both are knowledge-heavy domains where verification tools help less than they do on Math or Chemistry. Lifting the knowledge-heavy domains further will require more cross-model coverage.
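The "roughly half" figure can be checked directly from the per-domain table. The sketch below recomputes estimated correct answers per domain from the printed N and percentage columns; because the percentages are rounded to one decimal, the per-domain counts are estimates rather than exact tallies. Variable names are illustrative.

```python
# (N, score %) per domain, copied from the per-domain table.
domains = {
    "Math": (956, 60.4),
    "Chemistry": (110, 55.5),
    "Humanities": (263, 53.2),
    "Other": (72, 47.2),
    "Computer Science": (285, 43.9),
    "Physics": (207, 42.5),
    "Biology": (196, 36.7),
    "Engineering": (64, 34.4),
}

# Estimated correct answers per domain (rounded percentages, so approximate).
estimated_correct = {name: n * pct / 100 for name, (n, pct) in domains.items()}

total_correct = sum(estimated_correct.values())
total_n = sum(n for n, _ in domains.values())
math_share = estimated_correct["Math"] / total_correct

print(f"estimated correct, all domains: {total_correct:.0f}")
print(f"Math share of correct answers:  {math_share:.1%}")
print(f"sum of the N column:            {total_n}")
```

The estimated total comes out at about 1,119, matching the headline numerator, with Math contributing a little over half. The N column as printed sums to 2,153, five fewer than the 2,158 headline denominator; the sketch reports that difference and does not try to explain it.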
Cross-generation ablation
The 51.85% on full HLE is one run on one config. The richer question is how the architecture compares to vanilla baselines on the same questions, and what happens when the same methodology runs on the previous model generation. Testing started on Opus 4.6 non-thinking, where the majority of the time and budget went. I was about to run the full 4.6 when 4.7 dropped, so I pivoted to tune for that.
The comparison sample is a representative subset that tracks full-HLE behavior within about 1pp (V8 here = 53%, V8 on full HLE = 51.85%, a 1.15pp gap). Every cell below is on the same sample, judged by the same canonical o3-mini judge.
(Three smaller sample sets show up across this writeup, all designed as representative spreads across HLE’s eight domains: a 100Q calibration sample at 57%, this 100Q purpose-hard holdout at 53%, and a 200Q fresh holdout at 55%. The score range between them isn’t really about sample size. It reflects how hard the spread is to calibrate at higher score levels: when more of the discrimination is happening at the hard end of the distribution, small variation in that spread changes the headline score meaningfully. Across the 400 questions in these three samples, the average is 55%, close to the full HLE result of 51.85%.)
| Config | Score | Notes |
|---|---|---|
| Vanilla Opus 4.6 | 18% | bare API |
| Vanilla Opus 4.7 high | 29-33% | bare API |
| Vanilla Opus 4.7 xhigh | 30-33% | bare API |
| Vanilla GPT-5.4 reasoning=high | 35% | bare API |
| Vanilla Opus 4.7 thinking-high | 38% | bare API |
| FF-STACK on Opus 4.6 | 45% | full FF-STACK on Opus 4.6, the 4.6-era gold |
| FF-STACK single-model on Opus 4.7 high | 44% | full FF-STACK + high effort |
| FF-STACK single-model on Opus 4.7 thinking-high | 46% | best 4.7 single-model |
| V8 cross-arch (full pipeline) | 53% | submission config, about 1pp above full HLE |
Points worth noting:
- The vanilla baseline is consistent with third-party numbers. Scale AI’s published Opus 4.6 non-thinking score of 19% sits within sampling noise of the 18% I measured on this subset.
- The 4.6 governed config outscored vanilla base by 27 points (45% vs 18% on this sample). That’s the single largest config-level lift in the research log. The same lift-to-lift comparison holds a generation back: Anthropic’s published Opus 4.6 numbers go from 40.0% without tools to 53.1% with their agent layer (llm-stats), a ~13pp lift. So on the previous generation FF-STACK’s +27pp governed lift was roughly double Anthropic’s agent lift on the same model, each measured over its own baseline.
- Single-model FF-STACK lift on Opus 4.7 was ~10pp per mode, with the 4.7 high config at 44% (vs 29-33% vanilla = +11-15pp) and the 4.7 thinking-high config at 46% (vs 38% vanilla = +8pp). Still meaningful lift, but materially compressed from 4.6’s +27pp. That compression on individual modes pushed me toward pair-vote, cross-architectural verification, and other multi-leg orchestration patterns, eventually producing the multi-agent V8 config above.
- Two kinds of comparison, doing different jobs. The vanilla rows show where the lift comes from (mechanism attribution); on this sample that’s +15pp over the strongest vanilla 4.7. The lift over vanilla on its own isn’t a fair comparison against other systems, though. The fair comparison is how FF-STACK does against other agent lifts. One comparison, with caveats, is Anthropic: they publish both ends for Opus 4.7 (summarized at llm-stats), 46.9% without tools and 54.7% with their agent layer, an 8pp lift. Lift gets harder to find as the base approaches the SOTA ceiling, and granting that, FF-STACK is putting a ~15pp bump on the strongest base model it uses (38% vanilla → 53% on this sample). Outside a few vanilla GPT calls, the submission config is almost entirely Opus, so it stays close to a like-for-like Opus comparison. On absolute score, FF-STACK’s 51.85% on the full text-only set sits about 3pp below that 54.7% (part of the gap is the denominator: Anthropic’s full set includes ~14% multimodal questions this submission doesn’t cover). The methodology paper has the full peer comparison, the per-mode lift figures, and the two-anchor breakdown.
- Another peer comparison: Zoom AI. Zoom submitted to this same HLE leaderboard with a “federated” agent that orchestrates multiple frontier models (GPT-5/GPT-5.2 and Gemini 3 Pro Preview). They posted both a full-set score (53.0%) and a text-only score (55.2%), both on Hugging Face. On the text-only set, FF-STACK’s 51.85% lands a few points behind an enterprise system.
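The lift figures in the points above are simple differences in percentage points, each measured over its own baseline. A short sketch that recomputes them from the numbers quoted in this section, plus the pooled average of the three sample sets, is below. The helper `lift_pp` and the variable names are illustrative; the inputs are the scores as stated in the text and table.

```python
def lift_pp(governed: float, baseline: float) -> float:
    """Lift in percentage points, rounded to one decimal."""
    return round(governed - baseline, 1)


# FF-STACK lifts on the comparison sample.
ff_46_lift = lift_pp(45, 18)   # FF-STACK on Opus 4.6 vs vanilla Opus 4.6
ff_v8_lift = lift_pp(53, 38)   # V8 pipeline vs strongest vanilla Opus 4.7

# Anthropic's published without-tools and with-agent-layer figures.
anthropic_46_lift = lift_pp(53.1, 40.0)
anthropic_47_lift = lift_pp(54.7, 46.9)

# "Roughly double" on the previous generation.
ratio_46 = ff_46_lift / anthropic_46_lift

# Pooled average over the three sample sets: (questions, score %).
samples = [(100, 57), (100, 53), (200, 55)]
pooled_correct = sum(n * pct // 100 for n, pct in samples)
pooled_n = sum(n for n, _ in samples)
pooled_pct = 100 * pooled_correct / pooled_n

print(ff_46_lift, ff_v8_lift, anthropic_46_lift, anthropic_47_lift)
print(f"4.6 lift ratio: {ratio_46:.2f}x, pooled sample average: {pooled_pct:.0f}%")
```

Note that the FF-STACK lifts are measured on a 100-question sample while the Anthropic lifts come from their published full-set figures, so the ratio is a rough comparison rather than a matched one.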
Cost
| | This work | Typical published agent-with-tools |
|---|---|---|
| Per question | ~$1.60 | $5-$15 |
| Total full HLE | ~$3,500 | $11k-$32k |
| Hardware | One workstation, 6 worker threads | Often described as multi-stage pipelines with specialized vision routing and per-Q-type ensembles |
The single-model version of the agent (no pair-vote, no arbiter) costs roughly $0.80/Q and scores in the high-40s. The multi-agent v8 config doubles that for the lift to 51.85%. The “typical published agent-with-tools” figures are estimated from public architecture descriptions, not normalized cost comparisons.
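The totals in the cost table follow from multiplying the per-question figures by the 2,158 questions in the text-only set. A minimal sketch of that arithmetic is below; the per-question figures are the approximate values quoted above, and the function and variable names are illustrative.

```python
QUESTIONS = 2158  # text-only HLE set size used in this writeup


def full_run_cost(per_question_usd: float) -> float:
    """Approximate cost of one full pass at a given per-question cost."""
    return QUESTIONS * per_question_usd


this_work = full_run_cost(1.60)     # multi-agent config, ~$1.60/Q
typical_low = full_run_cost(5)      # low end of the estimated typical range
typical_high = full_run_cost(15)    # high end of the estimated typical range
multi_vs_single = 1.60 / 0.80       # multi-agent cost relative to single-model

print(f"this work:     ~${this_work:,.0f}")
print(f"typical range: ~${typical_low:,.0f} to ~${typical_high:,.0f}")
print(f"multi-agent vs single-model cost: {multi_vs_single:.1f}x")
```

These reproduce the rounded table entries (~$3,500 and $11k-$32k). The typical-range inputs are the estimates described above, not measured costs.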
The budget was a real constraint. This is why the cross-architecture comparison uses vanilla model runs rather than stack-tuned variants, and why higher-end models are not yet in the configuration. But keeping the system close to baseline API cost was also intentional. I wanted to know whether the method produced lift under conditions that other people could actually reproduce.
That matters because performance gains are less informative when they depend on specialized fine-tunes, proprietary models, or heavy per-question compute. Those gains may be real, but they are harder to separate from the infrastructure that produced them. If a method is real and architecture-agnostic, it should produce lift without making the economics impractical.
In this run, most of the per-question cost came from standard model inference and tool-use rounds, the same loop any agent-with-tools system already runs. The FF-STACK governance layer added very little overhead: the prompt was cached, and the routing logic was just dispatch. The lift was not bought with compute. And if the budget was the ceiling, then the score also points to meaningful headroom.
Caveats
A handful of things worth flagging up front, all covered in more detail in the methodology paper:
- Text-only. This submission excludes HLE’s ~14% multimodal questions. The vision adapter is built but not yet through a full evaluation run.
- Single run, not averaged. The 51.85% is one full-HLE pass, not a multi-run mean. Calibration replays show same-sample variance compresses to near zero at the ensemble level, but a 3-run mean on the full set would tighten the central estimate. Candidate work for the next submission cycle.
- Stated-confidence overconfidence. The agent’s self-reported confidence numbers are systematically inflated (~22-30pp above true accuracy on the high-confidence buckets). The ranking is correct (90+ buckets really are more accurate than 50-69 buckets) but the absolute values shouldn’t be read at face value. No calibration tuning was done on this submission; it’s noted for the next cycle. The methodology paper goes into why this is a different thing from the agent’s research-mode evidence discipline.
- Bio/Chem safety-refusal pattern. Around 3% of Bio/Chem questions trigger Anthropic’s content filter at the architecture level. The direct-commit bypass rescues most of them; the rest hit the next limitation.
- Five unrecovered questions. Out of 2,158 text-only questions, five hit deterministic refusal patterns the bypass couldn’t rescue. They’re scored as incorrect (in the 2,158 denominator, contributing zero to the 1,119 numerator). Max upside if all five had been recovered and correct: +0.23pp.
That’s the honest picture. The methodology paper has the full disclosures including the URL blacklist policy and the answer-format preamble that suppresses refusal cascades.
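The +0.23pp bound for the five unrecovered questions is a best-case calculation: it assumes all five would have been answered correctly had they not been refused. A minimal sketch of that arithmetic, with illustrative names:

```python
CORRECT = 1119      # headline numerator
TOTAL = 2158        # text-only denominator
UNRECOVERED = 5     # questions scored incorrect after unrecovered refusals

score_pct = 100 * CORRECT / TOTAL
best_case_pct = 100 * (CORRECT + UNRECOVERED) / TOTAL
max_upside_pp = best_case_pct - score_pct

print(f"reported score: {score_pct:.2f}%")
print(f"best case:      {best_case_pct:.2f}%")
print(f"max upside:     +{max_upside_pp:.2f}pp")
```

Because the five questions stay in the denominator either way, the upside is just 5 / 2,158 expressed in percentage points.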
What I Built First, Then Tuned
HLE wasn’t the origin of this work. It was a late stress test. Most of what made the score possible came first: roughly a year of cross-architecture behavioral research conducted in plain-text agents, which produced the FF-STACK governance framework and, in parallel, a custom evaluation methodology (Crucible); then Cade (the local agent), built on that framework for my own use. Only after all of that came the HLE-specific tuning: about six weeks of evening work, total, spread across two model generations.
That distinction matters. The HLE result wasn’t the product of a year spent optimizing against one benchmark. It was a downstream test of a framework that already existed. The timeline below covers that final phase.
- Days: built the local agent (“Cade,” short for Cadence) on top of the existing FF-STACK governance framework. Eighteen tools, claim tracking, evidence-tier discipline, an independent adversarial-review pass on every substantive answer. This part wasn’t built for HLE; it was built as a research assistant for my own work.
- ~3 weeks (4.6 era): tuning Claude Opus 4.6 with the governance stack went from a +5pp lift over vanilla 4.6 on hard reasoning questions to a +27pp lift (45% vs vanilla 4.6 at 18% on a calibration set). That’s the single largest config lift in the entire research log.
- 4.7 launched: initial vanilla scores moved up; the +27pp playbook didn’t transfer cleanly. Most 4.6 governance assumptions (temperature controls, round caps, thinking-mode behavior) had to be reworked.
- ~1.5 weeks (4.7 mode-tuning): per-effort-mode pipelines tuned, governance prompt re-fitted, output budget recalibrated for 4.7’s larger token inflation. Lift recovered to ~+10pp per mode.
- ~2 weeks (multi-agent build): cross-architectural pair-vote with GPT-5.4 added on most domains, arbiter introduced with the simple-arbiter prompt, direct-commit bypass for multi-leg failures. Calibration moved to 54-57%. Full-HLE single run landed at 51.85%, within the predicted variance band.
The reason I’m laying that out is that the lift is not an artifact of slow accumulation. The methodology produced two cycles of large gains in roughly six weeks of evening tuning total, on two different model generations, with the same underlying research framework. That’s the part that interests me. The score is downstream.
What’s Next
Three lines of work are already queued. The first two are wired and partially smoke-tested. The third is infrastructure-ready.
- Further model tuning. Two directions here. The first is deepening the work on Opus 4.7 itself: most of the stack tuning happened on 4.6, and the 4.7 single-model configs are comparatively immature, so per-mode tuning and further study on 4.7 is an underexplored lever that could close or invert the 4.6/4.7 gap on its own. The second is adding more capable models, including GPT Pro and Gemini Pro, currently paused because of cost, not infrastructure.
- Vision extension. A vision adapter is already built. Early image-question tests show lift from governance alone, but the vision path needs more development before it is ready for a full evaluation run.
- Cross-architecture governance. The cleanest path to raising the multi-agent ceiling is applying FF-STACK governance across all model legs, not just Opus. The framework produced meaningful lift on both Anthropic generations: +27 points in the 4.6 configuration, then roughly +15 points on 4.7 after 4.7-specific tuning. That suggests the framework is portable, but full per-architecture tuning is still incomplete. The infrastructure for governed GPT and Gemini runs already exists. The missing piece is the tuning cycle for each architecture. A governance-on-all-legs configuration is the clearest next step toward pushing the multi-agent ceiling materially past 55%.
There’s a longer roadmap behind this: automating the manual tuning flywheel through a multi-agent research environment, moving evaluation beyond binary scoring, and productizing inference-layer governance for consumer use. But those are the research-side moves, and they belong in the companion post.
A note on the disclosure level
Publishing this much of the methodology and config is probably atypical for a leaderboard submission. Most submitters share the headline, a model card, maybe a blog post. The detailed stuff (filtering audit basis, exact preambles, calibration buckets, ablation tables) usually stays between the submitter and the maintainers, or doesn’t get written down at all.
I went the other way because I don’t really see the point of benchmarks and scores if no one’s honest about how they got them. When I was trying to figure out where my own work sat in the landscape, the hardest part was finding out what counted as a good score on HLE and what other setups actually looked like under the hood. Most descriptions were vague enough that I couldn’t benchmark myself against them with any confidence. If this writeup is useful to someone else trying to do the same, that’s reason enough.
The bigger thing, though: this was never really about the test. It was about reaching a point in the research where I felt ready to put it out in the world and have other people poke at it. The benchmark gave me a concrete thing to write everything down around. The transparency is what makes it actually open for collaboration.
Bridge
If you came from the leaderboard and made it this far, this is the part that matters more than the score: the agent was built quickly because the methodology underneath it had already been running for almost a year.
The governance framework, pair-vote pattern, arbiter, failure-mode taxonomy, evidence-tier system, and claim ledger were not designed in advance as a product roadmap. They were observed in cross-architecture dialogue, encoded as text, tested against new models, codified into Python, and integrated into the agent.
That cycle is the renewable advantage.
The HLE score is one artifact of it.
If that sounds interesting, the companion post covers the research program: how the patterns were discovered, what other products came out of the same work (Crucible for evaluation, Foundry for multi-agent research pipelines), and what the bigger picture looks like.
Eugene Dvorochkin is an independent AI researcher and the creator of Fieldframe Labs, mapping and structuring behavioral patterns in LLMs since May 2025.
Methodology paper for the HLE submission lives here. A scrubbed predictions file is on GitHub; the raw file is available on request. Contact: edvorochkin@gmail.com.