If you found this from the HLE leaderboard, here’s the headline:
1,119 correct out of 2,158 on the canonical text-only set, judged with CAIS’s official o3-mini-2025-01-31 methodology, bit-identical to the published evaluation script. Single workstation. Per-question cost: ~$1.60.
No GPU cluster. No proprietary fine-tuning. The agent uses Anthropic’s Claude Opus 4.7 through their public APIs, wrapped in a governance framework I’ve spent the last year developing called FF-STACK. The framework is the load-bearing part — everything else is orchestration.
This post is the HLE writeup. It covers what the architecture looks like, where the score came from, what’s calibrated and what isn’t, and what the next version looks like. You can also check out the submission post, which includes more technical detail.
All of this is possible because of a research program that started long before HLE — a year of cross-architecture behavior research that produced the governance framework as a byproduct, along with several other things I’ve been building in parallel. If you want that origin story, plus the broader research behind the agent, I’ve put it in a companion post.
Why HLE
Over the course of my research I’ve observed and developed some interesting things, but I wanted to test them on the ground everyone else competes on: standardized benchmarks and agentic setups. My work is primarily AI behavioral research and broader in scope than raw “performance,” so building a local agent and running it against one of the most discriminating public benchmarks available seemed like the cleanest way to show what the research can do. As the research overview post explains, most of that work happened at the inference layer in normal chat interfaces, including cross-architecture stack testing on Gemini and GPT models during that period. With the research I already had, I was able to build and tune a local agent in a month, on a budget, and reach these results.
I picked HLE because it has the most discriminating headroom of any public benchmark right now. MMLU is 93% saturated. GPQA is at 94%. HellaSwag and HumanEval are pinned near ceiling. On HLE, the strongest vanilla models sit in the high 30s to low 40s and agent systems cluster in the 50s; that’s the kind of headroom where architecture work can actually move a number.
The honest caveat: HLE still grades on absolute correctness against a frozen 2,158-question text-only snapshot. It doesn’t measure self-correction, doesn’t penalize hallucinated compliance, and decays the moment a frontier model trains on the public set. The companion post goes deeper on this — I’ve been building a custom evaluation framework called Crucible as another byproduct of my research, because the saturation problem felt structural. Crucible’s empirical lifts on my own evaluation questions track the HLE lifts within a few points across architectures, which was a useful surprise when the HLE numbers came in.
The Pipeline
Each HLE question runs through a four-stage pipeline. The shape matters because the lift comes from how the stages compose, not from any single component:
Sentinel (Haiku domain classifier)
│
▼
Forge (pre-research, vanilla-API knowledge primer)
│
▼
Solver(s) — 2-3 legs per question, including cross-architectural pairing on most domains
│
▼
Arbiter (Opus 4.7 + full governance + tools)
│
▼
Final answer + provenance trace
Sentinel is a lightweight classifier (Haiku-class model) that picks the question’s HLE domain. Cheap and fast.
Forge is a pre-research stage that fires domain-specific lookups against free APIs (Wikipedia, PubChem, UniProt, NCBI) before the heavy reasoning starts. Runs vanilla — no governance overhead. Its job is to seed the solvers with accurate context so they don’t burn rounds discovering what’s already documented.
The solver loop is where governance lives. Each solver leg uses Claude Opus 4.7 (in one of three effort modes — high, xhigh, or high with adaptive thinking) wrapped in the FF-STACK governance prompt and given access to 18 tools: standard ones (Brave web_search, web_fetch, a Python compute sandbox with a hard kill, Wolfram Alpha, Wikipedia) plus some custom ones. A vanilla GPT-5.4 reasoning=high leg is also used, to draw on different training data, though GPT has not yet been tuned for the stack.
For most domains, two solver legs run in parallel — frequently using cross-architectural pairing (one FF Opus, one vanilla GPT-5.4 reasoning=high). On harder domains, a third leg joins. On agreement at the pair stage, the pipeline commits. On disagreement, all reasoning chains flow into the arbiter.
The arbiter is always FF Opus 4.7 with full governance, full tools, and the Forge context. It reads the disagreeing chains, runs independent verification, and commits a final answer.
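To make the flow concrete, here is a minimal, hypothetical sketch of the box-level pipeline described above. The real per-domain routing, governance prompts, and trigger logic are not public, so every stage is passed in as a plain callable and every name here is an illustrative stand-in, not the actual code:

```python
from typing import Callable, List

def answer_question(
    question: str,
    classify: Callable[[str], str],                        # Sentinel
    pre_research: Callable[[str, str], str],               # Forge
    solvers: Callable[[str], List[Callable[[str, str], str]]],  # solver legs
    arbitrate: Callable[[str, str, list], str],            # governed arbiter
) -> dict:
    domain = classify(question)                  # pick the HLE domain (cheap)
    context = pre_research(question, domain)     # seed legs with lookups
    legs = [solve(question, context) for solve in solvers(domain)]
    if len(set(legs)) == 1:
        # All legs agree at the pair stage: commit without the arbiter.
        return {"answer": legs[0], "arbitrated": False}
    # Disagreement: every reasoning chain flows to the arbiter for a verdict.
    return {"answer": arbitrate(question, context, legs), "arbitrated": True}
```

The design choice the sketch captures is the cost asymmetry: the arbiter (the most expensive call) only runs on disagreement, so easy questions settle at pair-commit price.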
The per-domain pair selection, third-leg triggers, self-review pass conditions, broken-leg detection logic, refusal-bypass priority order, and governance prompt content are not specified in this document. The box-level architecture above is the reproducible surface; the orchestration logic and governance content are the proprietary substance.
What is specified openly: the models. The pipeline uses Opus 4.7 in three effort modes (high, xhigh, high-with-adaptive-thinking) — all governed — plus vanilla GPT-5.4 reasoning=high as the cross-architectural leg. Haiku 4.5 handles classification and semantic comparison. OpenAI’s o4-mini powers the thinking-as-tool callable. CAIS’s o3-mini does the final judging.
The Result Surface
Headline: 1,119 / 2,158 = 51.85% on canonical o3-mini judging.
Per-domain
| Domain | N | Score |
|---|---|---|
| Math | 956 | 60.4% |
| Chemistry | 110 | 55.5% |
| Humanities | 263 | 53.2% |
| Other | 72 | 47.2% |
| Computer Science | 285 | 43.9% |
| Physics | 207 | 42.5% |
| Biology | 196 | 36.7% |
| Engineering | 64 | 34.4% |
Math carries the score — 956 questions at 60% delivers roughly half the total correct answers. Biology and Engineering at 34-37% are the structural ceiling; both are knowledge-heavy domains where verification tools help less than they do on Math or Chemistry. Lifting the knowledge-heavy domains will likely require folding in additional models.
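As a sanity check, rounding each domain’s N × score from the table recovers the headline total and quantifies how much Math carries (all numbers are straight from the table; the rounding step is mine):

```python
# Per-domain (N, score) pairs from the results table.
domains = {
    "Math": (956, 0.604), "Chemistry": (110, 0.555),
    "Humanities": (263, 0.532), "Other": (72, 0.472),
    "Computer Science": (285, 0.439), "Physics": (207, 0.425),
    "Biology": (196, 0.367), "Engineering": (64, 0.344),
}
# Rounded per-domain correct counts sum back to the headline 1,119.
correct = {d: round(n * s) for d, (n, s) in domains.items()}
total_correct = sum(correct.values())            # 1119
math_share = correct["Math"] / total_correct     # ~0.516, i.e. roughly half
```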
Cross-generation ablation
The 51.85% on full HLE is one run on one config. The richer question is how the architecture compares to vanilla baselines on the same questions, and what happened when the same methodology was applied to the previous model generation. Testing started on Opus 4.6 non-thinking, where the majority of the time and budget were spent; I was about to run the full set on 4.6 when 4.7 dropped, so I pivoted and tuned for that instead.
The comparison sample is a representative subset that tracks full-HLE behavior within 1pp (V8 here = 53%, V8 on full HLE = 51.85%). Every cell below is on the same sample, judged by the same canonical o3-mini judge.
(Three sample sizes show up across this writeup: the full HLE set at 51.85%, this representative 100-Q subset at 53%, and smaller calibration samples that ran 55-57%. Smaller samples skew slightly high because they favor faster-completing, easier questions.)
| Config | Score | Notes |
|---|---|---|
| Vanilla Opus 4.6 | 18% | bare API |
| Vanilla Opus 4.7 high | 29-33% | bare API |
| Vanilla Opus 4.7 xhigh | 30-33% | bare API |
| Vanilla GPT-5.4 reasoning=high | 35% | bare API |
| Vanilla Opus 4.7 thinking-high | 38% | bare API |
| FF-STACK on Opus 4.6 | 45% | full FF-STACK on Opus 4.6 — the 4.6-era gold |
| V8 cross-arch — full pipeline | 53% | submission config — within 1pp of full HLE |
Points worth noting:
- The vanilla baseline calibrates against third-party numbers. ScaleAI’s published Opus 4.6 non-thinking score of 19% sits within sampling noise of the 18% I measured on this subset.
- The 4.6 governed config outscored vanilla base by 27 points, outscored ScaleAI’s published Opus 4.6 thinking max (34%), and came within a point of Anthropic’s reported Opus 4.7 thinking max score of 46%.
- Further ablations are needed on individual 4.7 modes, but the average lift was about 10-12 points per mode. Lift on 4.7 compressed compared to 4.6, which pushed me toward more orchestration-as-governance to produce the multi-agent config.
- The highest-scoring single model in v8 was Opus 4.7 thinking-high, at 38% vanilla on the sample. The full v8 cross-arch stack on top lifts that by 15 points.
Cost
| | This work | Typical published agent-with-tools |
|---|---|---|
| Per question | ~$1.60 | $5-$15 |
| Total full HLE | ~$3,500 | $11k-$32k |
| Hardware | One workstation, 6 worker threads | Often described as multi-stage pipelines with specialized vision routing and per-Q-type ensembles |
The single-leg version of the agent (no pair-vote, no arbiter) costs roughly $0.80/Q and scores in the high-40s. The multi-agent v8 config doubles that for the lift to 51.85%. The “typical published agent-with-tools” figures are estimated from public architecture descriptions, not normalized cost comparisons.
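The cost figures are internally consistent; a quick check using only numbers quoted above:

```python
# Consistency check on the cost table and the single-leg comparison.
questions = 2158
per_q_full = 1.60       # multi-agent v8 pipeline, per question
per_q_single = 0.80     # single-leg variant: no pair-vote, no arbiter

total_full = questions * per_q_full   # ~$3,453, the "~$3,500" headline figure
```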
The budget was a genuine constraint — it’s why the cross-architectural leg runs vanilla rather than stack-tuned, and why higher-end models aren’t in the config yet (see What’s Next). But staying close to baseline API cost was also deliberate, and from a research standpoint the constraint worked in my favor. The further an agent’s performance drifts from baseline API cost — specialized fine-tunes, proprietary models, heavy per-question compute — the less of that lift transfers to anyone running on public APIs. If a method is real and portable, it should show up cheap.

The bulk of the per-question cost is model inference and tool-use rounds, the same loop any agent-with-tools runs; the FF-STACK governance layer adds very little on top — the prompt is cached, the routing logic is just dispatch. The lift isn’t bought with compute — which also means there’s real headroom the moment the budget isn’t the ceiling.
What I Built First, Then Tuned
The timeline matters because it speaks to what’s portable.
- Days — built the local agent (“Cade,” short for Cadence) on top of the existing FF-STACK governance framework. Eighteen tools, claim tracking, evidence-tier discipline, an independent adversarial-review pass on every substantive answer. This part wasn’t built for HLE; it was built as a research assistant for my own work.
- ~3 weeks (4.6 era) — tuning Claude Opus 4.6 with the governance stack went from a +5pp lift over vanilla 4.6 on hard reasoning questions to a +27pp lift (45% vs vanilla 4.6 at 18% on a calibration set). That’s the single largest config lift in the entire research log.
- 4.7 launched — initial vanilla scores moved up; the +27pp playbook didn’t transfer cleanly. Most 4.6 governance assumptions (temperature controls, round caps, thinking-mode behavior) had to be reworked.
- ~1.5 weeks (4.7 mode-tuning) — per-effort-mode pipelines tuned, governance prompt re-fitted, output budget recalibrated for 4.7’s larger token inflation. Lift recovered to ~+10pp per mode.
- ~2 weeks (multi-agent build) — cross-architectural pair-vote with GPT-5.4 added on most domains, arbiter introduced with the simple-arbiter prompt, direct-commit bypass for multi-leg failures. Calibration moved to 54-57%. Full-HLE single run landed at 51.85%, within the predicted variance band.
The reason I’m laying that out is that the lift is not an artifact of slow accumulation. The methodology produced two cycles of large gains in ~5 weeks of evening tuning each, on two different model generations, with the same underlying research framework. That’s the part that interests me. The score is downstream.
What’s Next (Concretely)
Three lines of work are queued. The first two are wired and partially smoke-tested; the third is hardware-ready.
- Further tuning — adding more capable models like GPT Pro and Gemini Pro. Currently paused due to cost.
- Vision — adapter built; governance alone already produces a lift on image questions, but further development is needed.
- Cross-arch governance — apply FF-STACK to GPT and Gemini, not just Opus. The same methodology produced a +27pp lift on Opus 4.6 and ~+15pp on 4.7, so the framework appears portable across architectures; the infrastructure to deliver a governed GPT or Gemini already exists, and the missing piece is the per-architecture tuning cycle. A governance-on-all-legs config is the cleanest path to push the multi-agent ceiling materially past 55%.
There’s a longer roadmap behind this — automating the manual tuning flywheel through a multi-agent research environment, getting evaluation off binary scoring entirely, productizing the inference-layer governance for consumer use — but those are the research-side moves, and they belong in the companion post.
Bridge
If you came from the leaderboard and you’ve made it this far, here’s the part that matters more than the score: the agent was built quickly because the methodology underneath it has been running for almost a year.
The governance framework, the pair-vote pattern, the arbiter, the failure-mode taxonomy, the evidence-tier system, the claim ledger — none of these were designed in advance. They were observed in cross-architecture dialogue, encoded as text, tested against new models, codified into Python, and integrated into the agent. The cycle is the renewable advantage. The HLE score is one artifact of it.
If that sounds interesting, the companion post covers the research program: how the patterns were discovered, what other products came out of the same work (Crucible for evaluation, Foundry for multi-agent research pipelines, Cortex for behavioral authentication), and what the bigger picture looks like.
Eugene Dvorochkin is an independent AI researcher and the creator of Fieldframe Labs — mapping and structuring behavioral patterns in LLMs since May 2025.
Methodology paper for the HLE submission lives here. A scrubbed predictions file is on GitHub; the raw file is available on request. Contact: edvorochkin@gmail.com.