
51.85% on Humanity's Last Exam: How a Solo Researcher Built a Multi-Agent HLE Submission

If you found this from the HLE leaderboard, here’s the headline:

1,119 correct out of 2,158 on the canonical text-only set. Judged with CAIS’s official o3-mini-2025-01-31 methodology — bit-identical to the published evaluation script. Single workstation. Total per-question cost: ~$1.60.

No GPU cluster. No proprietary fine-tuning. The agent uses Anthropic’s Claude Opus 4.7 through their public APIs, wrapped in a governance framework I’ve spent the last year developing called FF-STACK. The framework is the load-bearing part — everything else is orchestration.

This post is the HLE writeup. It covers what the architecture looks like, where the score came from, what’s calibrated and what isn’t, and what the next version looks like. You can also check out the submission post that includes more technical details.

All of this is possible because of a research program that started long before HLE — a year of cross-architecture behavior research that produced the governance framework as a byproduct, along with several other things I’ve been building in parallel. If you want that origin story, plus the broader research behind the agent, I’ve put it in a companion post.


Why HLE

Over the course of my research, I’ve observed and developed some interesting things. But I wanted to challenge myself by competing on the same playing field as everyone else — standardized benchmarks and agentic setups. My work is primarily AI behavioral research and is broader in scope than raw “performance,” but I wanted to demonstrate the power of that research by building a local agent and testing it on one of the most discriminating public benchmarks available. As you’ll learn in the research overview post, most of my work was done at the inference layer in normal chat interfaces — including cross-architecture stack testing on Gemini and GPT models during that period. With the research I already had, I was able to build and tune a local agent in a month, on a budget, and achieve these results.

I picked HLE because it has the most discriminating headroom of any public benchmark right now. MMLU is 93% saturated. GPQA is at 94%. HellaSwag and HumanEval are pinned near ceiling. On HLE, the strongest vanilla models sit in the high 30s to low 40s and agent systems cluster in the 50s — that’s the kind of headroom where architecture work can actually move a number.

The honest caveat: HLE still grades on absolute correctness against a frozen 2,158-question text-only snapshot. It doesn’t measure self-correction, doesn’t penalize hallucinated compliance, and decays the moment a frontier model trains on the public set. The companion post goes deeper on this — I’ve been building a custom evaluation framework called Crucible as another byproduct of my research, because the saturation problem felt structural. Crucible’s empirical lifts on my own evaluation questions track the HLE lifts within a few points across architectures, which was a useful surprise when the HLE numbers came in.


The Pipeline

Each HLE question runs through a four-stage pipeline. The shape matters because the lift comes from how the stages compose, not from any single component:

1. Sentinel (Haiku domain classifier)
2. Forge (pre-research, vanilla-API knowledge primer)
3. Solver(s) — 2-3 legs per question, including cross-architectural pairing on most domains
4. Arbiter (Opus 4.7 + full governance + tools)

Output: final answer + provenance trace

Sentinel is a lightweight classifier (Haiku-class model) that picks the question’s HLE domain. Cheap and fast.

Forge is a pre-research stage that fires domain-specific lookups against free APIs (Wikipedia, PubChem, UniProt, NCBI) before the heavy reasoning starts. Runs vanilla — no governance overhead. Its job is to seed the solvers with accurate context so they don’t burn rounds discovering what’s already documented.
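To make the Forge idea concrete, here is a minimal sketch of a domain-keyed lookup dispatcher. The function names, domain keys, and lookup selection are my own illustrative assumptions, not the production Forge code; the endpoints shown are the public Wikipedia and PubChem REST APIs, and the real stage also queries UniProt and NCBI.

```python
import requests

# Illustrative free-API lookups keyed by HLE domain. The endpoints are public,
# but the selection and formatting here is an assumption, not the real Forge.
def wikipedia_summary(term: str) -> str:
    r = requests.get(
        f"https://en.wikipedia.org/api/rest_v1/page/summary/{term}",
        timeout=10,
    )
    return r.json().get("extract", "") if r.ok else ""

def pubchem_properties(compound: str) -> str:
    r = requests.get(
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
        f"{compound}/property/MolecularFormula,MolecularWeight/JSON",
        timeout=10,
    )
    return r.text if r.ok else ""

LOOKUPS = {
    "Chemistry": [wikipedia_summary, pubchem_properties],
    "Biology":   [wikipedia_summary],   # the real stage also hits UniProt/NCBI
    "default":   [wikipedia_summary],
}

def forge_context(domain: str, key_terms: list[str]) -> str:
    """Seed the solvers with documented facts before heavy reasoning starts."""
    chunks = []
    for term in key_terms:
        for lookup in LOOKUPS.get(domain, LOOKUPS["default"]):
            chunks.append(lookup(term))
    return "\n\n".join(c for c in chunks if c)
```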

The solver loop is where governance lives. Each solver leg uses Claude Opus 4.7 (in one of three effort modes — high, xhigh, or high with adaptive thinking) wrapped in the FF-STACK governance prompt and given access to 18 tools. Standard tools (Brave web_search, web_fetch, a Python compute sandbox with a hard kill, Wolfram Alpha, Wikipedia) plus some custom ones. Vanilla GPT-5.4 reasoning=high is also leveraged to benefit from different training data, though GPT has not yet been tuned for the stack.
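The “hard kill” on the compute sandbox is the one tool detail worth illustrating, since runaway computations are a common failure mode in tool loops. A minimal sketch, assuming the sandbox is simply a child process with a wall-clock timeout; the actual sandbox, its isolation, and its limits are not specified in this post.

```python
import subprocess
import sys

def run_sandboxed(code: str, timeout_s: int = 30) -> str:
    """Execute model-written Python in a child process with a hard wall-clock kill."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # hard kill: the child is terminated when this expires
        )
        return proc.stdout if proc.returncode == 0 else f"ERROR:\n{proc.stderr}"
    except subprocess.TimeoutExpired:
        return f"KILLED: exceeded {timeout_s}s wall-clock limit"
```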

For most domains, two solver legs run in parallel — frequently using cross-architectural pairing (one FF Opus, one vanilla GPT-5.4 reasoning=high). On harder domains, a third leg joins. On agreement at the pair stage, the pipeline commits. On disagreement, all reasoning chains flow into the arbiter.

The arbiter is always FF Opus 4.7 with full governance, full tools, and the Forge context. It reads the disagreeing chains, runs independent verification, and commits a final answer.
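Putting the stages together, the commit logic reads roughly like the sketch below. The hook names (classify_domain, forge_context, run_solver_leg, answers_agree, run_arbiter) and the leg labels are placeholders for the proprietary orchestration, not the actual implementation; the sketch only shows the agree-then-commit / disagree-then-arbitrate shape described above.

```python
from dataclasses import dataclass

@dataclass
class LegResult:
    answer: str
    reasoning: str

# Placeholder hooks. The real versions (routing, governance prompt, tool loop)
# are the proprietary part described above.
def classify_domain(question: str) -> str: ...                # Sentinel (Haiku)
def forge_context(domain: str, question: str) -> str: ...     # Forge pre-research
def run_solver_leg(leg: str, question: str, ctx: str) -> LegResult: ...
def answers_agree(a: str, b: str) -> bool: ...                # semantic compare (Haiku)
def run_arbiter(question: str, ctx: str, legs: list) -> str: ...  # governed Opus + tools

def answer_question(question: str, hard_domain: bool = False) -> str:
    domain = classify_domain(question)
    ctx = forge_context(domain, question)

    # Cross-architectural pair: one governed Opus leg, one vanilla GPT leg.
    legs = [
        run_solver_leg("ff_opus_high", question, ctx),
        run_solver_leg("vanilla_gpt_reasoning_high", question, ctx),
    ]
    if hard_domain:  # the actual third-leg trigger is per-domain and not specified here
        legs.append(run_solver_leg("ff_opus_xhigh", question, ctx))

    if all(answers_agree(legs[0].answer, leg.answer) for leg in legs[1:]):
        return legs[0].answer                 # agreement at the pair stage: commit

    # Disagreement: every reasoning chain flows into the governed arbiter.
    return run_arbiter(question, ctx, legs)
```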

The per-domain pair selection, third-leg triggers, self-review pass conditions, broken-leg detection logic, refusal-bypass priority order, and governance prompt content are not specified in this document. The box-level architecture above is the reproducible surface; the orchestration logic and governance content are the proprietary substance.

What is specified openly: the models. The pipeline uses Opus 4.7 in three effort modes (high, xhigh, high-with-adaptive-thinking) — all governed — plus vanilla GPT-5.4 reasoning=high as the cross-architectural leg. Haiku 4.5 handles classification and semantic comparison. OpenAI’s o4-mini powers the thinking-as-tool callable. CAIS’s o3-mini does the final judging.
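For quick reference, the openly specified roles map out as follows. These labels are shorthand taken from the paragraph above, not literal API model identifiers.

```python
# Role -> model shorthand, as stated in this writeup (not API model strings).
MODEL_ROLES = {
    "solver_governed": [
        "Opus 4.7 high",
        "Opus 4.7 xhigh",
        "Opus 4.7 high + adaptive thinking",
    ],
    "solver_cross_arch":      "GPT-5.4 reasoning=high (vanilla)",
    "classifier_and_compare": "Haiku 4.5",
    "thinking_as_tool":       "o4-mini",
    "final_judge":            "o3-mini (CAIS canonical judging)",
}
```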


The Result Surface

Headline: 1,119 / 2,158 = 51.85% on canonical o3-mini judging.

Per-domain

| Domain | N | Score |
|---|---|---|
| Math | 956 | 60.4% |
| Chemistry | 110 | 55.5% |
| Humanities | 263 | 53.2% |
| Other | 72 | 47.2% |
| Computer Science | 285 | 43.9% |
| Physics | 207 | 42.5% |
| Biology | 196 | 36.7% |
| Engineering | 64 | 34.4% |

Math carries the score — 956 questions at 60.4% is roughly 577 correct answers, about half of the 1,119 total. Biology and Engineering at 34-37% are the structural ceiling; both are knowledge-heavy domains where verification tools help less than they do on Math or Chemistry. Knowledge-heavy domains will need additional uplift from further cross-model inclusion.

Cross-generation ablation

The 51.85% on full HLE is one run on one config. The richer question is how the architecture compares to vanilla baselines on the same questions, and what happened when the same methodology was applied to the previous model generation. Testing started on Opus 4.6 non-thinking, where the majority of the time and budget were spent. I was about to run the full set on 4.6 when 4.7 dropped, so I pivoted and tuned for that instead.

The comparison sample is a representative subset that tracks full-HLE behavior within 1pp (V8 here = 53%, V8 on full HLE = 51.85%). Every cell below is on the same sample, judged by the same canonical o3-mini judge.

(Three sample sizes show up across this writeup: the full HLE set at 51.85%, this representative 100-Q subset at 53%, and smaller calibration samples that ran 55-57%. Smaller samples skew slightly high because they favor faster-completing, easier questions.)

| Config | Score | Notes |
|---|---|---|
| Vanilla Opus 4.6 | 18% | bare API |
| Vanilla Opus 4.7 high | 29-33% | bare API |
| Vanilla Opus 4.7 xhigh | 30-33% | bare API |
| Vanilla GPT-5.4 reasoning=high | 35% | bare API |
| Vanilla Opus 4.7 thinking-high | 38% | bare API |
| FF-STACK on Opus 4.6 | 45% | full FF-STACK on Opus 4.6 — the 4.6-era gold |
| V8 cross-arch — full pipeline | 53% | submission config — within 1pp of full HLE |

Points worth noting:

Cost

| | This work | Typical published agent-with-tools |
|---|---|---|
| Per question | ~$1.60 | $5-$15 |
| Total full HLE | ~$3,500 | $11k-$32k |
| Hardware | One workstation, 6 worker threads | Often described as multi-stage pipelines with specialized vision routing and per-Q-type ensembles |

The single-leg version of the agent (no pair-vote, no arbiter) costs roughly $0.80/Q and scores in the high-40s. The multi-agent v8 config doubles that for the lift to 51.85%. The “typical published agent-with-tools” figures are estimated from public architecture descriptions, not normalized cost comparisons.
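The cost figures are internally consistent; here is a quick sanity check using the numbers from the tables above (the rounding is mine):

```python
N_QUESTIONS = 2158
SINGLE_LEG_COST = 0.80    # $/question, no pair-vote, no arbiter
MULTI_AGENT_COST = 1.60   # $/question, v8 submission config

print(f"single-leg full run : ~${SINGLE_LEG_COST * N_QUESTIONS:,.0f}")   # ~$1,726
print(f"v8 full run         : ~${MULTI_AGENT_COST * N_QUESTIONS:,.0f}")  # ~$3,453, i.e. ~$3,500
```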

The budget was a genuine constraint — it’s why the cross-architectural leg runs vanilla rather than stack-tuned, and why higher-end models aren’t in the config yet (see What’s Next). But staying close to baseline API cost was also deliberate, and from a research standpoint the constraint worked in my favor. The further an agent’s performance drifts from baseline API cost — specialized fine-tunes, proprietary models, heavy per-question compute — the less of that lift transfers to anyone running on public APIs. If a method is real and portable, it should show up cheap.

The bulk of the per-question cost is model inference and tool-use rounds, the same loop any agent-with-tools runs; the FF-STACK governance layer adds very little on top — the prompt is cached, the routing logic is just dispatch. The lift isn’t bought with compute — which also means there’s real headroom the moment the budget isn’t the ceiling.


What I Built First, Then Tuned

The timeline matters because it speaks to what’s portable.

The reason I’m laying that out is that the lift is not an artifact of slow accumulation. The methodology produced two cycles of large gains in ~5 weeks of evening tuning each, on two different model generations, with the same underlying research framework. That’s the part that interests me. The score is downstream.


What’s Next (Concretely)

Three lines of work are queued. The first two are wired and partially smoke-tested; the third is hardware-ready.

There’s a longer roadmap behind this — automating the manual tuning flywheel through a multi-agent research environment, getting evaluation off binary scoring entirely, productizing the inference-layer governance for consumer use — but those are the research-side moves, and they belong in the companion post.


Bridge

If you came from the leaderboard and you’ve made it this far, here’s the part that matters more than the score: the agent was built quickly because the methodology underneath it has been running for almost a year.

The governance framework, the pair-vote pattern, the arbiter, the failure-mode taxonomy, the evidence-tier system, the claim ledger — none of these were designed in advance. They were observed in cross-architecture dialogue, encoded as text, tested against new models, codified into Python, and integrated into the agent. The cycle is the renewable advantage. The HLE score is one artifact of it.

If that sounds interesting, the companion post covers the research program: how the patterns were discovered, what other products came out of the same work (Crucible for evaluation, Foundry for multi-agent research pipelines, Cortex for behavioral authentication), and what the bigger picture looks like.


Eugene Dvorochkin is an independent AI researcher and the creator of Fieldframe Labs — mapping and structuring behavioral patterns in LLMs since May 2025.

Methodology paper for the HLE submission lives here. A scrubbed predictions file is on GitHub; the raw file is available on request. Contact: edvorochkin@gmail.com.

