Skip to content
Fieldframe Labs logo
Go back

51.85% on Humanity's Last Exam: How a Solo Researcher Built a Multi-Agent HLE Submission

If you found this from the HLE leaderboard, here’s the headline:

1,119 correct out of 2,158 on the canonical text-only set. Judged with CAIS’s official o3-mini-2025-01-31 methodology, bit-identical to the published evaluation script. Single workstation. Total per-question cost: ~$1.60.

No GPU cluster. No proprietary fine-tuning. The agent uses Anthropic’s Claude Opus 4.7 through their public APIs, wrapped in a governance framework I’ve spent the last year developing called FF-STACK. That governance work is the foundation; the multi-agent orchestration on top is what turned it into a competitive benchmark score.

This post is the HLE writeup. It covers the architecture, where the score came from, and where the next version goes. The methodology paper has the deeper technical detail.

All of this is possible because of a research program that started long before HLE: a year of cross-architecture behavioral research that produced the governance framework as a byproduct, along with several other things I’ve been building in parallel. If you want that origin story, plus the broader research behind the agent, I’ve put it in a companion post.


Why HLE

My research was always meant to be exploratory and broader than performance alone. It is primarily AI behavioral research: how models reason, where they fail, how they respond to structure, and whether those patterns hold across architectures. But I also wanted to test the work in the same arena where frontier systems are usually compared: standardized benchmarks and agentic evaluation setups. So I built a local agent and tested it on HLE, one of the least-saturated and most discriminating public benchmarks for frontier LLM reasoning.

Most of the underlying research was conducted at the inference layer in standard chat interfaces, including cross-architecture stack testing between Claude, Gemini, and GPT. Using the research base I already had, I built and tuned a local agent in about six weeks, on a limited budget, and got the results above.

I chose HLE because it still has meaningful headroom. Many older benchmarks no longer separate frontier models well: MMLU now sits at 93% accuracy, GPQA at 94%, and HellaSwag and HumanEval are pinned near ceiling. HLE was designed in response to that problem, with expert-level questions across broad academic domains and much lower frontier-model accuracy. That makes it useful for testing whether a method produces real lift rather than noise. On HLE, the strongest vanilla models score in the high 30s to low 40s, while agentic systems cluster around the 50s. That is the kind of gap where an inference-layer architecture can actually move the needle.

The caveat is that HLE is still a benchmark. It grades absolute correctness against a frozen set of questions. It doesn’t measure self-correction, doesn’t penalize hallucinated compliance, and decays the moment a frontier model trains on the public set. The companion post goes deeper on these limitations.

That is also why I have been building Crucible, a custom evaluation framework that emerged as another byproduct of the research. The saturation problem felt structural, not incidental. Crucible is designed to test the behaviors that standard benchmarks often miss: how models handle ambiguity, revise under pressure, weigh conflicting evidence, and recover from failure. A useful surprise was that Crucible’s empirical lifts on my own evaluation questions tracked the HLE lifts within a few points across architectures, which gave my testing methodology more signal when the numbers came in.

To ground these claims: HLE’s own paper frames it as a response to benchmark saturation, noting that LLMs now exceed 90% accuracy on benchmarks such as MMLU, which limits their usefulness for measuring frontier capability. Scale’s HLE leaderboard describes MMLU and GPQA as formerly frontier benchmarks whose saturation makes them weaker signals of current progress.

Finally, the submission itself is to the HuggingFace Zoom AI leaderboard for Agents with Tools, the HLE track for tool-enabled agentic systems. That is the right peer group for FF-STACK v8.


The Pipeline

Each HLE question runs through a four-stage pipeline. The shape matters because the lift comes from how the stages compose, not from any single component:

   Sentinel (Haiku domain classifier)


   Forge (pre-research, vanilla-API knowledge primer)


   Solver(s): 2-3 legs per question, including cross-architectural pairing on most domains


   Arbiter (Opus 4.7 + full governance + tools)


   Final answer + provenance trace

Sentinel is a lightweight classifier (Haiku-class model) that picks the question’s HLE domain. Cheap and fast.

Forge is a pre-research stage that fires domain-specific lookups against free APIs (Wikipedia, PubChem, UniProt, NCBI) before the heavy reasoning starts. Runs vanilla. No governance overhead. Its job is to seed the solvers with accurate context so they don’t burn rounds discovering what’s already documented.

The solver loop is where governance lives. Each solver leg uses Claude Opus 4.7 (in one of three effort modes: high, xhigh, or high with adaptive thinking) wrapped in the FF-STACK governance prompt and given access to 18 tools. Standard tools (Brave web_search, web_fetch, a Python compute sandbox with a hard kill, Wolfram Alpha, Wikipedia) plus some custom ones. Vanilla GPT-5.4 reasoning=high is also used to bring in different training data, though GPT hasn’t been tuned for the stack yet.

For most domains, two solver legs run in parallel, frequently using cross-architectural pairing (one FF Opus, one vanilla GPT-5.4 reasoning=high). On harder domains, a third leg joins. On agreement at the pair stage, the pipeline commits. On disagreement, all reasoning chains flow into the arbiter.

The arbiter is always FF Opus 4.7 with full governance, full tools, and the Forge context. It reads the disagreeing chains, runs independent verification, and commits a final answer.

The per-domain pair selection, third-leg triggers, self-review pass conditions, broken-leg detection logic, refusal-bypass priority order, and governance prompt content are not specified in this document. The box-level architecture above is the reproducible surface; the orchestration logic and governance content are the proprietary substance.

What is specified openly: the models. The pipeline uses Opus 4.7 in three effort modes (high, xhigh, high-with-adaptive-thinking), all governed, plus vanilla GPT-5.4 reasoning=high as the cross-architectural leg. Haiku 4.5 handles classification and semantic comparison. OpenAI’s o4-mini powers the thinking-as-tool callable. CAIS’s o3-mini does the final judging.


The Result Surface

Headline: 1,119 / 2,158 = 51.85% on canonical o3-mini judging.

Per-domain

DomainNScore
Math95660.4%
Chemistry11055.5%
Humanities26353.2%
Other7247.2%
Computer Science28543.9%
Physics20742.5%
Biology19636.7%
Engineering6434.4%

Math carries the score: 956 questions at 60% delivers roughly half the total correct answers. Biology and Engineering at 35-37% are the structural ceiling; both are knowledge-heavy domains where verification tools help less than they do on Math or Chemistry. Lifting the knowledge-heavy domains further will require more cross-model coverage.

Cross-generation ablation

The 51.85% on full HLE is one run on one config. The richer question is how the architecture compares to vanilla baselines on the same questions, and what happens when the same methodology runs on the previous model generation. Testing started on Opus 4.6 non-thinking, where the majority of the time and budget went. I was about to run the full 4.6 when 4.7 dropped, so I pivoted to tune for that.

The comparison sample is a representative subset that tracks full-HLE behavior within 1pp (V8 here = 53%, V8 on full HLE = 51.85%). Every cell below is on the same sample, judged by the same canonical o3-mini judge.

(Three smaller sample sets show up across this writeup, all designed as representative spreads across HLE’s eight domains: a 100Q calibration sample at 57%, this 100Q purpose-hard holdout at 53%, and a 200Q fresh holdout at 55%. The score range between them isn’t really about sample size. It reflects how hard the spread is to calibrate at higher score levels: when more of the discrimination is happening at the hard end of the distribution, small variation in that spread changes the headline score meaningfully. Across the 400 questions in these three samples, the average is 55%, close to the full HLE result of 51.85%.)

ConfigScoreNotes
Vanilla Opus 4.618%bare API
Vanilla Opus 4.7 high29-33%bare API
Vanilla Opus 4.7 xhigh30-33%bare API
Vanilla GPT-5.4 reasoning=high35%bare API
Vanilla Opus 4.7 thinking-high38%bare API
FF-STACK on Opus 4.645%full FF-STACK on Opus 4.6, the 4.6-era gold
FF-STACK single-model on Opus 4.7 high44%full FF-STACK + high effort
FF-STACK single-model on Opus 4.7 thinking-high46%best 4.7 single-model
V8 cross-arch (full pipeline)53%submission config, within 1pp of full HLE

Points worth noting:

Cost

This workTypical published agent-with-tools
Per question~$1.60$5-$15
Total full HLE~$3,500$11k-$32k
HardwareOne workstation, 6 worker threadsOften described as multi-stage pipelines with specialized vision routing and per-Q-type ensembles

The single-model version of the agent (no pair-vote, no arbiter) costs roughly $0.80/Q and scores in the high-40s. The multi-agent v8 config doubles that for the lift to 51.85%. The “typical published agent-with-tools” figures are estimated from public architecture descriptions, not normalized cost comparisons.

The budget was a real constraint. This is why the cross-architecture comparison uses vanilla model runs rather than stack-tuned variants, and why higher-end models are not yet in the configuration. But keeping the system close to baseline API cost was also intentional. I wanted to know whether the method produced lift under conditions that other people could actually reproduce.

That matters because performance gains are less informative when they depend on specialized fine-tunes, proprietary models, or heavy per-question compute. Those gains may be real, but they are harder to separate from the infrastructure that produced them. If a method is real and architecture-agnostic, it should produce lift without making the economics impractical.

In this run, most of the per-question cost came from standard model inference and tool-use rounds, the same loop any agent-with-tools system already runs. The FF-STACK governance layer added very little overhead: the prompt was cached, and the routing logic was just dispatch. The lift was not bought with compute. And if the budget was the ceiling, then the score also points to meaningful headroom.

Caveats

A handful of things worth flagging up front, all covered in more detail in the methodology paper:

That’s the honest picture. The methodology paper has the full disclosures including the URL blacklist policy and the answer-format preamble that suppresses refusal cascades.


What I Built First, Then Tuned

HLE wasn’t the origin of this work. It was a late stress test. Most of what made the score possible came first: roughly a year of cross-architecture behavioral research conducted in plain-text agents, which produced the FF-STACK governance framework and, in parallel, a custom evaluation methodology (Crucible); then Cade (the local agent), built on that framework for my own use. Only after all of that came the HLE-specific tuning: about six weeks of evening work, total, spread across two model generations.

That distinction matters. The HLE result wasn’t the product of a year spent optimizing against one benchmark. It was a downstream test of a framework that already existed. The timeline below covers that final phase.

The reason I’m laying that out is that the lift is not an artifact of slow accumulation. The methodology produced two cycles of large gains in roughly six weeks of evening tuning total, on two different model generations, with the same underlying research framework. That’s the part that interests me. The score is downstream.


What’s Next

Three lines of work are already queued. The first two are wired and partially smoke-tested. The third is infrastructure-ready.

  1. Further model tuning. Two directions here. The first is deepening the work on Opus 4.7 itself: most of the stack tuning happened on 4.6, and the 4.7 single-model configs are comparatively immature, so per-mode tuning and further study on 4.7 is an underexplored lever that could close or invert the 4.6/4.7 gap on its own. The second is adding more capable models, including GPT Pro and Gemini Pro, currently paused because of cost, not infrastructure.

  2. Vision extension. A vision adapter is already built. Early image-question tests show lift from governance alone, but the vision path needs more development before it is ready for a full evaluation run.

  3. Cross-architecture governance. The cleanest path to raising the multi-agent ceiling is applying FF-STACK governance across all model legs, not just Opus. The 4.6 to 4.7 transition produced meaningful lift on Anthropic models, first +27 points in the 4.6 configuration and then roughly +15 points after 4.7-specific tuning. That suggests the framework is portable, but full per-architecture tuning is still incomplete. The infrastructure for governed GPT and Gemini runs already exists. The missing piece is the tuning cycle for each architecture. A governance-on-all-legs configuration is the clearest next step toward pushing the multi-agent ceiling materially past 55%.

There’s a longer roadmap behind this: automating the manual tuning flywheel through a multi-agent research environment, moving evaluation beyond binary scoring, and productizing inference-layer governance for consumer use. But those are the research-side moves, and they belong in the companion post.


A note on the disclosure level

Publishing this much of the methodology and config is probably atypical for a leaderboard submission. Most submitters share the headline, a model card, maybe a blog post. The detailed stuff (filtering audit basis, exact preambles, calibration buckets, ablation tables) usually stays between the submitter and the maintainers, or doesn’t get written down at all.

I went the other way because I don’t really see the point of benchmarks and scores if no one’s honest about how they got them. When I was trying to figure out where my own work sat in the landscape, the hardest part was finding out what counted as a good score on HLE and what other setups actually looked like under the hood. Most descriptions were vague enough that I couldn’t benchmark myself against them with any confidence. If this writeup is useful to someone else trying to do the same, that’s reason enough.

The bigger thing, though: this was never really about the test. It was about reaching a point in the research where I felt ready to put it out in the world and have other people poke at it. The benchmark gave me a concrete thing to write everything down around. The transparency is what makes it actually open for collaboration.


Bridge

If you came from the leaderboard and made it this far, this is the part that matters more than the score: the agent was built quickly because the methodology underneath it had already been running for almost a year.

The governance framework, pair-vote pattern, arbiter, failure-mode taxonomy, evidence-tier system, and claim ledger were not designed in advance as a product roadmap. They were observed in cross-architecture dialogue, encoded as text, tested against new models, codified into Python, and integrated into the agent.

That cycle is the renewable advantage.

The HLE score is one artifact of it.

If that sounds interesting, the companion post covers the research program: how the patterns were discovered, what other products came out of the same work (Crucible for evaluation, Foundry for multi-agent research pipelines), and what the bigger picture looks like.


Eugene Dvorochkin is an independent AI researcher and the creator of Fieldframe Labs, mapping and structuring behavioral patterns in LLMs since May 2025.

Methodology paper for the HLE submission lives here. A scrubbed predictions file is on GitHub; the raw file is available on request. Contact: edvorochkin@gmail.com.


Share this post on:

Previous Post
HLE Submission Methodology Paper — FF-STACK v8