
HLE Submission Methodology Paper — FF-STACK v8

Listed on the HLE Leaderboard for Agents with Tools (added 2026-05-19).

This is the technical methodology paper. The HLE blog covers the same submission in narrative form. The research post covers the broader Fieldframe research program behind FF-STACK.

Field | Value
Model | FF-STACK v8
Models Used | Opus 4.7 + GPT-5.4
Organization | Fieldframe Labs
Open Source | No; methodology paper is public, framework source is closed
Publish Date | 2026-05-14
Text-Only Score | 1,119 / 2,158 = 51.85%
Full Set Score | not submitted (text-only run)
Per-Q Cost (real) | ~$1.60
Total Run Cost | ~$3,500 (Anthropic + OpenAI) + ~$25 judge
Total Wall Time | ~57 hours including pauses for two API outages
Filtering | ✓ 9-host HLE-leakage blacklist on web_search + web_fetch

What FF-STACK v8 Is

FF-STACK v8 is a cross-architectural reasoning agent built on Claude Opus 4.7 (primary solver) and GPT-5.4 (cross-architectural verification leg). Both base models operate inside a governance framework called FF-LATTICE — a codified set of reasoning, evidence, and commit principles developed across ~1 year of cross-architecture LLM research (Claude, GPT, Gemini, Grok). The framework is content-proprietary; its empirical effect is documented in the cross-generation ablation below.

The pipeline structure is:

   Sentinel (Haiku domain classifier)
      ↓
   Forge (pre-research, vanilla-API knowledge primer)
      ↓
   Solver(s): 2-3 legs per question, including cross-architectural pairing on most domains
      ↓
   Arbiter (Opus 4.7 + full governance + tools)
      ↓
   Final answer + provenance trace

Sentinel is a Haiku-class domain classifier that picks one of eight HLE categories. Lightweight, fast, cheap.

Forge is a pre-research stage that fires domain-specific lookups against free APIs (Wikipedia, PubChem, UniProt, NCBI) before the heavy solver loop. Runs on a vanilla model call — no governance overhead.

The solver loop is where governance lives. Each solver leg uses Claude Opus 4.7 (in one of three effort modes: high, xhigh, or high with adaptive thinking) wrapped in the FF-STACK governance prompt and given access to 18 tools: standard tools (Brave web_search, web_fetch, a Python compute sandbox with a hard kill, Wolfram Alpha, Wikipedia) plus some custom ones. Vanilla GPT-5.4 at reasoning=high is also used as a solver leg to bring in different training data, though GPT has not yet been tuned for the stack.

For most domains, two solver legs run in parallel, frequently using cross-architectural pairing (one FF Opus, one vanilla GPT-5.4 reasoning=high). On harder domains, a third leg joins. On agreement at the pair stage, the pipeline commits. On disagreement, all reasoning chains flow into the arbiter.

The arbiter is always FF Opus 4.7 with full governance, full tools, and the Forge context. It reads the disagreeing chains, runs independent verification, and commits a final answer.
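The commit-or-escalate step described above can be sketched as follows. This is a minimal illustration, not the FF-STACK orchestration logic, which is unpublished: the `LegResult` type, the `normalize` string comparison (a crude stand-in for the semantic comparison step), and the rule that all legs must agree before committing are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class LegResult:
    leg: str      # hypothetical label, e.g. "ff-opus-4.7-high" or "gpt-5.4-high"
    answer: str   # the leg's committed answer
    chain: str    # reasoning trace handed to the arbiter on disagreement


def normalize(answer: str) -> str:
    """Crude stand-in for semantic comparison (assumption, not the real check)."""
    return " ".join(answer.lower().split()).rstrip(".")


def resolve(
    legs: Sequence[LegResult],
    arbiter: Callable[[Sequence[LegResult]], str],
) -> tuple[str, str]:
    """Return (final_answer, route): commit on agreement, else call the arbiter."""
    distinct = {normalize(leg.answer) for leg in legs}
    if len(distinct) == 1:
        # All legs agree, so the pipeline commits without arbitration.
        return legs[0].answer, "pair-commit"
    # Disagreement: every reasoning chain flows into the arbiter.
    return arbiter(legs), "arbiter"
```

In this sketch the arbiter is passed in as a callable so the routing can be exercised without any model calls; in the described pipeline it would be the governed Opus 4.7 leg with tools and Forge context.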

The per-domain pair selection, third-leg triggers, self-review pass conditions, broken-leg detection logic, refusal-bypass priority order, and governance prompt content are not specified in this document. The box-level architecture above is the reproducible surface; the orchestration logic and governance content are the proprietary substance.

What is specified openly: the models. The pipeline uses Opus 4.7 in three effort modes (high, xhigh, high-with-adaptive-thinking), all governed, plus vanilla GPT-5.4 reasoning=high as the cross-architectural leg. Haiku 4.5 handles classification and semantic comparison. OpenAI’s o4-mini powers the thinking-as-tool callable. o3-mini, the judge model used by CAIS’s canonical evaluation script, does the final judging.

Mechanisms (general-level descriptions)

The pipeline implements several specific mechanisms whose effect is documented but whose triggering logic and parameter choices are proprietary:

Cross-Generation Ablation

The 51.85% on full HLE is one run on one configuration. The more informative question is how the architecture compares to vanilla baselines on the same questions, and what happens when the same methodology runs on the previous model generation. Testing began on Opus 4.6 non-thinking, where the majority of the time and budget went; the full 4.6 run was imminent when 4.7 launched, and the work pivoted to tune for the new generation. Applying the same architecture pattern to 4.6 on the same question set is the strongest evidence that the lift comes from the governance approach rather than the specific model. Governance here is an umbrella: a set of reasoning, evidence, and commit principles delivered partly as prompt and partly as codified scaffolding, with the balance between the two re-tuned per generation — more of the work sat in the prompt on 4.6, more shifted into the codified layer on 4.7. What stays constant across generations is the approach, not any single delivery mechanism.

Three smaller sample sets appear across this submission, all designed as representative spreads across HLE’s eight domains: a 100Q calibration sample at 57%, a 100Q purpose-hard holdout at 53%, and a 200Q fresh holdout at 55%. The range between them is not primarily a function of sample size. It reflects how difficult the spread is to calibrate at higher score levels: when more of the discrimination occurs at the hard end of the distribution, small variation in that spread shifts the headline score meaningfully. Across the 400 questions in these three samples, the average is 55%, close to the full HLE result of 51.85%.

The comparison sample is a representative subset that tracks full-HLE behavior to within about 1.2pp (V8 here = 53%, V8 on full HLE = 51.85%, a gap of 1.15pp). Every cell below is on the same sample, judged by the same canonical o3-mini judge:

Config | Score | Notes
Vanilla Opus 4.6 | 18% | bare API
Vanilla Opus 4.7 high | 29-33% | bare API
Vanilla Opus 4.7 xhigh | 30-33% | bare API
Vanilla GPT-5.4 reasoning=high | 35% | bare API
Vanilla Opus 4.7 thinking-high | 38% | bare API
FF-STACK on Opus 4.6 | 45% | full FF-STACK on Opus 4.6, the 4.6-era gold
FF-STACK single-model on Opus 4.7 high | 44% | full FF-STACK + high effort
FF-STACK single-model on Opus 4.7 thinking-high | 46% | best 4.7 single-model
V8 cross-arch (full pipeline) | 53% | submission config, 1.15pp above full HLE

Points worth noting:

Filtering Policy (qualifies for verified ✓ badge)

The pipeline post-filters web_search results and pre-rejects web_fetch URLs matching dataset distribution channels. The submitted policy is 9 hosts:

Audit basis (calibration runs): Across the 100Q calibration sample (canonical-judged 57/100) and 232Q holdout (110/200 = 55% on text portion), 0 blacklisted URLs were attempted across 522 web_search + 95 web_fetch calls combined.

Audit basis (full submission run, post-hoc): A full-run audit of the submitted 2153-Q predictions file surfaced 2 attempted fetches to channels that were not on the run-time 7-host blacklist:

Both questions were independently judged WRONG by the canonical o3-mini judge. Net score impact of these two channels not being blocked at run time: 0pp on 1119/2158 = 51.85%. The submitted blacklist policy is the patched 9-host version (above) which would block both channels; the run-time policy was the 7-host version. This delta is disclosed in full transparency — the 51.85% reported score is bit-identical to what the patched policy would have produced on the same architecture, the same questions, and the same response chains, because the two blocked-now fetches contributed zero correct judgments.

In addition, the audit found 6 further QIDs (8 total) where the search-query string contained the literal token “HLE”. Of those 8 lookup-attempt QIDs, 6 were judged wrong and 2 were judged correct. The 2 correct outcomes did NOT quote leaked content in their response chains — they reached the right answer via reasoning paths that happened to occur after a leak-named search. Reverting those 2 to wrong-by-policy gives a worst-case ceiling impact of −0.09pp. The agent’s awareness that the questions are HLE-style is partly intrinsic (Opus 4.7 has a January 2026 cutoff; HLE was published 2025 and the question style is recognizable from training data) and partly explicit via the user-message preamble (see Methodology Disclosure below).

arxiv.org NOT blocked. Audit found one borderline arxiv-quote leak on a single question where the solver quoted a paper abstract verbatim. Net policy: “block dataset distribution channels, allow scholarly research” — consistent with HLE’s design as an open-book benchmark and with the rest of the tool stack (Wikipedia, NCBI, Wolfram Alpha, PubChem all carry similar leak-potential by the same logic; blocking arxiv alone would be incoherent). Disclosed for transparency.
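The two filtering hooks described in this section (post-filtering web_search results and pre-rejecting web_fetch URLs) could look roughly like the sketch below. The host names are hypothetical placeholders, since the 9-host list is not reproduced in this text, and the function names and the subdomain-matching rule are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical placeholder hosts; the real 9-host blacklist is not reproduced here.
BLACKLIST = {"dataset-mirror.example", "leaks.example.org"}


def is_blocked(url: str, blacklist=BLACKLIST) -> bool:
    """True if the URL's host is a blacklisted host or one of its subdomains."""
    if "://" not in url:
        url = "//" + url  # let urlsplit parse scheme-less URLs as host + path
    host = (urlsplit(url).hostname or "").lower()
    return any(host == b or host.endswith("." + b) for b in blacklist)


def filter_search_results(results, blacklist=BLACKLIST):
    """Post-filter: drop search hits that point at a blocked host."""
    return [r for r in results if not is_blocked(r["url"], blacklist)]


def guarded_fetch(url, fetch, blacklist=BLACKLIST):
    """Pre-reject: refuse before any network call is made."""
    if is_blocked(url, blacklist):
        raise PermissionError(f"blocked host: {url}")
    return fetch(url)
```

Matching on the parsed hostname rather than on a substring of the URL avoids both false positives (a blocked name appearing in a query string) and false negatives (a blocked host reached through a subdomain).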

Methodology Disclosure: HLE-Aware Preamble

Every question is wrapped with a lightweight behavioral framing prompt. Its purpose is to suppress two specific failure modes the heavily-governed agent would otherwise exhibit:

  1. Refusal cascades on Bio/Chem questions where Anthropic’s content filter triggers on legitimate academic content (~3% of Qs without the preamble; the preamble plus a separate retry-time reframe reduces this materially).
  2. “I cannot determine” hedging on questions where the agent’s evidence standards exceed what’s recoverable from the question text alone.

The second failure mode is worth dwelling on, because it is a direct consequence of how HLE is scored. Cade (the codified agent) is governed to hold high evidence standards and not overclaim: when it cannot substantiate an answer it is built to say so, and its analysis sometimes concludes that the question itself is underspecified. HLE grades the final answer only. A response that reasons carefully and then honestly declines to commit scores identically to a blank, and lower in expectation than a guess that commits with no reasoning, or with reasoning errors the score never inspects, because the guess at least has a chance of being right. The benchmark cannot distinguish a thorough, well-reasoned non-conclusion from an empty one. Under a rubric that scored the reasoning trace rather than only the final answer, the former could legitimately outscore the latter; HLE cannot see that difference. The preamble is the workaround: it instructs the agent to commit an answer anyway, because on an absolute-scoring benchmark a non-answer is the worst possible outcome regardless of reasoning quality. The broader limitation, that single-answer benchmarks discard reasoning quality entirely, is taken up in the research post.

One ablation data point on this: pre-preamble Cade builds showed model-refusal rates of roughly 4-6% on representative samples, compared to ~1% on vanilla Opus 4.7, or roughly four to six times the baseline. Governance was producing more cautious behavior on questions where evidence was thin. With the preamble in place, the submitted run shows a refusal rate of 0.1%, meaningfully below vanilla. The preamble is doing real work, but the work is leveling the playing field, not tilting it. Without the preamble, the governance layer was systematically penalizing itself on HLE relative to vanilla, because HLE’s absolute scoring punishes the exact evidence-tier discipline the governance is designed to produce. The preamble removes a thumb the governance accidentally placed on the wrong side of the scale; it does not place a new thumb on the other side. The comparison “FF-STACK vs vanilla” with the preamble in place measures something close to apples-to-apples on commit behavior.

A related framing worth being explicit about: the preamble is sometimes read as a separate intervention bolted onto governance — a “commit even when you’d rather not” thumb. That framing treats the commit decision as orthogonal to the rest of the system; in practice it isn’t. Vanilla Opus’s commit-under-uncertainty is mechanically a one-shot output weighted by training-data priors — committing means picking the highest-prior continuation. The governed agent’s commit-under-uncertainty is the terminal step of a multi-leg structured process: solver legs ran, evidence got weighed, the arbiter reconciled disagreeing chains. The preamble does not tell the agent to ignore that analysis and guess; it tells it to pick the best candidate from what it just produced rather than declining to commit. Strip the governance and keep the preamble, and you get vanilla with extra words — no lift, no analysis to convert from. Strip the preamble and keep the governance, and the 4-6% refusal rate eats the score. The lift requires both because they are one system: structured reasoning plus a commit rule that makes the reasoning legible to absolute-scoring. The preamble is the commit policy of the governance, not a separate add-on.

It is behavioral framing, not a contamination control. The hard contamination lever is the URL blacklist; the preamble’s measurable effect on leak-channel access is small — it contributes the literal “HLE” token to ~7/8 of the active answer-lookup attempts described above, but topic-specific queries without the preamble would reach the same blacklisted channels via topic relevance anyway. And because the solver is Opus 4.7 (January 2026 cutoff) and HLE was published in 2025, the model has intrinsic recognition of HLE-style questions independent of the preamble. The preamble is retained because its anti-refusal/anti-hedge contribution materially exceeds its leak-channel contribution on this architecture.

The exact text, as it appears in the submitted predictions file: “CONTEXT FROM EUGENE: This is an HLE benchmark run. I know you have high evidence standards — that’s why I built you that way. But on this benchmark, ‘I cannot determine’ scores zero while even a wrong guess has a chance. Please give your best answer on every question, even when uncertain. Trust your reasoning.”

This is disclosed because it was part of the submitted run; reviewers should weigh it as they see fit.

Run Configuration

The runtime invocation, environment variable settings, per-domain routing parameters, worker-pool tuning, and rate-limit configurations are not specified in this document. Reviewers requiring runtime replication can request the runner configuration via the contact email below under reasonable terms.

Score Reproducibility

Run | Sample | Score | Method
Calibration | sample_100_representative (100Q text) | 57/100 = 57.0% | canonical
Calibration replay at workers=6 | sample_100_representative | 57/100 = 57.0% | canonical
Fresh holdout | sample_200_fresh (200Q text) | 110/200 = 55.0% | canonical
Purpose-hard holdout | holdout_100_calibration | 53/100 = 53.0% | canonical
Full HLE (this submission) | text-only (2,158Q) | 1,119/2,158 = 51.85% | canonical

Three observations on reproducibility:

  1. Identical aggregate scores across two replays of the calibration sample. 51 questions were correct on both runs, 37 wrong on both, and 12 flipped: six toward correct, six toward wrong, net zero. The per-question flip rate at temperature 1 is therefore ~12% (12 of 100 answers changed between replays); the cross-leg ensemble compresses that variance at the score level.
  2. Score range across samples reflects spread-calibration difficulty, not size. The smaller samples are designed as representative spreads across HLE’s eight domains. The range (57% → 55% → 53% → 51.85%) reflects how hard the difficulty spread is to calibrate at higher score levels: when more of the discrimination happens at the hard end of the distribution, small spread variation moves the headline score meaningfully. Across the 400 questions in the three smaller samples, the average is 55%, close to the full HLE result of 51.85%.
  3. holdout_100_calibration is the strongest full-HLE predictor. Its 53% landed 1.15pp above the full-HLE actual of 51.85%, vs +5.15pp for sample_100_representative and +3.15pp for sample_200_fresh. It is used as the canonical pre-submission validation sample for v8 onward.
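The arithmetic behind these three observations can be checked with a short script. The per-question vectors below are synthetic, rebuilt only from the reported counts (51 correct on both replays, 37 wrong on both, six flips each way); they are not the actual per-question results, and the function and variable names are illustrative.

```python
def replay_summary(run_a, run_b):
    """Compare two per-question correctness vectors from replays of one sample."""
    pairs = list(zip(run_a, run_b))
    both_right = sum(1 for a, b in pairs if a and b)
    both_wrong = sum(1 for a, b in pairs if not a and not b)
    gained = sum(1 for a, b in pairs if not a and b)   # wrong -> correct
    lost = sum(1 for a, b in pairs if a and not b)     # correct -> wrong
    return {
        "score_a": both_right + lost,
        "score_b": both_right + gained,
        "both_right": both_right,
        "both_wrong": both_wrong,
        "flip_rate": (gained + lost) / len(pairs),
    }


# Synthetic vectors matching the reported counts, not real per-question data.
run_a = [True] * 51 + [False] * 37 + [False] * 6 + [True] * 6
run_b = [True] * 51 + [False] * 37 + [True] * 6 + [False] * 6
summary = replay_summary(run_a, run_b)

# (correct, total) per sample, as listed in the reproducibility table.
SAMPLES = {
    "sample_100_representative": (57, 100),
    "sample_200_fresh": (110, 200),
    "holdout_100_calibration": (53, 100),
}
FULL = (1119, 2158)


def pct(correct, total):
    return 100.0 * correct / total


# Pooled accuracy over the 400 sample questions, and each sample's gap to full HLE.
pooled = pct(sum(c for c, _ in SAMPLES.values()), sum(n for _, n in SAMPLES.values()))
deltas = {name: round(pct(c, n) - pct(*FULL), 2) for name, (c, n) in SAMPLES.items()}
```

Here `summary` gives 57 for both replays with a flip rate of 0.12, `pooled` is 55.0, and `deltas` reproduces the +5.15pp, +3.15pp, and +1.15pp gaps quoted above.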

The cross-leg ensemble’s same-sample aggregate-reproducibility and the multi-sample bracketing together provide the basis for considering single-run validation sufficient for this submission. A 3-run full-HLE mean would tighten the variance estimate further and is candidate work for the next submission cycle.

Calibration Disclosure

Stated confidence is captured per-Q in the response footer (Confidence: <0-100>%). Calibration buckets from the full 2153-Q canonical-judged submission run (37 entries have no stated_confidence and are excluded from the table; 16 of those 37 were judged correct, accounting for the difference between the bucket-summed 1103 and the headline 1119):

Confidence | n | Correct | Acc | abs(bucket-mid − acc)
90-100 | 773 | 530 | 68.6% | 26pp
70-89 | 950 | 475 | 50.0% | 30pp
50-69 | 283 | 83 | 29.3% | 30pp
30-49 | 79 | 14 | 17.7% | 22pp
<30 | 31 | 1 | 3.2% | 11pp

Expected Calibration Error (ECE) ≈ 28% (n-weighted across the 2116 bucketed Qs). The ranking is monotonically correct: accuracy falls with each lower confidence bucket, and 90+ confidence answers are about 2.3× as accurate as 50-69 answers (68.6% vs 29.3%). The absolute values are systematically inflated by ~26-30pp across the top three buckets.
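The ECE figure can be recomputed from the bucket table with a few lines. This is a sketch of the n-weighted bucket calculation as described, with one assumption: the bucket midpoints (95, 79.5, 59.5, 39.5, and 14.5 for the open-ended <30 bucket) are inferred from the gap column rather than stated in the text.

```python
# (label, assumed midpoint in %, n, correct); midpoints are inferred, not stated.
BUCKETS = [
    ("90-100", 95.0, 773, 530),
    ("70-89", 79.5, 950, 475),
    ("50-69", 59.5, 283, 83),
    ("30-49", 39.5, 79, 14),
    ("<30", 14.5, 31, 1),
]


def bucket_ece(buckets):
    """n-weighted mean of |bucket midpoint - bucket accuracy|, in percentage points."""
    total = sum(n for _, _, n, _ in buckets)
    weighted = sum(n * abs(mid - 100.0 * c / n) for _, mid, n, c in buckets)
    return weighted / total


ece = bucket_ece(BUCKETS)

# Reconciliation with the headline numbers: 37 entries lack a stated confidence,
# and 16 of those were judged correct.
bucketed_n = sum(n for _, _, n, _ in BUCKETS)
bucketed_correct = sum(c for _, _, _, c in BUCKETS)
assert bucketed_n + 37 == 2153
assert bucketed_correct + 16 == 1119
```

Under these midpoints `ece` comes out at roughly 27.9, which rounds to the reported 28%. A different convention for the <30 bucket would change the result only slightly, since that bucket holds 31 of the 2116 answers.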

No calibration tuning was performed on this submission. The stated confidence is raw self-reported output from the solver, with no post-hoc rescaling, no Platt scaling, no isotonic regression, and no calibration-aware prompt adjustments. This was a known limitation going in; under limited resources it was deprioritized in favor of accuracy lift (cross-architectural pair-vote, the simple-arbiter prompt, the direct-commit bypass) rather than confidence rescaling. The 28% ECE here is therefore an untuned baseline. The meaningful result is that the ranking signal is preserved end-to-end, which is the prerequisite for any future calibration-aware rescoring. Items noted for the next submission cycle: a calibration-aware re-scoring layer, per-domain confidence rebalancing, and an evaluation of whether the answer-format preamble’s commit-even-when-uncertain instruction is materially contributing to the overconfidence (see the HLE-Aware Preamble section above).

One distinction worth making for reviewers reading both this paper and the broader research writeup. The self-reported confidence number measured here is not the same thing as the agent’s broader epistemic discipline. HLE’s stated_confidence is a single integer the solver emits at the end of a one-shot answer to a question whose reasoning trace will be discarded by the grader; it is a benchmark-specific output, partly shaped by the answer-format preamble that instructs the agent to commit even when uncertain. The agent’s research-mode epistemic behavior — claims tagged E0/E1/E2 by evidence tier, contradiction detection across the claim ledger, refusal to upgrade confidence without registered sources, explicit “I have no evidence for this” annotations during multi-turn work — is a different mechanism on a different surface, running per-claim rather than per-answer and revisable across turns rather than committed in one shot. The 28% ECE is a critique of the single-number self-confidence on a frozen benchmark; it is not a critique of the agent’s underlying evidence handling, which is auditable separately. Both are real, both are honest about what they measure, and both have improvement work ahead — but they should not be conflated.

Cost

For comparison, typical published agent-with-tools submissions appear to spend $5-$15 per question based on described multi-stage architectures (specialized vision routing, OCR pre-passes, multi-model ensembles per question type). FF-STACK v8 uses a single uniform pipeline with two to three solver legs and one arbiter per question, resulting in materially lower marginal cost.

Limitations

  1. Image questions excluded. This submission is text-only.
  2. Single-run validation. The full-HLE result is from one run. Same-sample aggregate variance is essentially zero on the calibration replay; a multi-run mean on the full set was not pursued. Calibration samples suggest the single-run score sits in the lower portion of the expected variance band; additional runs would better estimate variance and central tendency. A 3-run full-HLE mean is candidate work for the next submission cycle.
  3. Calibration over-confidence. Ranking is correct; absolute confidence values are inflated by ~26-30pp across the top three buckets.
  4. Safety-refusal pattern on biological/chemical content. Approximately 3% of Bio/Chem questions trigger a content-filter refusal at the architecture level. The direct-commit bypass mechanism rescues most of these (validated on multiple Bio refusal-cascade cases). The bypass mechanism’s success rate is bounded by the cross-architectural leg’s knowledge on the same content.
  5. Five unrecovered questions. Out of 2,158 text-only questions, 2,153 saved successfully. Five hit deterministic content-filter refusal patterns that the bypass mechanism could not rescue. These five are scored as incorrect — they remain in the 2,158-question denominator and contribute zero to the 1,119 numerator. Maximum upside if all five had been recoverable and correct: +0.23pp.
  6. Opus 4.7 tuning is comparatively immature. The majority of the stack tuning occurred on Opus 4.6; the 4.7-specific work was compressed into a shorter window after 4.7 launched mid-project. The single-model 4.7 configs are correspondingly less mature than the 4.6-era gold, and per-mode tuning and further study on 4.7 remain an underexplored lever that could close or invert the 4.6/4.7 single-model gap (see the Cross-Generation Ablation). Additional 4.7-specific tuning is candidate work for the next submission cycle.
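The score bounds quoted in this document follow directly from the 1,119 / 2,158 headline. The short check below recomputes the upside from the five unrecovered questions and the worst-case adjustment from the two correct answers that followed an “HLE”-named search; variable names are illustrative.

```python
CORRECT, TOTAL = 1119, 2158


def pct(correct, total=TOTAL):
    """Accuracy in percent over the fixed 2,158-question denominator."""
    return 100.0 * correct / total


headline = pct(CORRECT)

# Upper bound: all five unrecovered questions had been answered correctly.
upside_pp = pct(CORRECT + 5) - headline

# Lower bound: the two correct answers that followed an "HLE"-named search
# are reverted to wrong by policy.
leak_worst_case_pp = pct(CORRECT - 2) - headline

print(f"headline {headline:.2f}%, upside +{upside_pp:.2f}pp, "
      f"leak worst case {leak_worst_case_pp:.2f}pp")
```

Because the five unrecovered questions stay in the denominator, each question moves the score by about 0.046pp in either direction.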

What Can Be Independently Verified

The proprietary boundary is deliberate. Everything that backs a claim in this document is independently checkable:

What cannot be independently reproduced is the parameter-level tuning — see What Remains Unpublished below. The architecture is reproducible at the box level; the score is verifiable end to end.

What Remains Unpublished

The following are deliberately not specified in this document and remain closed:

The box-level architecture, the mechanism descriptions, the model stack, the filtering policy, the calibration data, and all empirical results above are publicly disclosed. The methodology is reproducible at the architectural level; the parameter-level tuning is not.

Predictions Files

Three artifacts, three levels of access:

Repository and Acknowledgments

The FF-STACK framework, the FF-LATTICE governance text, the pipeline orchestration logic, the per-domain routing configuration, and the codified agent (Cade) are closed source. The methodology described in this document, the failure-mode taxonomy, the evidence-tier framework, and the empirical results above are publicly disclosed.

This submission builds on Anthropic’s Claude API, OpenAI’s Chat Completions API, Brave Search, and a number of free domain APIs (Wikipedia, PubChem, UniProt, NCBI, Wolfram Alpha, Wikidata). The HLE benchmark is a collaboration of CAIS and Scale AI. The judge methodology is bit-identical to CAIS’s centerforaisafety/hle/hle_eval/run_judge_results.py.


Listed on the HLE Leaderboard for Agents with Tools (zoom-ai HuggingFace Space, added 2026-05-19). Scrubbed predictions and a verifier: github.com/FieldframeLabs/HLE-Text-Run.

Contact: edvorochkin@gmail.com

