
HLE Submission Methodology Paper — FF-STACK v8

Submission to the HLE Leaderboard for Agents with Tools.

| Field | Value |
|---|---|
| Model | FF-STACK v8 |
| Models Used | Opus 4.7 + GPT-5.4 |
| Organization | Fieldframe Labs |
| Open Source | No — methodology paper is public; framework source is closed |
| Publish Date | [TBD] |
| Text-Only Score | 1119 / 2158 = 51.85% |
| Full Set Score | not submitted (text-only run) |
| Per-Q Cost (real) | ~$1.60 |
| Total Run Cost | ~$3,500 (Anthropic + OpenAI) + ~$25 judge |
| Total Wall Time | ~57 hours, including pauses for two API outages |
| Filtering | ✓ 9-host HLE-leakage blacklist on web_search + web_fetch |

What FF-STACK v8 Is

FF-STACK v8 is a cross-architectural reasoning agent built on Claude Opus 4.7 (primary solver) and GPT-5.4 (cross-architectural verification leg). Both base models operate inside a governance framework called FF-LATTICE — a codified set of reasoning, evidence, and commit principles developed across ~1 year of cross-architecture LLM research (Claude, GPT, Gemini, Grok). The framework is content-proprietary; its empirical effect is documented in the cross-generation ablation below.

The pipeline structure is:

  1. Sentinel (Haiku domain classifier)
  2. Forge (pre-research, vanilla-API knowledge primer)
  3. Solver(s) — 2-3 legs per question, including cross-architectural pairing on most domains
  4. Arbiter (Opus 4.7 + full governance + tools)
  5. Final answer + provenance trace

Sentinel is a Haiku-class domain classifier that picks one of eight HLE categories. Lightweight, fast, cheap.

Forge is a pre-research stage that fires domain-specific lookups against free APIs (Wikipedia, PubChem, UniProt, NCBI) before the heavy solver loop. Runs on a vanilla model call — no governance overhead.
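
For concreteness, here is a minimal sketch of what a Forge-style primer pass could look like. The endpoints shown (Wikipedia REST summary, PubChem PUG REST) are real public APIs; the domain routing, term handling, and snippet assembly are illustrative assumptions, not the submitted configuration.

```python
import requests
from urllib.parse import quote

def wiki_summary(term: str) -> str:
    # Wikipedia REST summary endpoint (public, keyless).
    r = requests.get(
        f"https://en.wikipedia.org/api/rest_v1/page/summary/{quote(term)}",
        timeout=10,
    )
    return r.json().get("extract", "") if r.ok else ""

def pubchem_properties(term: str) -> str:
    # PubChem PUG REST: molecular formula and IUPAC name by compound name.
    r = requests.get(
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
        f"{quote(term)}/property/MolecularFormula,IUPACName/JSON",
        timeout=10,
    )
    return r.text if r.ok else ""

# Hypothetical domain-to-source routing; UniProt and NCBI lookups omitted.
SOURCES = {"chemistry": pubchem_properties}

def forge_primer(domain: str, key_terms: list[str]) -> str:
    """Collect best-effort primer snippets for the solver's context."""
    lookup = SOURCES.get(domain, wiki_summary)
    snippets = []
    for term in key_terms:
        try:
            snippets.append(lookup(term))
        except requests.RequestException:
            continue  # primer is best-effort; the solver runs regardless
    return "\n\n".join(s for s in snippets if s)
```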

The solver loop is where governance lives. Each solver leg runs Claude Opus 4.7 (in one of three effort modes — high, xhigh, or high with adaptive thinking) wrapped in the FF-STACK governance prompt and given access to 18 tools: the standard set (Brave web_search, web_fetch, a Python compute sandbox with a hard kill, Wolfram Alpha, Wikipedia) plus several custom ones. A vanilla GPT-5.4 reasoning=high leg is also used to draw on a different training distribution, though GPT-5.4 has not yet been tuned for the stack.

For most domains, two solver legs run in parallel — frequently using cross-architectural pairing (one FF Opus, one vanilla GPT-5.4 reasoning=high). On harder domains, a third leg joins. On agreement at the pair stage, the pipeline commits. On disagreement, all reasoning chains flow into the arbiter.

The arbiter is always FF Opus 4.7 with full governance, full tools, and the Forge context. It reads the disagreeing chains, runs independent verification, and commits a final answer.
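
The commit/escalate control flow can be summarized in a short sketch. Every name here is a hypothetical stand-in; the real per-domain pair selection, third-leg triggers, and governance content are proprietary.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable

@dataclass
class Chain:
    final_answer: str
    reasoning: str

SolverLeg = Callable[[str], Chain]  # a governed Opus leg or vanilla GPT leg

def run_question(
    question: str,
    legs: list[SolverLeg],                           # 2 legs, 3 on hard domains
    semantically_equal: Callable[[str, str], bool],  # Haiku-class comparison
    arbitrate: Callable[[str, list[Chain]], str],    # governed Opus arbiter
) -> str:
    # Run the solver legs in parallel (cross-architectural pairing).
    with ThreadPoolExecutor(max_workers=len(legs)) as pool:
        chains = list(pool.map(lambda leg: leg(question), legs))

    answers = [c.final_answer for c in chains]
    if all(semantically_equal(a, answers[0]) for a in answers[1:]):
        return answers[0]  # agreement at the pair stage: commit

    # Disagreement: all reasoning chains flow into the arbiter, which
    # re-verifies independently with full tools plus the Forge context.
    return arbitrate(question, chains)
```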

The per-domain pair selection, third-leg triggers, self-review pass conditions, broken-leg detection logic, refusal-bypass priority order, and governance prompt content are not specified in this document. The box-level architecture above is the reproducible surface; the orchestration logic and governance content are the proprietary substance.

What is specified openly: the models. The pipeline uses Opus 4.7 in three effort modes (high, xhigh, thinking), all governed, plus vanilla GPT-5.4 reasoning=high as the cross-architectural leg. Haiku 4.5 handles classification and semantic comparison. OpenAI’s o4-mini powers the thinking-as-tool callable, and OpenAI’s o3-mini performs the final judging per CAIS’s canonical HLE judge methodology.

Mechanisms (general-level descriptions)

The pipeline implements several specific mechanisms whose effects are documented but whose triggering logic and parameter choices are proprietary. Those referenced elsewhere in this document include the direct-commit refusal bypass, the retry-time reframe, broken-leg detection, the self-review pass, and the thinking-as-tool callable.

Cross-Generation Ablation

The strongest evidence that the governance framework drives the lift — rather than being an artifact of tooling, scaffolding, or model choice — comes from applying the same architecture pattern to the previous model generation (Opus 4.6) on the same question set.

The comparison sample is a representative subset that tracks full-HLE behavior within 1pp (V8 here = 53%, V8 on full HLE = 51.85%). Every cell below is on the same sample, judged by the same canonical o3-mini judge:

| Config | Score | Notes |
|---|---|---|
| Vanilla Opus 4.6 | 18% | bare API |
| Vanilla Opus 4.7 high | 29-33% | bare API |
| Vanilla Opus 4.7 xhigh | 30-33% | bare API |
| Vanilla GPT-5.4 reasoning=high | 35% | bare API |
| Vanilla Opus 4.7 thinking-high | 38% | bare API |
| FF-STACK on Opus 4.6 | 45% | full FF-STACK on Opus 4.6 — the 4.6-era gold |
| V8 cross-arch — full pipeline | 53% | submission config — within 1pp of full HLE |

Points worth noting:

  1. The same governance pattern applied to the previous generation lifts Opus 4.6 from 18% (vanilla) to 45%, a +27pp jump on an unchanged base model; the lift is not an artifact of the 4.7 refresh.
  2. Every vanilla configuration, across both vendors and all effort modes, lands in the 18-38% band; the full cross-architectural pipeline at 53% clears the best vanilla leg by 15pp.

Filtering Policy (qualifies for verified ✓ badge)

The pipeline post-filters web_search results and pre-rejects web_fetch URLs that match dataset distribution channels. The submitted policy is a 9-host blacklist.
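
A sketch of the two-sided filter, with a placeholder hostname standing in for the 9-host list (not reproduced here):

```python
from urllib.parse import urlparse

BLACKLIST = {"example-dataset-host.org"}  # placeholder, not the real list

def host_blocked(url: str) -> bool:
    host = urlparse(url).hostname or ""
    # Block the host itself and any subdomain of it.
    return any(host == b or host.endswith("." + b) for b in BLACKLIST)

def filter_search_results(results: list[dict]) -> list[dict]:
    """Post-filter web_search results before the solver sees them."""
    return [r for r in results if not host_blocked(r["url"])]

def guard_fetch(url: str) -> None:
    """Pre-reject web_fetch calls to dataset distribution channels."""
    if host_blocked(url):
        raise PermissionError(f"blocked dataset-distribution host: {url}")
```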

Audit basis (calibration runs): Across the 100Q calibration sample (canonical-judged 57/100) and 232Q holdout (110/200 = 55% on text portion), 0 blacklisted URLs were attempted across 522 web_search + 95 web_fetch calls combined.

Audit basis (full submission run, post-hoc): A full-run audit of the submitted 2153-Q predictions file surfaced 2 attempted fetches to channels that were not on the run-time 7-host blacklist.

Both questions were independently judged WRONG by the canonical o3-mini judge, so the net score impact of these two channels not being blocked at run time is 0pp on 1119/2158 = 51.85%. The submitted blacklist policy is the patched 9-host version above, which would block both channels; the run-time policy was the 7-host version. This delta is disclosed in full: the reported 51.85% is bit-identical to what the patched policy would have produced on the same architecture, the same questions, and the same response chains, because the two now-blocked fetches contributed zero correct judgments.

In addition, the audit found 6 further QIDs (8 total) where the search-query string contained the literal token “HLE”. Of those 8 lookup-attempt QIDs, 6 were judged wrong and 2 were judged correct. The 2 correct outcomes did NOT quote leaked content in their response chains; they reached the right answer via reasoning paths that happened to occur after a leak-named search. Reverting those 2 to wrong-by-policy gives a worst-case ceiling impact of −0.09pp. The agent’s awareness that the questions are HLE-style is partly intrinsic (Opus 4.7 has a January 2026 cutoff; HLE was published in 2025, and the question style is recognizable from training data) and partly explicit via the user-message preamble (see Methodology Disclosure below).
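
The token audit reduces to a scan over per-question tool-call logs. A minimal sketch, assuming a hypothetical JSONL predictions schema with qid, tool_calls, and judge_correct fields:

```python
import json
import re

def audit_hle_token(predictions_path: str) -> list[tuple[str, bool]]:
    """Return (qid, judged_correct) for questions whose web_search
    queries contain the literal token "HLE"."""
    flagged = []
    with open(predictions_path) as f:
        for line in f:  # one JSON record per question (assumed schema)
            rec = json.loads(line)
            queries = [
                call["query"]
                for call in rec.get("tool_calls", [])
                if call.get("tool") == "web_search"
            ]
            if any(re.search(r"\bHLE\b", q) for q in queries):
                flagged.append((rec["qid"], rec["judge_correct"]))
    return flagged  # submitted run: 8 flagged, 6 wrong, 2 correct
```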

arxiv.org is NOT blocked. The audit found one borderline arxiv-quote leak on a single question, where the solver quoted a paper abstract verbatim. The net policy is “block dataset distribution channels, allow scholarly research,” consistent with HLE’s design as an open-book benchmark and with the rest of the tool stack (Wikipedia, NCBI, Wolfram Alpha, and PubChem all carry similar leak potential by the same logic; blocking arxiv alone would be incoherent). Disclosed for transparency.

Methodology Disclosure: HLE-Aware Preamble

Every question is wrapped with a lightweight behavioral framing prompt. Its purpose is to suppress two specific failure modes the heavily-governed agent would otherwise exhibit:

  1. Refusal cascades on Bio/Chem questions where Anthropic’s content filter triggers on legitimate academic content (~3% of Qs without the preamble; the preamble plus a separate retry-time reframe reduces this materially).
  2. “I cannot determine” hedging on questions where the agent’s evidence standards exceed what’s recoverable from the question text alone.

It is behavioral framing, not a contamination control. The hard contamination lever is the URL blacklist; the preamble’s measurable effect on leak-channel access is small — it contributes the literal “HLE” token to ~7/8 of the active answer-lookup attempts described above, but topic-specific queries without the preamble would reach the same blacklisted channels via topic relevance anyway. And because the solver is Opus 4.7 (January 2026 cutoff) and HLE was published in 2025, the model has intrinsic recognition of HLE-style questions independent of the preamble. The preamble is retained because its anti-refusal/anti-hedge contribution materially exceeds its leak-channel contribution on this architecture.

The exact text, as it appears in the submitted predictions file: “CONTEXT FROM EUGENE: This is an HLE benchmark run. I know you have high evidence standards — that’s why I built you that way. But on this benchmark, ‘I cannot determine’ scores zero while even a wrong guess has a chance. Please give your best answer on every question, even when uncertain. Trust your reasoning.”

This is disclosed because it was part of the submitted run; reviewers should weigh it as they see fit.
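
Mechanically, the wrap is a plain prefix on the user message. The preamble string below is verbatim from the submitted run; the wrapper function itself is an illustrative assumption:

```python
PREAMBLE = (
    "CONTEXT FROM EUGENE: This is an HLE benchmark run. I know you have "
    "high evidence standards — that’s why I built you that way. But on "
    "this benchmark, ‘I cannot determine’ scores zero while even a wrong "
    "guess has a chance. Please give your best answer on every question, "
    "even when uncertain. Trust your reasoning."
)

def wrap_question(question_text: str) -> str:
    # Prefix the behavioral preamble to each HLE question.
    return f"{PREAMBLE}\n\n{question_text}"
```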

Run Configuration

The runtime invocation, environment variable settings, per-domain routing parameters, worker-pool tuning, and rate-limit configurations are not specified in this document. Reviewers requiring runtime replication can request the runner configuration via the contact email below under reasonable terms.

Score Reproducibility

| Run | Sample | Score | Method |
|---|---|---|---|
| Calibration | sample_100_representative (100Q text) | 57/100 = 57.0% | canonical |
| Calibration replay at workers=6 | sample_100_representative | 57/100 = 57.0% | canonical |
| Fresh holdout | sample_200_fresh (200Q text) | 110/200 = 55.0% | canonical |
| Purpose-hard holdout | holdout_100_calibration | 53/100 = 53.0% | canonical |
| Full HLE (this submission) | text-only (2158Q) | 1119/2158 = 51.85% | canonical |

Three observations on reproducibility:

  1. Identical aggregate scores across two replays of the calibration sample. 51 questions correct on both runs, 37 wrong on both, 12 flipped — six toward correct, six toward wrong, net zero. Per-Q variance at temperature 1 is ~12%; the cross-leg ensemble compresses it at the score level.
  2. Sample-size descent is consistent across samples. Smaller samples skew toward easier, faster-completing questions; the descent (57% → 55% → 53% → 51.85%) follows the expected pattern as the question set grows.
  3. holdout_100_calibration is the strongest full-HLE predictor. Its 53% landed 1.15pp above the full-HLE actual of 51.85%, vs +5.15pp for sample_100_representative and +3.15pp for sample_200_fresh. Used as the canonical pre-submission validation sample for v8 onward.

The cross-leg ensemble’s same-sample aggregate-reproducibility and the multi-sample bracketing together provide the basis for considering single-run validation sufficient for this submission. A 3-run full-HLE mean would tighten the variance estimate further and is candidate work for the next submission cycle.
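
Observation 1 amounts to a per-question flip matrix between the two replays; a minimal sketch of that check:

```python
def flip_matrix(run_a: dict[str, bool], run_b: dict[str, bool]):
    """Decompose two same-sample runs into stable and flipped questions."""
    both = neither = to_wrong = to_correct = 0
    for qid, a in run_a.items():
        b = run_b[qid]
        both += int(a and b)
        neither += int(not a and not b)
        to_wrong += int(a and not b)
        to_correct += int(b and not a)
    return both, neither, to_correct, to_wrong

# The submitted calibration replay decomposes as (51, 37, 6, 6): identical
# 57/100 aggregates despite ~12% per-question variance at temperature 1.
```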

Calibration Disclosure

Stated confidence is captured per-Q in the response footer (Confidence: <0-100>%). Calibration buckets from the full 2153-Q canonical-judged submission run (37 entries have no stated_confidence and are excluded from the table; 16 of those 37 were judged correct, accounting for the difference between the bucket-summed 1103 and the headline 1119):

| Confidence | n | Correct | Acc | \|bucket-mid − acc\| |
|---|---|---|---|---|
| 90-100 | 773 | 530 | 68.6% | 26pp |
| 70-89 | 950 | 475 | 50.0% | 30pp |
| 50-69 | 283 | 83 | 29.3% | 30pp |
| 30-49 | 79 | 14 | 17.7% | 22pp |
| <30 | 31 | 1 | 3.2% | 11pp |

Expected Calibration Error (ECE) ≈ 28% (n-weighted across the 2116 bucketed Qs). The ranking is monotonically correct: 90-100 confidence answers are 2.3× more accurate than 50-69 answers and 3.9× more accurate than 30-49 answers. The absolute values are systematically inflated by ~26-30pp across the top three buckets. Disclosed as a known limitation; a calibration-aware re-scoring layer is candidate work for the next submission cycle.
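
The ECE figure can be re-derived directly from the table above; a short check, assuming a midpoint of 14.5 for the <30 bucket:

```python
# Re-deriving the n-weighted ECE from the calibration buckets.
buckets = [  # (bucket midpoint, n, correct)
    (95.0, 773, 530),
    (79.5, 950, 475),
    (59.5, 283, 83),
    (39.5, 79, 14),
    (14.5, 31, 1),  # <30 bucket; midpoint approximated
]

total_n = sum(n for _, n, _ in buckets)  # 2116 bucketed questions
ece = sum(n * abs(mid - 100 * c / n) for mid, n, c in buckets) / total_n
print(f"ECE ≈ {ece:.1f}pp")  # ≈ 27.9pp, matching the ≈28% stated above
```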

Cost

For comparison, typical published agent-with-tools submissions appear to spend $5-$15 per question based on described multi-stage architectures (specialized vision routing, OCR pre-passes, multi-model ensembles per question type). FF-STACK v8 uses a single uniform pipeline with two to three solver legs and one arbiter per question, resulting in materially lower marginal cost.

Limitations

  1. Image questions excluded. This submission is text-only.
  2. Single-run validation. The full-HLE result is from one run. Same-sample aggregate variance is essentially zero on the calibration replay; a multi-run mean on the full set was not pursued. Calibration samples suggest the single-run score sits in the lower portion of the expected variance band; additional runs would better estimate variance and central tendency. A 3-run full-HLE mean is candidate work for the next submission cycle.
  3. Calibration over-confidence. Ranking is correct; absolute confidence values are inflated by ~26-30pp across the top three buckets.
  4. Safety-refusal pattern on biological/chemical content. Approximately 3% of Bio/Chem questions trigger a content-filter refusal at the architecture level. The direct-commit bypass mechanism rescues most of these (validated on multiple Bio refusal-cascade cases). The bypass mechanism’s success rate is bounded by the cross-architectural leg’s knowledge on the same content.
  5. Five unrecovered questions. Out of 2,158 text-only questions, 2,153 saved successfully. Five hit deterministic content-filter refusal patterns that the bypass mechanism could not rescue. These five are scored as incorrect — they remain in the 2,158-question denominator and contribute zero to the 1,119 numerator. Maximum upside if all five had been recoverable and correct: +0.23pp.

What Remains Unpublished

The following are deliberately not specified in this document and remain closed:

  1. the FF-LATTICE governance prompt content
  2. per-domain pair selection and third-leg trigger logic
  3. self-review pass conditions and broken-leg detection logic
  4. the refusal-bypass priority order
  5. the runtime configuration: invocation, environment variables, per-domain routing parameters, worker-pool tuning, and rate limits

The box-level architecture, the mechanism descriptions, the model stack, the filtering policy, the calibration data, and all empirical results above are publicly disclosed. The methodology is reproducible at the architectural level; the parameter-level tuning is not.

Predictions Files

Three artifacts exist, at three levels of access; the publicly accessible one (scrubbed predictions plus a verifier script) is linked under Repository and Acknowledgments below.

Repository and Acknowledgments

The FF-STACK framework, the FF-LATTICE governance text, the pipeline orchestration logic, the per-domain routing configuration, and the codified agent (Cade) are closed source. The methodology described in this document, the failure-mode taxonomy, the evidence-tier framework, and the empirical results above are publicly disclosed.

This submission builds on Anthropic’s Claude API, OpenAI’s Chat Completions API, Brave Search, and a number of free domain APIs (Wikipedia, PubChem, UniProt, NCBI, Wolfram Alpha, Wikidata). The HLE benchmark is a collaboration of CAIS and Scale AI. The judge methodology is bit-identical to CAIS’s centerforaisafety/hle/hle_eval/run_judge_results.py.


Submitted to the HLE Leaderboard for Agents with Tools (zoom-ai HuggingFace Space). Scrubbed predictions and a verifier: github.com/FieldframeLabs/HLE-Text-Run.

Contact: edvorochkin@gmail.com

