HLE Submission Methodology Paper — FF-STACK v8

Submission to the HLE Leaderboard for Agents with Tools.

Field	Value
Model	FF-STACK v8
Models Used	Opus 4.7 + GPT-5.4
Organization	Fieldframe Labs
Open Source	No — methodology paper is public; framework source is closed
Publish Date	[TBD]
Text-Only Score	1119 / 2158 = 51.85%
Full Set Score	not submitted (text-only run)
Per-Q Cost (real)	~$1.60
Total Run Cost	~$3,500 (Anthropic + OpenAI) + ~$25 judge
Total Wall Time	~57 hours including pauses for two API outages
Filtering	✓ 9-host HLE-leakage blacklist on web_search + web_fetch

What FF-STACK v8 Is

FF-STACK v8 is a cross-architectural reasoning agent built on Claude Opus 4.7 (primary solver) and GPT-5.4 (cross-architectural verification leg). Both base models operate inside a governance framework called FF-LATTICE — a codified set of reasoning, evidence, and commit principles developed across ~1 year of cross-architecture LLM research (Claude, GPT, Gemini, Grok). The framework is content-proprietary; its empirical effect is documented in the cross-generation ablation below.

The pipeline structure is:

   Sentinel (Haiku domain classifier)
        │
        ▼
   Forge (pre-research, vanilla-API knowledge primer)
        │
        ▼
   Solver(s) — 2-3 legs per question, including cross-architectural pairing on most domains
        │
        ▼
   Arbiter (Opus 4.7 + full governance + tools)
        │
        ▼
   Final answer + provenance trace

Sentinel is a Haiku-class domain classifier that picks one of eight HLE categories. Lightweight, fast, cheap.

Forge is a pre-research stage that fires domain-specific lookups against free APIs (Wikipedia, PubChem, UniProt, NCBI) before the heavy solver loop. Runs on a vanilla model call — no governance overhead.

The solver loop is where governance lives. Each solver leg uses Claude Opus 4.7 (in one of three effort modes: high, xhigh, or high with adaptive thinking) wrapped in the FF-LATTICE governance prompt and given access to 18 tools: standard ones (Brave web_search, web_fetch, a Python compute sandbox with a hard kill, Wolfram Alpha, Wikipedia) plus several custom ones. Vanilla GPT-5.4 at reasoning=high is also used as a solver leg to benefit from different training data, though GPT has not yet been tuned for the stack.

For most domains, two solver legs run in parallel — frequently using cross-architectural pairing (one FF Opus, one vanilla GPT-5.4 reasoning=high). On harder domains, a third leg joins. On agreement at the pair stage, the pipeline commits. On disagreement, all reasoning chains flow into the arbiter.

The arbiter is always FF Opus 4.7 with full governance, full tools, and the Forge context. It reads the disagreeing chains, runs independent verification, and commits a final answer.
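The commit-or-arbitrate step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's orchestration code: the names `normalize` and `resolve` are hypothetical, and the string normalization stands in for the Haiku-based semantic comparison, whose actual behavior is not specified in this document.

```python
from typing import Callable, Sequence, Tuple


def normalize(answer: str) -> str:
    """Hypothetical stand-in for semantic comparison.

    Folds case, collapses whitespace, and drops a trailing period.
    """
    return " ".join(answer.lower().split()).rstrip(".")


def resolve(
    leg_answers: Sequence[str],
    arbiter: Callable[[Sequence[str]], str],
) -> Tuple[str, str]:
    """Return (final_answer, route) for one question."""
    distinct = {normalize(a) for a in leg_answers}
    if len(distinct) == 1:
        # All solver legs agree: commit without invoking the arbiter.
        return leg_answers[0], "pair-commit"
    # Any disagreement: the arbiter receives every leg's output and decides.
    return arbiter(leg_answers), "arbiter"
```

The arbiter is passed in as a callable so the sketch stays independent of any model API; in the real pipeline that callable would wrap the governed Opus 4.7 call with tools and Forge context.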

The per-domain pair selection, third-leg triggers, self-review pass conditions, broken-leg detection logic, refusal-bypass priority order, and governance prompt content are not specified in this document. The box-level architecture above is the reproducible surface; the orchestration logic and governance content are the proprietary substance.

What is specified openly: the models. The pipeline uses Opus 4.7 in three effort modes (high, xhigh, thinking), all governed, plus vanilla GPT-5.4 reasoning=high as the cross-architectural leg. Haiku 4.5 handles classification and semantic comparison. OpenAI’s o4-mini powers the thinking-as-tool callable. OpenAI’s o3-mini, the judge model used by the canonical CAIS evaluation script, does the final judging.

Mechanisms (general-level descriptions)

The pipeline implements several specific mechanisms whose effect is documented but whose triggering logic and parameter choices are proprietary:

Cross-architectural pairing. Cross-model verification on most domains using Cade (short for Cadence, the codified Opus 4.7 agent) and GPT-5.4 as the two solver legs. Cross-architectural disagreement is a more reliable signal of question difficulty than same-architecture variance.
Direct-commit bypass on multi-leg failure. When multiple solver legs return broken responses (empty, error-prefix, or refusal patterns), the pipeline skips the arbiter and commits a healthy survivor directly. Priority favors cross-architectural survivors on safety-trigger content. Fires on roughly 3% of questions in practice.
Simple-arbiter prompt. A single neutral arbiter template with explicit anti-bias rules (“length is not evidence”, “plurality is not evidence”, “verify top candidates before committing”). Validated against several per-domain prompt variants on a stable-correct calibration cohort; the neutral template outperformed the specialized variants by reducing arbiter rationalization patterns.
Refusal-fallback priority. When the arbiter itself refuses to respond, the pipeline selects a healthy non-Cade leg first, then a healthy Cade leg, then any leg with content. Selection criteria were tightened after an audit of a Biology question where a truncated Cade response was being preferred over a correct cross-architectural answer.
Atomic checkpoint write + resume-by-ID. Every per-question save writes to a .tmp file and then atomically renames it over the predictions file, so a process kill mid-write cannot corrupt that file. On restart, the runner skips completed question IDs. Validated on a 20-question kill-and-resume smoke test (kill at Q10, restart, Q11+ fired cleanly).
Spiral detector. A round-by-round output-decay heuristic that fires conditional nudges when solver output enters a low-density loop. Trigger thresholds and nudge content are proprietary.
Self-review safety net. Content-gated re-prompt on short, missing, or refusal-shape responses. Fires on roughly 1-2% of questions.
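Of the mechanisms above, the atomic checkpoint write with resume-by-ID is the one that can be illustrated without touching proprietary logic. The following is a minimal sketch, assuming a single JSON predictions file keyed by question ID; the function names and the file layout are assumptions for illustration, not the runner's actual format.

```python
import json
import os
import tempfile


def save_predictions(path: str, predictions: dict) -> None:
    """Write to a .tmp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(predictions, f)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        # os.replace is an atomic rename on POSIX when both paths share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_predictions(path: str) -> dict:
    """Return saved predictions, or an empty dict on a fresh run."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def run(question_ids, solve, path: str) -> dict:
    """Solve each question once, checkpointing after every answer."""
    done = load_predictions(path)
    for qid in question_ids:
        if qid in done:
            continue  # resume-by-ID: skip anything already saved
        done[qid] = solve(qid)
        save_predictions(path, done)
    return done
```

Because the rename either happens or does not, a reader of the predictions file sees the previous complete state or the new complete state, never a partial write. Creating the temporary file in the target's own directory keeps both paths on one filesystem, which the atomic rename depends on.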

Cross-Generation Ablation

The strongest evidence that the governance framework drives the lift — rather than being an artifact of tooling, scaffolding, or model choice — comes from applying the same architecture pattern to the previous model generation (Opus 4.6) on the same question set.

The comparison sample is a representative subset that tracks full-HLE behavior within about 1pp (V8 here = 53%, V8 on full HLE = 51.85%, a 1.15pp gap). Every cell below is on the same sample, judged by the same canonical o3-mini judge:

Config	Score	Notes
Vanilla Opus 4.6	18%	bare API
Vanilla Opus 4.7 high	29-33%	bare API
Vanilla Opus 4.7 xhigh	30-33%	bare API
Vanilla GPT-5.4 reasoning=high	35%	bare API
Vanilla Opus 4.7 thinking-high	38%	bare API
FF-STACK on Opus 4.6	45%	full FF-STACK on Opus 4.6 — the 4.6-era gold
V8 cross-arch, full pipeline	53%	submission config; 1.15pp above full HLE

Points worth noting:

Vanilla baseline calibrates with third-party numbers. Scale AI’s published Opus 4.6 non-thinking score of 19% sits within sampling noise of the 18% measured on this subset.
The 4.6 governed config outscored vanilla base by 27 points, outscored Scale AI’s published Opus 4.6 thinking max (34%), and came within a point of Anthropic’s reported Opus 4.7 thinking max score of 46%.
Further ablations are needed on individual 4.7 modes, but the average lift was about 10-12 points per mode. Lift on 4.7 compressed compared to 4.6, which pushed the work toward more orchestration-as-governance to produce the multi-agent config.
The highest-scoring model in v8 was Opus 4.7 thinking-high, with a vanilla score of 38 on the sample. Full lift with the v8 cross-arch stack on top: 15 points.
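The lift figures quoted in these points follow directly from the ablation table. A minimal arithmetic check, where the dictionary keys are illustrative labels rather than config names from the run:

```python
# Scores in percent on the shared comparison sample, copied from the table.
scores = {
    "vanilla_opus_4_6": 18,
    "ff_stack_opus_4_6": 45,
    "vanilla_opus_4_7_thinking_high": 38,
    "v8_cross_arch": 53,
}

# Governance lift on the previous generation (same model, with vs without FF-STACK).
lift_4_6 = scores["ff_stack_opus_4_6"] - scores["vanilla_opus_4_6"]

# Full-stack lift over the strongest vanilla 4.7 mode.
lift_4_7 = scores["v8_cross_arch"] - scores["vanilla_opus_4_7_thinking_high"]

print(f"4.6 lift: {lift_4_6} points")  # 27
print(f"4.7 lift: {lift_4_7} points")  # 15
```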

Filtering Policy (qualifies for verified ✓ badge)

The pipeline post-filters web_search results and pre-rejects web_fetch URLs matching dataset distribution channels. The submitted policy is 9 hosts:

huggingface.co/datasets/cais (HLE dataset host)
huggingface.co/cais (CAIS HF organization)
huggingface.co/datasets/jxcai-scale (Scale-AI HLE prompt-set dump) — added Phase 2
github.com/cais (CAIS repos)
github.com/centerforaisafety (alt org name)
cais.io (CAIS website)
kaggle.com/datasets/cais (potential mirror)
scale.com/leaderboard/humanitys-last-exam (leaderboard pages, may quote answers)
solveforearth.substack.com/p/humanitys-last-exam (3rd-party HLE analysis with quoted Q content) — added Phase 2
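A filter over this list could look like the sketch below. It is a minimal illustration, not the pipeline's implementation: the function names and the `{"url": ...}` result shape are assumptions, and plain string-prefix matching on host plus path is one possible reading of the policy. Prefix matching deliberately over-blocks (for example, any GitHub organization whose name begins with "cais"), which is the conservative direction for a leakage control.

```python
from urllib.parse import urlsplit

# The 9 submitted entries, as host + path prefixes.
BLOCKED_PREFIXES = (
    "huggingface.co/datasets/cais",
    "huggingface.co/cais",
    "huggingface.co/datasets/jxcai-scale",
    "github.com/cais",
    "github.com/centerforaisafety",
    "cais.io",
    "kaggle.com/datasets/cais",
    "scale.com/leaderboard/humanitys-last-exam",
    "solveforearth.substack.com/p/humanitys-last-exam",
)


def _host_and_path(url: str) -> str:
    """Reduce a URL to lowercase 'host/path', dropping scheme and leading 'www.'."""
    parts = urlsplit(url if "://" in url else "https://" + url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host + parts.path.lower()


def is_blocked(url: str) -> bool:
    """Pre-reject check for web_fetch: True if the URL matches any blocked prefix."""
    return _host_and_path(url).startswith(BLOCKED_PREFIXES)


def filter_search_results(results: list) -> list:
    """Post-filter for web_search: drop results whose URL is blocked.

    Assumes each result is a dict with a "url" key (hypothetical shape).
    """
    return [r for r in results if not is_blocked(r["url"])]
```

Under this matching rule, both fetches surfaced in the post-hoc audit of the full run would be rejected, while arxiv.org and Wikipedia URLs pass through.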

Audit basis (calibration runs): Across the 100Q calibration sample (canonical-judged 57/100) and 232Q holdout (110/200 = 55% on text portion), 0 blacklisted URLs were attempted across 522 web_search + 95 web_fetch calls combined.

Audit basis (full submission run, post-hoc): A full-run audit of the submitted 2153-Q predictions file surfaced 2 attempted fetches to channels that were not on the run-time 7-host blacklist:

A Humanities question attempted to fetch huggingface.co/datasets/jxcai-scale/hle-public-questions/...
A Biology question attempted to fetch solveforearth.substack.com/p/humanitys-last-exam-the-ultimate

Both questions were independently judged WRONG by the canonical o3-mini judge. Net score impact of these two channels not being blocked at run time: 0pp on 1119/2158 = 51.85%. The submitted blacklist policy is the patched 9-host version (above) which would block both channels; the run-time policy was the 7-host version. This delta is disclosed in full transparency — the 51.85% reported score is bit-identical to what the patched policy would have produced on the same architecture, the same questions, and the same response chains, because the two blocked-now fetches contributed zero correct judgments.

In addition, the audit found 6 further QIDs (8 total) where the search-query string contained the literal token “HLE”. Of those 8 lookup-attempt QIDs, 6 were judged wrong and 2 were judged correct. The 2 correct outcomes did NOT quote leaked content in their response chains — they reached the right answer via reasoning paths that happened to occur after a leak-named search. Reverting those 2 to wrong-by-policy gives a worst-case ceiling impact of −0.09pp. The agent’s awareness that the questions are HLE-style is partly intrinsic (Opus 4.7 has a January 2026 cutoff; HLE was published 2025 and the question style is recognizable from training data) and partly explicit via the user-message preamble (see Methodology Disclosure below).

arxiv.org NOT blocked. Audit found one borderline arxiv-quote leak on a single question where the solver quoted a paper abstract verbatim. Net policy: “block dataset distribution channels, allow scholarly research” — consistent with HLE’s design as an open-book benchmark and with the rest of the tool stack (Wikipedia, NCBI, Wolfram Alpha, PubChem all carry similar leak-potential by the same logic; blocking arxiv alone would be incoherent). Disclosed for transparency.

Methodology Disclosure: HLE-Aware Preamble

Every question is wrapped with a lightweight behavioral framing prompt. Its purpose is to suppress two specific failure modes the heavily-governed agent would otherwise exhibit:

Refusal cascades on Bio/Chem questions where Anthropic’s content filter triggers on legitimate academic content (~3% of Qs without the preamble; the preamble plus a separate retry-time reframe reduces this materially).
“I cannot determine” hedging on questions where the agent’s evidence standards exceed what’s recoverable from the question text alone.

It is behavioral framing, not a contamination control. The hard contamination lever is the URL blacklist; the preamble’s measurable effect on leak-channel access is small — it contributes the literal “HLE” token to ~7/8 of the active answer-lookup attempts described above, but topic-specific queries without the preamble would reach the same blacklisted channels via topic relevance anyway. And because the solver is Opus 4.7 (January 2026 cutoff) and HLE was published in 2025, the model has intrinsic recognition of HLE-style questions independent of the preamble. The preamble is retained because its anti-refusal/anti-hedge contribution materially exceeds its leak-channel contribution on this architecture.

The exact text, as it appears in the submitted predictions file: “CONTEXT FROM EUGENE: This is an HLE benchmark run. I know you have high evidence standards — that’s why I built you that way. But on this benchmark, ‘I cannot determine’ scores zero while even a wrong guess has a chance. Please give your best answer on every question, even when uncertain. Trust your reasoning.”

This is disclosed because it was part of the submitted run; reviewers should weigh it as they see fit.

Run Configuration

Hardware: single workstation, 6 concurrent worker threads.
Provider tiers: Anthropic + OpenAI Tier 4.
Models: claude-opus-4-7 (governed solver + arbiter), gpt-5.4 reasoning=high (cross-arch leg, vanilla), claude-haiku-4-5 (Sentinel domain classifier + answer extraction + semantic compare), o4-mini (deep_analysis tool), o3-mini-2025-01-31 (judge — bit-identical to canonical CAIS run_judge_results.py).
Tools active during the run: Brave web_search, web_fetch, compute (Python sandbox with a 30-second hard kill), wolfram_alpha, wikipedia_lookup, deep_analysis (OpenAI o4-mini via the thinking-as-tool callable). Twelve additional tools available but unused on this run.
Submission scoring: canonical-judge mode (regex extraction fallback disabled for bit-parity with CAIS canonical methodology).

The runtime invocation, environment variable settings, per-domain routing parameters, worker-pool tuning, and rate-limit configurations are not specified in this document. Reviewers requiring runtime replication can request the runner configuration via the contact email below under reasonable terms.

Score Reproducibility

Run	Sample	Score	Method
Calibration	sample_100_representative (100Q text)	57/100 = 57.0%	canonical
Calibration replay at workers=6	sample_100_representative	57/100 = 57.0%	canonical
Fresh holdout	sample_200_fresh (200Q text)	110/200 = 55.0%	canonical
Purpose-hard holdout	holdout_100_calibration	53/100 = 53.0%	canonical
Full HLE (this submission)	text-only (2158Q)	1119/2158 = 51.85%	canonical

Three observations on reproducibility:

Identical aggregate scores across two replays of the calibration sample. 51 questions were correct on both runs, 37 wrong on both, and 12 flipped (six toward correct, six toward wrong, net zero). The per-question flip rate between replays at temperature 1 is therefore 12%; the cross-leg ensemble compresses that variance at the score level.
Score descent is consistent across samples (57% → 55% → 53% → 51.85%). Smaller representative samples skew toward easier, faster-completing questions, so the gap to the full set narrows as the sample grows from 100Q to 200Q to the full 2,158Q. The 100Q purpose-hard holdout is the exception to the size ordering: it is small but lands closest to the full-set score.
holdout_100_calibration is the strongest full-HLE predictor. Its 53% landed 1.15pp above the full-HLE actual of 51.85%, vs +5.15pp for sample_100_representative and +3.15pp for sample_200_fresh. Used as the canonical pre-submission validation sample for v8 onward.
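The replay and bracketing figures in these observations can be re-derived from the counts given. A minimal sketch; variable names are illustrative, and which replay gained or lost which six questions is not stated here, so the assignment to `run_1` and `run_2` is arbitrary:

```python
# Replay agreement on the 100Q calibration sample.
both_correct, both_wrong = 51, 37
to_correct, to_wrong = 6, 6  # flips between the two replays

n = both_correct + both_wrong + to_correct + to_wrong
run_1 = both_correct + to_wrong    # correct on the first run, wrong on the replay
run_2 = both_correct + to_correct  # wrong on the first run, correct on the replay
flip_rate = (to_correct + to_wrong) / n

print(n, run_1, run_2, f"{flip_rate:.0%}")  # 100 57 57 12%

# Gap between each validation sample and the full-HLE result, in points.
full_hle = 100 * 1119 / 2158
samples = {
    "sample_100_representative": 57.0,
    "sample_200_fresh": 55.0,
    "holdout_100_calibration": 53.0,
}
deltas = {name: round(score - full_hle, 2) for name, score in samples.items()}
print(deltas)
```

The flips cancel exactly, which is why the two replays report the same 57/100 despite 12 questions changing verdict.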

The cross-leg ensemble’s same-sample aggregate-reproducibility and the multi-sample bracketing together provide the basis for considering single-run validation sufficient for this submission. A 3-run full-HLE mean would tighten the variance estimate further and is candidate work for the next submission cycle.

Calibration Disclosure

Stated confidence is captured per-Q in the response footer (Confidence: <0-100>%). Calibration buckets from the full 2153-Q canonical-judged submission run (37 entries have no stated_confidence and are excluded from the table; 16 of those 37 were judged correct, accounting for the difference between the bucket-summed 1103 and the headline 1119):

Confidence	n	Correct	Acc	|bucket-mid − acc|
90-100	773	530	68.6%	26pp
70-89	950	475	50.0%	30pp
50-69	283	83	29.3%	30pp
30-49	79	14	17.7%	22pp
<30	31	1	3.2%	11pp

Expected Calibration Error (ECE) ≈ 28% (n-weighted across the 2116 bucketed Qs). The ranking is monotonically correct: 90+ confidence answers are about 2.3× as accurate as 50-69 answers (68.6% vs 29.3%) and about 3.9× as accurate as 30-49 answers (68.6% vs 17.7%). The absolute values are systematically inflated by ~26-30pp across the top three buckets. Disclosed as a known limitation; a calibration-aware re-scoring layer is candidate work for the next submission cycle.
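The ECE figure can be recomputed from the bucket table alone. The sketch below assumes the bucket midpoint is used as the confidence proxy (as the table's last column suggests) and that the "<30" bucket spans 0-29, giving a midpoint of 14.5; both are assumptions about how the published number was derived, and an ECE based on mean stated confidence per bucket would differ slightly.

```python
# (label, assumed bucket midpoint in %, n, correct), copied from the table.
buckets = [
    ("90-100", 95.0, 773, 530),
    ("70-89", 79.5, 950, 475),
    ("50-69", 59.5, 283, 83),
    ("30-49", 39.5, 79, 14),
    ("<30", 14.5, 31, 1),  # assumes the bucket covers 0-29
]

total_n = sum(n for _, _, n, _ in buckets)
total_correct = sum(c for _, _, _, c in buckets)

weighted_gap = 0.0
for label, mid, n, correct in buckets:
    acc = 100.0 * correct / n
    gap = abs(mid - acc)
    weighted_gap += n * gap
    print(f"{label:>6}  n={n:4d}  acc={acc:5.1f}%  gap={gap:4.1f}pp")

# n-weighted mean absolute gap between confidence and accuracy.
ece = weighted_gap / total_n
print(f"bucketed n={total_n}, correct={total_correct}, ECE={ece:.1f}%")
```

Under these assumptions the result is about 27.9%, matching the rounded 28% above, and the bucket totals reproduce the 2116 bucketed questions and 1103 bucketed correct answers.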

Cost

Real cost from Anthropic + OpenAI billing dashboards for the run window: ~$3,500 + ~$25 judge.
Per-Q (real): ~$1.60.
The internal cost-tracker reports approximately 3× this due to a known accumulation bug. The corrected per-run summary line is the source for the dashboard-reconciled number above.

For comparison, typical published agent-with-tools submissions appear to spend $5-$15 per question based on described multi-stage architectures (specialized vision routing, OCR pre-passes, multi-model ensembles per question type). FF-STACK v8 uses a single uniform pipeline with two to three solver legs and one arbiter per question, resulting in materially lower marginal cost.

Limitations

Image questions excluded. This submission is text-only.
Single-run validation. The full-HLE result is from one run. Same-sample aggregate variance is essentially zero on the calibration replay; a multi-run mean on the full set was not pursued. Calibration samples suggest the single-run score sits in the lower portion of the expected variance band; additional runs would better estimate variance and central tendency. A 3-run full-HLE mean is candidate work for the next submission cycle.
Calibration over-confidence. Ranking is correct; absolute confidence values are inflated by ~26-30pp across the top three buckets.
Safety-refusal pattern on biological/chemical content. Approximately 3% of Bio/Chem questions trigger a content-filter refusal at the architecture level. The direct-commit bypass mechanism rescues most of these (validated on multiple Bio refusal-cascade cases). The bypass mechanism’s success rate is bounded by the cross-architectural leg’s knowledge on the same content.
Five unrecovered questions. Out of 2,158 text-only questions, 2,153 saved successfully. Five hit deterministic content-filter refusal patterns that the bypass mechanism could not rescue. These five are scored as incorrect — they remain in the 2,158-question denominator and contribute zero to the 1,119 numerator. Maximum upside if all five had been recoverable and correct: +0.23pp.
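The two sensitivity bounds in this document, the −0.09pp worst case from the leak audit under Filtering Policy and the +0.23pp maximum upside from the five unrecovered questions, are both simple perturbations of the headline numerator. A minimal check, with illustrative variable names:

```python
CORRECT, TOTAL = 1119, 2158


def score_pct(correct: int, total: int = TOTAL) -> float:
    """Score in percent over the fixed text-only denominator."""
    return 100.0 * correct / total


headline = score_pct(CORRECT)

# Leak audit: revert the 2 correct answers that followed an "HLE"-named search.
leak_worst_case = score_pct(CORRECT - 2) - headline

# Unrecovered refusals: assume all 5 had been rescued and judged correct.
refusal_upside = score_pct(CORRECT + 5) - headline

print(f"headline        {headline:.2f}%")          # 51.85%
print(f"leak worst case {leak_worst_case:+.2f}pp")  # -0.09pp
print(f"refusal upside  {refusal_upside:+.2f}pp")   # +0.23pp
```

The denominator stays at 2,158 in every case because the five unsaved questions are already counted as incorrect.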

What Remains Unpublished

The following are deliberately not specified in this document and remain closed:

Orchestration logic — per-domain pair selection, third-leg triggers, self-review pass conditions, broken-leg detection thresholds, refusal-bypass priority order.
Trigger policies — spiral-detector thresholds, nudge content, self-review gating conditions.
Governance content — the FF-LATTICE prompt text itself.
Full runner configuration — runtime invocation, environment variables, worker-pool tuning, rate-limit settings.

The box-level architecture, the mechanism descriptions, the model stack, the filtering policy, the calibration data, and all empirical results above are publicly disclosed. The methodology is reproducible at the architectural level; the parameter-level tuning is not.

Predictions Files

Three artifacts, three levels of access:

Scrubbed public file — question_id, domain, stated confidence, and per-question judge verdict for the full run. Published on GitHub: github.com/FieldframeLabs/HLE-Text-Run. Reproduces the headline score, the per-domain breakdown, and the calibration table. No model responses, no gold answers, no orchestration telemetry. The repo includes a verify.py that reproduces all three. Scrubbed files for the calibration and holdout sample runs are being added there as well.
Verification file — adds full model responses, token usage, and the per-question judge_response, so the canonical judge can be re-run independently. Internal pipeline telemetry (routing config, mechanism trigger logs) is stripped. Available to leaderboard maintainers for submission verification.
Raw predictions file — full model responses plus complete internal orchestration telemetry. Available on request under reasonable terms.
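For readers who want to check the headline number from the scrubbed file themselves, the sketch below shows the shape of the computation. It is not the repository's verify.py: the row format (dicts with `domain` and `verdict` keys, and `"correct"` as the verdict value) is a hypothetical schema chosen for illustration, and the real field names should be taken from the published file.

```python
from collections import defaultdict

# Fixed denominator: all text-only questions, including the 5 that were never saved.
TEXT_ONLY_TOTAL = 2158


def summarize(rows, denominator: int = TEXT_ONLY_TOTAL) -> dict:
    """Headline score and per-domain counts from per-question verdict rows.

    Each row is assumed (hypothetically) to look like
    {"question_id": ..., "domain": ..., "verdict": "correct" | "wrong"}.
    """
    by_domain = defaultdict(lambda: [0, 0])  # domain -> [correct, saved]
    correct = 0
    for row in rows:
        is_correct = int(row["verdict"] == "correct")
        correct += is_correct
        by_domain[row["domain"]][0] += is_correct
        by_domain[row["domain"]][1] += 1
    return {
        "saved": len(rows),
        "correct": correct,
        "score_pct": round(100.0 * correct / denominator, 2),
        "by_domain": {d: tuple(v) for d, v in by_domain.items()},
    }
```

The point of the explicit `denominator` argument is that the file holds 2,153 rows while the score is reported over 2,158 questions; dividing by the row count instead would overstate the result.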

Repository and Acknowledgments

The FF-STACK framework, the FF-LATTICE governance text, the pipeline orchestration logic, the per-domain routing configuration, and the codified agent (Cade) are closed source. The methodology described in this document, the failure-mode taxonomy, the evidence-tier framework, and the empirical results above are publicly disclosed.

This submission builds on Anthropic’s Claude API, OpenAI’s Chat Completions API, Brave Search, and a number of free domain APIs (Wikipedia, PubChem, UniProt, NCBI, Wolfram Alpha, Wikidata). The HLE benchmark is a collaboration of CAIS and Scale AI. The judge methodology is bit-identical to CAIS’s centerforaisafety/hle/hle_eval/run_judge_results.py.

Submitted to the HLE Leaderboard for Agents with Tools (zoom-ai HuggingFace Space). Scrubbed predictions and a verifier: github.com/FieldframeLabs/HLE-Text-Run.

Contact: edvorochkin@gmail.com