Listed on the HLE Leaderboard for Agents with Tools (added 2026-05-19).
This is the technical methodology paper. The HLE blog covers the same submission in narrative form. The research post covers the broader Fieldframe research program behind FF-STACK.
| Field | Value |
|---|---|
| Model | FF-STACK v8 |
| Models Used | Opus 4.7 + GPT-5.4 |
| Organization | Fieldframe Labs |
| Open Source | No — methodology paper is public; framework source is closed |
| Publish Date | 2026-05-14 |
| Text-Only Score | 1,119 / 2,158 = 51.85% |
| Full Set Score | not submitted (text-only run) |
| Per-Q Cost (real) | ~$1.60 |
| Total Run Cost | ~$3,500 (Anthropic + OpenAI) + ~$25 judge |
| Total Wall Time | ~57 hours including pauses for two API outages |
| Filtering | ✓ 9-host HLE-leakage blacklist on web_search + web_fetch |
What FF-STACK v8 Is
FF-STACK v8 is a cross-architectural reasoning agent built on Claude Opus 4.7 (primary solver) and GPT-5.4 (cross-architectural verification leg). Both base models operate inside a governance framework called FF-LATTICE — a codified set of reasoning, evidence, and commit principles developed across ~1 year of cross-architecture LLM research (Claude, GPT, Gemini, Grok). The framework is content-proprietary; its empirical effect is documented in the cross-generation ablation below.
The pipeline structure is:
Sentinel (Haiku domain classifier)
│
▼
Forge (pre-research, vanilla-API knowledge primer)
│
▼
Solver(s): 2-3 legs per question, including cross-architectural pairing on most domains
│
▼
Arbiter (Opus 4.7 + full governance + tools)
│
▼
Final answer + provenance trace
Sentinel is a Haiku-class domain classifier that picks one of eight HLE categories. Lightweight, fast, cheap.
Forge is a pre-research stage that fires domain-specific lookups against free APIs (Wikipedia, PubChem, UniProt, NCBI) before the heavy solver loop. Runs on a vanilla model call — no governance overhead.
The solver loop is where governance lives. Each solver leg uses Claude Opus 4.7 (in one of three effort modes: high, xhigh, or high with adaptive thinking) wrapped in the FF-STACK governance prompt and given access to 18 tools: standard ones (Brave web_search, web_fetch, a Python compute sandbox with a hard kill, Wolfram Alpha, Wikipedia) plus several custom ones. Vanilla GPT-5.4 reasoning=high is also used to bring in different training data, though GPT hasn’t been tuned for the stack yet.
For most domains, two solver legs run in parallel, frequently using cross-architectural pairing (one FF Opus, one vanilla GPT-5.4 reasoning=high). On harder domains, a third leg joins. On agreement at the pair stage, the pipeline commits. On disagreement, all reasoning chains flow into the arbiter.
The arbiter is always FF Opus 4.7 with full governance, full tools, and the Forge context. It reads the disagreeing chains, runs independent verification, and commits a final answer.
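The commit-on-agreement, arbitrate-on-disagreement shape described above can be sketched in a few lines. This is an illustration of the box-level flow only: `Leg`, `normalize`, and `route` are hypothetical names, the whitespace-and-case comparison stands in for the Haiku-based semantic compare, and the actual pair selection and third-leg triggers are not published.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Leg:
    """One solver leg's output (hypothetical structure, not the real one)."""
    name: str    # e.g. "ff-opus" or "gpt-5.4"
    answer: str  # final answer extracted from the leg's response
    chain: str   # reasoning chain, passed to the arbiter on disagreement


def normalize(answer: str) -> str:
    # Stand-in for semantic comparison: fold case and collapse whitespace.
    return " ".join(answer.lower().split())


def route(legs: Sequence[Leg], arbiter: Callable[[Sequence[Leg]], str]):
    """Return (answer, path): commit on agreement, otherwise ask the arbiter."""
    if not legs:
        raise ValueError("at least one solver leg is required")
    distinct = {normalize(leg.answer) for leg in legs}
    if len(distinct) == 1:
        return legs[0].answer, "pair-agreement"
    # All chains flow to the arbiter, which commits the final answer.
    return arbiter(legs), "arbiter"
```

Passing the arbiter in as a callable keeps the sketch independent of any particular model client; in the submission the arbiter is always the governed Opus 4.7 with tools and Forge context.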
The per-domain pair selection, third-leg triggers, self-review pass conditions, broken-leg detection logic, refusal-bypass priority order, and governance prompt content are not specified in this document. The box-level architecture above is the reproducible surface; the orchestration logic and governance content are the proprietary substance.
What is specified openly: the models. The pipeline uses Opus 4.7 in three effort modes (high, xhigh, high-with-adaptive-thinking), all governed, plus vanilla GPT-5.4 reasoning=high as the cross-architectural leg. Haiku 4.5 handles classification and semantic comparison. OpenAI’s o4-mini powers the thinking-as-tool callable. OpenAI’s o3-mini, run in CAIS’s canonical judge configuration, does the final judging.
Mechanisms (general-level descriptions)
The pipeline implements several specific mechanisms whose effect is documented but whose triggering logic and parameter choices are proprietary:
- Cross-architectural pairing. Cross-model verification on most domains using Cade (short for Cadence, the codified Opus 4.7 agent) and GPT-5.4 as the two solver legs. Cross-architectural disagreement is a more reliable signal of question difficulty than same-architecture variance.
- Direct-commit bypass on multi-leg failure. When multiple solver legs return broken responses (empty, error-prefix, or refusal patterns), the pipeline skips the arbiter and commits a healthy survivor directly. Priority favors cross-architectural survivors on safety-trigger content. Fires on roughly 3% of questions in practice.
- Simple-arbiter prompt. A single neutral arbiter template with explicit anti-bias rules (“length is not evidence”, “plurality is not evidence”, “verify top candidates before committing”). Validated against several per-domain prompt variants on a stable-correct calibration cohort; the neutral template outperformed the specialized variants by reducing arbiter rationalization patterns.
- Refusal-fallback priority. When the arbiter itself refuses a response, the pipeline falls back to a healthy solver leg by a fixed priority that favors cross-architectural survivors. Selection criteria were tightened after an audit of a Biology question where a truncated Cade response was being preferred over a correct cross-architectural answer.
- Atomic checkpoint write + resume-by-ID. Every per-question save writes to a .tmp file then atomically renames. Process kills mid-write cannot corrupt the predictions file. On restart, the runner skips completed question IDs. Validated on a 20-question kill-and-resume smoke (kill at Q10, restart, Q11+ fired cleanly).
- Spiral detector. A round-by-round output-decay heuristic that fires conditional nudges when solver output enters a low-density loop. Trigger thresholds and nudge content are proprietary.
- Self-review safety net. Content-gated re-prompt on short, missing, or refusal-shape responses. Fires on roughly 1-2% of questions.
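The atomic checkpoint write and resume-by-ID mechanism in the list above follows a well-known pattern. A minimal sketch, assuming a single JSON predictions file keyed by question ID; the function names and file layout are illustrative, not the runner's actual code:

```python
import json
import os
import tempfile


def save_predictions(path: str, predictions: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(predictions, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        # os.replace overwrites the destination; the rename is atomic on POSIX.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_completed(path: str) -> dict:
    """Return previously saved predictions, or an empty map on first start."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def run(question_ids, solve, path: str) -> dict:
    """Resume-by-ID: skip every question already present in the file."""
    predictions = load_completed(path)
    for qid in question_ids:
        if qid in predictions:
            continue
        predictions[qid] = solve(qid)
        save_predictions(path, predictions)
    return predictions
```

Because the temp file lives in the same directory as the target, the rename never crosses filesystems, so a kill at any point leaves either the previous complete file or the new complete file on disk.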
Cross-Generation Ablation
The 51.85% on full HLE is one run on one configuration. The more informative question is how the architecture compares to vanilla baselines on the same questions, and what happens when the same methodology runs on the previous model generation. Testing began on Opus 4.6 non-thinking, where the majority of the time and budget went; the full 4.6 run was imminent when 4.7 launched, and the work pivoted to tune for the new generation. Applying the same architecture pattern to 4.6 on the same question set is the strongest evidence that the lift comes from the governance approach rather than the specific model. Governance here is an umbrella: a set of reasoning, evidence, and commit principles delivered partly as prompt and partly as codified scaffolding, with the balance between the two re-tuned per generation — more of the work sat in the prompt on 4.6, more shifted into the codified layer on 4.7. What stays constant across generations is the approach, not any single delivery mechanism.
Three smaller sample sets appear across this submission, all designed as representative spreads across HLE’s eight domains: a 100Q calibration sample at 57%, a 100Q purpose-hard holdout at 53%, and a 200Q fresh holdout at 55%. The range between them is not primarily a function of sample size. It reflects how difficult the spread is to calibrate at higher score levels: when more of the discrimination occurs at the hard end of the distribution, small variation in that spread shifts the headline score meaningfully. Across the 400 questions in these three samples, the average is 55%, close to the full HLE result of 51.85%.
The comparison sample is a representative subset that tracks full-HLE behavior to within about 1pp (V8 here = 53%, V8 on full HLE = 51.85%, a 1.15pp gap). Every cell below is on the same sample, judged by the same canonical o3-mini judge:
| Config | Score | Notes |
|---|---|---|
| Vanilla Opus 4.6 | 18% | bare API |
| Vanilla Opus 4.7 high | 29-33% | bare API |
| Vanilla Opus 4.7 xhigh | 30-33% | bare API |
| Vanilla GPT-5.4 reasoning=high | 35% | bare API |
| Vanilla Opus 4.7 thinking-high | 38% | bare API |
| FF-STACK on Opus 4.6 | 45% | full FF-STACK on Opus 4.6, the 4.6-era gold |
| FF-STACK single-model on Opus 4.7 high | 44% | full FF-STACK + high effort |
| FF-STACK single-model on Opus 4.7 thinking-high | 46% | best 4.7 single-model |
| V8 cross-arch (full pipeline) | 53% | submission config, 1.15pp above the full-HLE score |
Points worth noting:
- Vanilla baseline calibrates with third-party numbers. ScaleAI’s published Opus 4.6 non-thinking score of 19% sits within sampling noise of the 18% measured on this subset.
- The 4.6 governed config outscored vanilla base by 27 points (45% vs 18% on this sample). That’s the single largest config-level lift in the research log. The agent-lift-to-agent-lift comparison made for 4.7 later in this list also applies on the previous generation: Anthropic’s published Opus 4.6 figures are 40.0% without tools and 53.1% with their agent layer (same source as the 4.7 figures), a +13.1pp agent lift. FF-STACK’s +27pp on 4.6 is roughly double that, with each lift measured over its own no-tools baseline. The absolute baselines differ: FF-STACK’s 18% is non-thinking, while Anthropic’s 40.0% without-tools figure is mode-unspecified, the same mode-specification gap that affects the published 4.7 numbers.
- Single-model FF-STACK lift on Opus 4.7 was ~10pp per mode, with the 4.7 high config at 44% (vs 29-33% vanilla = +11-15pp) and the 4.7 thinking-high config at 46% (vs 38% vanilla = +8pp). Still meaningful lift, but materially compressed from 4.6’s +27pp. That compression on individual modes pushed the work toward pair-vote, cross-architectural verification, and other multi-leg orchestration patterns, eventually producing the multi-agent V8 config above.
- 4.6 vs 4.7 single-model ordering. FF-STACK on 4.6 (45%) edged out FF-STACK single-model on 4.7 high (44%) by 1pp on this sample. Likely a mix of variance on a 100Q sample, the compression effect (4.6 had more room to grow under governance; 4.7 starts closer to the ceiling), proximity to SOTA, and the fact that most stack tuning was done on 4.6, leaving the 4.7 single-model config less mature. Additional per-mode tuning on 4.7 could likely close or invert that gap; multi-mode orchestration was pursued instead.
- Two comparisons doing different jobs. The vanilla rows in the table decompose where the lift comes from — which mechanism is contributing what — and are mechanism-attribution evidence, not competitiveness evidence. The right peer benchmark for competitiveness is agent-to-agent: complete systems with their own commit policies, tool access, and orchestration. Anthropic publishes two numbers for Opus 4.7 from their launch announcement (summarized at llm-stats): 46.9% without tools and 54.7% with their agent layer (the 54.7% is also independently verified on the HuggingFace Zoom AI Agents-with-Tools leaderboard). Neither number specifies which effort mode of Opus 4.7 was used to produce it. The Scale AI HLE leaderboard has the same gap — it lists Opus 4.7 at 36% without specifying the mode. Because the underlying base-model configuration is unclear in those published numbers, this paper uses two separate comparison anchors:
- For mechanism attribution (how much of FF-STACK’s score is from governance + orchestration vs. base model alone), the relevant numbers are the vanilla baselines measured here on the same sample with explicit mode labels: FF-STACK at +15pp over the strongest vanilla 4.7 (thinking-high at 38%), or +19pp over the 4.7-era vanilla average (~34%).
- For competitiveness (where FF-STACK sits next to other agent-with-tools systems), Anthropic’s published 54.7% with their agent layer, regardless of underlying mode: FF-STACK v8 at 51.85% on the 2,158-question text-only set is about 3pp below Anthropic’s full-set number. Part of that gap is the text-only vs full-set denominator difference (Anthropic’s full set includes ~14% multimodal questions this text-only submission doesn’t cover). Anthropic’s published agent lift of +7.8pp (54.7 − 46.9) is the most directly comparable peer-lift figure available, even with the mode-specification gap. Two qualifications on that peer-lift comparison: lift compresses as the base approaches the SOTA ceiling, so a smaller absolute agent lift partly reflects operating nearer the top of the scale rather than a weaker method; and because the submission config is predominantly Opus (the cross-architectural GPT-5.4 leg aside), FF-STACK’s own +15pp over its strongest Opus baseline is close to a like-for-like agent-lift figure on the same base family.
- Peer comparison: Zoom AI. Zoom’s entry on the same leaderboard is a “federated” agent that orchestrates multiple frontier models (GPT-5/GPT-5.2 and Gemini 3 Pro Preview). Zoom reports 53.0% on the full set and 55.2% on text-only, both listed on the HuggingFace leaderboard. On the text-only set, FF-STACK v8’s 51.85% is 3.35pp behind that enterprise system.
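The lift figures quoted in the points above are simple percentage-point differences. The following sketch recomputes them from the numbers in this section; the variable names are illustrative, and the inputs are copied from the table and from the published figures cited above rather than re-measured.

```python
# Scores in percent, copied from the ablation table and the cited published figures.
ff_on_46, vanilla_46 = 45.0, 18.0
anthropic_46_tools, anthropic_46_no_tools = 53.1, 40.0
anthropic_47_tools, anthropic_47_no_tools = 54.7, 46.9
v8_sample, vanilla_47_thinking_high = 53.0, 38.0
v8_full_text_only = 100.0 * 1119 / 2158


def lift(system: float, baseline: float) -> float:
    """Percentage-point difference, rounded to suppress float noise."""
    return round(system - baseline, 2)


print("FF-STACK lift on Opus 4.6:", lift(ff_on_46, vanilla_46))
print("Anthropic agent lift on 4.6:", lift(anthropic_46_tools, anthropic_46_no_tools))
print("Anthropic agent lift on 4.7:", lift(anthropic_47_tools, anthropic_47_no_tools))
print("V8 over strongest vanilla 4.7:", lift(v8_sample, vanilla_47_thinking_high))
print("Gap to Anthropic's agent number:", lift(anthropic_47_tools, v8_full_text_only))
```

The last line compares a full-set number against a text-only number, so it inherits the denominator caveat discussed above.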
Filtering Policy (qualifies for verified ✓ badge)
The pipeline post-filters web_search results and pre-rejects web_fetch URLs matching dataset distribution channels. The submitted policy is 9 hosts:
- huggingface.co/datasets/cais (HLE dataset host)
- huggingface.co/cais (CAIS HF organization)
- huggingface.co/datasets/jxcai-scale (Scale-AI HLE prompt-set dump; added Phase 2)
- github.com/cais (CAIS repos)
- github.com/centerforaisafety (alt org name)
- cais.io (CAIS website)
- kaggle.com/datasets/cais (potential mirror)
- scale.com/leaderboard/humanitys-last-exam (leaderboard pages, may quote answers)
- solveforearth.substack.com/p/humanitys-last-exam (3rd-party HLE analysis with quoted Q content; added Phase 2)
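A filter over this host list could look like the sketch below. It is an assumption-laden illustration, not the pipeline's filter: `is_blocked` and `filter_results` are hypothetical names, and plain prefix matching on host plus path is one possible matching rule, chosen here because it also catches slugs that extend a listed path.

```python
from urllib.parse import urlsplit

# The nine host/path prefixes from the submitted policy.
BLACKLIST = (
    "huggingface.co/datasets/cais",
    "huggingface.co/cais",
    "huggingface.co/datasets/jxcai-scale",
    "github.com/cais",
    "github.com/centerforaisafety",
    "cais.io",
    "kaggle.com/datasets/cais",
    "scale.com/leaderboard/humanitys-last-exam",
    "solveforearth.substack.com/p/humanitys-last-exam",
)


def is_blocked(url: str) -> bool:
    """True when the URL's host + path starts with a blacklisted prefix."""
    parts = urlsplit(url if "://" in url else "https://" + url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    target = host + parts.path.lower()
    # Plain prefix match: errs toward over-blocking (assumed rule).
    return any(target.startswith(prefix) for prefix in BLACKLIST)


def filter_results(urls):
    """Post-filter search-result URLs, dropping blacklisted ones."""
    return [url for url in urls if not is_blocked(url)]
```

The same predicate can serve both roles described above: applied to result lists after web_search, and as a pre-check before web_fetch.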
Audit basis (calibration runs): Across the 100Q calibration sample (canonical-judged 57/100) and 232Q holdout (110/200 = 55% on text portion), 0 blacklisted URLs were attempted across 522 web_search + 95 web_fetch calls combined.
Audit basis (full submission run, post-hoc): A full-run audit of the submitted 2153-Q predictions file surfaced 2 attempted fetches to channels that were not on the run-time 7-host blacklist:
- A Humanities question attempted to fetch huggingface.co/datasets/jxcai-scale/hle-public-questions/...
- A Biology question attempted to fetch solveforearth.substack.com/p/humanitys-last-exam-the-ultimate
Both questions were independently judged WRONG by the canonical o3-mini judge. Net score impact of these two channels not being blocked at run time: 0pp on 1119/2158 = 51.85%. The submitted blacklist policy is the patched 9-host version (above) which would block both channels; the run-time policy was the 7-host version. This delta is disclosed in full transparency — the 51.85% reported score is bit-identical to what the patched policy would have produced on the same architecture, the same questions, and the same response chains, because the two blocked-now fetches contributed zero correct judgments.
In addition, the audit found 6 further QIDs (8 total) where the search-query string contained the literal token “HLE”. Of those 8 lookup-attempt QIDs, 6 were judged wrong and 2 were judged correct. The 2 correct outcomes did NOT quote leaked content in their response chains — they reached the right answer via reasoning paths that happened to occur after a leak-named search. Reverting those 2 to wrong-by-policy gives a worst-case ceiling impact of −0.09pp. The agent’s awareness that the questions are HLE-style is partly intrinsic (Opus 4.7 has a January 2026 cutoff; HLE was published 2025 and the question style is recognizable from training data) and partly explicit via the user-message preamble (see Methodology Disclosure below).
arxiv.org NOT blocked. Audit found one borderline arxiv-quote leak on a single question where the solver quoted a paper abstract verbatim. Net policy: “block dataset distribution channels, allow scholarly research” — consistent with HLE’s design as an open-book benchmark and with the rest of the tool stack (Wikipedia, NCBI, Wolfram Alpha, PubChem all carry similar leak-potential by the same logic; blocking arxiv alone would be incoherent). Disclosed for transparency.
Methodology Disclosure: HLE-Aware Preamble
Every question is wrapped with a lightweight behavioral framing prompt. Its purpose is to suppress two specific failure modes the heavily-governed agent would otherwise exhibit:
- Refusal cascades on Bio/Chem questions where Anthropic’s content filter triggers on legitimate academic content (~3% of Qs without the preamble; the preamble plus a separate retry-time reframe reduces this materially).
- “I cannot determine” hedging on questions where the agent’s evidence standards exceed what’s recoverable from the question text alone.
The second failure mode is worth dwelling on, because it is a direct consequence of how HLE is scored. Cade is governed to hold high evidence standards and not overclaim — when it cannot substantiate an answer it is built to say so, and its analysis sometimes concludes that the question itself is underspecified. HLE grades the final answer only. A response that reasons carefully and then honestly declines to commit scores identically to a blank, and lower than a lucky guess that commits to a wrong answer with no reasoning — or with reasoning errors the score never inspects. The benchmark cannot distinguish a thorough, well-reasoned non-conclusion from an empty one. Under a rubric that scored the reasoning trace rather than only the final token, the former could legitimately outscore the latter; HLE cannot see that difference. The preamble is the workaround — it instructs the agent to commit an answer anyway, because on an absolute-scoring benchmark a non-answer is the worst possible outcome regardless of reasoning quality. The broader limitation — that single-answer benchmarks discard reasoning quality entirely — is taken up in the research post.
One ablation data point on this: pre-preamble Cade builds showed model-refusal rates of roughly 4-6% on representative samples, compared to ~1% on vanilla Opus 4.7, or roughly 4-6x baseline. Governance was producing more cautious behavior on questions where evidence was thin. With the preamble in place, the submitted run shows a refusal rate of 0.1%, meaningfully below vanilla. The preamble is doing real work, but the work is leveling the playing field, not tilting it. Without the preamble, the governance layer was systematically penalizing itself on HLE relative to vanilla, because HLE’s absolute scoring punishes the exact evidence-tier discipline the governance is designed to produce. The preamble removes a thumb the governance accidentally placed on the wrong side of the scale; it does not place a new thumb on the other side. The comparison “FF-STACK vs vanilla” with the preamble in place measures something close to apples-to-apples on commit behavior.
A related framing worth being explicit about: the preamble is sometimes read as a separate intervention bolted onto governance — a “commit even when you’d rather not” thumb. That framing treats the commit decision as orthogonal to the rest of the system; in practice it isn’t. Vanilla Opus’s commit-under-uncertainty is mechanically a one-shot output weighted by training-data priors — committing means picking the highest-prior continuation. The governed agent’s commit-under-uncertainty is the terminal step of a multi-leg structured process: solver legs ran, evidence got weighed, the arbiter reconciled disagreeing chains. The preamble does not tell the agent to ignore that analysis and guess; it tells it to pick the best candidate from what it just produced rather than declining to commit. Strip the governance and keep the preamble, and you get vanilla with extra words — no lift, no analysis to convert from. Strip the preamble and keep the governance, and the 4-6% refusal rate eats the score. The lift requires both because they are one system: structured reasoning plus a commit rule that makes the reasoning legible to absolute-scoring. The preamble is the commit policy of the governance, not a separate add-on.
It is behavioral framing, not a contamination control. The hard contamination lever is the URL blacklist; the preamble’s measurable effect on leak-channel access is small — it contributes the literal “HLE” token to ~7/8 of the active answer-lookup attempts described above, but topic-specific queries without the preamble would reach the same blacklisted channels via topic relevance anyway. And because the solver is Opus 4.7 (January 2026 cutoff) and HLE was published in 2025, the model has intrinsic recognition of HLE-style questions independent of the preamble. The preamble is retained because its anti-refusal/anti-hedge contribution materially exceeds its leak-channel contribution on this architecture.
The exact text, as it appears in the submitted predictions file: “CONTEXT FROM EUGENE: This is an HLE benchmark run. I know you have high evidence standards — that’s why I built you that way. But on this benchmark, ‘I cannot determine’ scores zero while even a wrong guess has a chance. Please give your best answer on every question, even when uncertain. Trust your reasoning.”
This is disclosed because it was part of the submitted run; reviewers should weigh it as they see fit.
Run Configuration
- Hardware: single workstation, 6 concurrent worker threads.
- Provider tiers: Anthropic + OpenAI Tier 4.
- Models: claude-opus-4-7 (governed solver + arbiter), gpt-5.4 reasoning=high (cross-arch leg, vanilla), claude-haiku-4-5 (Sentinel domain classifier + answer extraction + semantic compare), o4-mini (deep_analysis tool), o3-mini-2025-01-31 (judge, bit-identical to canonical CAIS run_judge_results.py).
- Tools active during the run: Brave web_search, web_fetch, compute (Python sandbox with a 30-second hard kill), wolfram_alpha, wikipedia_lookup, deep_analysis (OpenAI o4-mini via the thinking-as-tool callable). Twelve additional tools available but unused on this run.
- Submission scoring: canonical-judge mode (regex extraction fallback disabled for bit-parity with CAIS canonical methodology).
The runtime invocation, environment variable settings, per-domain routing parameters, worker-pool tuning, and rate-limit configurations are not specified in this document. Reviewers requiring runtime replication can request the runner configuration via the contact email below under reasonable terms.
Score Reproducibility
| Run | Sample | Score | Method |
|---|---|---|---|
| Calibration | sample_100_representative (100Q text) | 57/100 = 57.0% | canonical |
| Calibration replay at workers=6 | sample_100_representative | 57/100 = 57.0% | canonical |
| Fresh holdout | sample_200_fresh (200Q text) | 110/200 = 55.0% | canonical |
| Purpose-hard holdout | holdout_100_calibration | 53/100 = 53.0% | canonical |
| Full HLE (this submission) | text-only (2,158Q) | 1,119/2,158 = 51.85% | canonical |
Three observations on reproducibility:
- Identical aggregate scores across two replays of the calibration sample. 51 questions correct on both runs, 37 wrong on both, 12 flipped — six toward correct, six toward wrong, net zero. Per-Q variance at temperature 1 is ~12%; the cross-leg ensemble compresses it at the score level.
- Score range across samples reflects spread-calibration difficulty, not size. The smaller samples are designed as representative spreads across HLE’s eight domains. The range (57% → 55% → 53% → 51.85%) reflects how hard the difficulty spread is to calibrate at higher score levels: when more of the discrimination happens at the hard end of the distribution, small spread variation moves the headline score meaningfully. Across the 400 questions in the three smaller samples, the average is 55%, close to the full HLE result of 51.85%.
- holdout_100_calibration is the strongest full-HLE predictor. Its 53% landed 1.15pp above the full-HLE actual of 51.85%, vs +5.15pp for sample_100_representative and +3.15pp for sample_200_fresh. Used as the canonical pre-submission validation sample for v8 onward.
The cross-leg ensemble’s same-sample aggregate-reproducibility and the multi-sample bracketing together provide the basis for considering single-run validation sufficient for this submission. A 3-run full-HLE mean would tighten the variance estimate further and is candidate work for the next submission cycle.
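The replay accounting and the sample-to-full-set gaps in the observations above can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in this section; the variable names are illustrative.

```python
# Counts from the two calibration replays (100 questions each).
both_correct, both_wrong = 51, 37
flipped_to_correct, flipped_to_wrong = 6, 6

total = both_correct + both_wrong + flipped_to_correct + flipped_to_wrong
first_run_score = both_correct + flipped_to_wrong   # correct first, wrong on replay
replay_score = both_correct + flipped_to_correct    # wrong first, correct on replay
flip_rate = 100.0 * (flipped_to_correct + flipped_to_wrong) / total

# (correct, n) per sample, from the reproducibility table.
samples = {
    "sample_100_representative": (57, 100),
    "sample_200_fresh": (110, 200),
    "holdout_100_calibration": (53, 100),
}
full_hle = 100.0 * 1119 / 2158

# Gap of each sample score over the full-HLE result, in percentage points.
gaps = {name: round(100.0 * c / n - full_hle, 2) for name, (c, n) in samples.items()}
pooled = 100.0 * sum(c for c, _ in samples.values()) / sum(n for _, n in samples.values())

print(first_run_score, replay_score, flip_rate)
print(gaps)
print(pooled)
```

The two runs land on the same aggregate because the six flips in each direction cancel, even though 12 of 100 individual verdicts changed.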
Calibration Disclosure
Stated confidence is captured per-Q in the response footer (Confidence: <0-100>%). Calibration buckets from the full 2153-Q canonical-judged submission run (37 entries have no stated_confidence and are excluded from the table; 16 of those 37 were judged correct, accounting for the difference between the bucket-summed 1103 and the headline 1119):
| Confidence | n | Correct | Acc | \|bucket-mid − acc\| |
|---|---|---|---|---|
| 90-100 | 773 | 530 | 68.6% | 26pp |
| 70-89 | 950 | 475 | 50.0% | 30pp |
| 50-69 | 283 | 83 | 29.3% | 30pp |
| 30-49 | 79 | 14 | 17.7% | 22pp |
| <30 | 31 | 1 | 3.2% | 11pp |
Expected Calibration Error (ECE) ≈ 28% (n-weighted across the 2116 bucketed Qs). The ranking is monotonically correct: 90+ confidence answers are about 2.3× more accurate than 50-69 answers (68.6% vs 29.3%) and about 3.9× more accurate than 30-49 answers (68.6% vs 17.7%). The absolute values are systematically inflated by ~26-30pp across the top three buckets.
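The ECE figure can be recomputed from the bucket table. A minimal sketch, assuming each bucket is represented by its midpoint (95, 79.5, 59.5, 39.5, and 14.5 for the open-ended <30 bucket; these midpoints are an assumption made here). A standard ECE would use each bucket's mean stated confidence, which the table does not report, so this is an approximation.

```python
# (label, assumed midpoint in %, n, correct) from the calibration table.
BUCKETS = [
    ("90-100", 95.0, 773, 530),
    ("70-89", 79.5, 950, 475),
    ("50-69", 59.5, 283, 83),
    ("30-49", 39.5, 79, 14),
    ("<30", 14.5, 31, 1),
]


def calibration_summary(buckets):
    """Return per-bucket (label, n, accuracy, gap) rows and the n-weighted gap."""
    total = sum(n for _, _, n, _ in buckets)
    rows = []
    weighted_gap = 0.0
    for label, midpoint, n, correct in buckets:
        accuracy = 100.0 * correct / n
        gap = abs(midpoint - accuracy)
        weighted_gap += n * gap
        rows.append((label, n, round(accuracy, 1), round(gap, 1)))
    return rows, weighted_gap / total


rows, ece = calibration_summary(BUCKETS)
for row in rows:
    print(row)
print(f"n-weighted gap: {ece:.1f}pp over {sum(r[1] for r in rows)} questions")
```

Under these midpoint assumptions the weighted gap comes out just under 28pp, consistent with the reported figure.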
No calibration tuning was performed on this submission. The stated confidence is raw self-reported output from the solver, with no post-hoc rescaling, no Platt scaling, no isotonic regression, and no calibration-aware prompt adjustments. This was a known limitation going in; under limited resources it was deprioritized in favor of accuracy lift (cross-architectural pair-vote, the simple-arbiter prompt, the direct-commit bypass) rather than confidence rescaling. The 28% ECE here is therefore an untuned baseline. The meaningful result is that the ranking signal is preserved end-to-end, which is the prerequisite for any future calibration-aware rescoring. Items noted for the next submission cycle: a calibration-aware re-scoring layer, per-domain confidence rebalancing, and an evaluation of whether the answer-format preamble’s commit-even-when-uncertain instruction is materially contributing to the overconfidence (see the HLE-Aware Preamble section above).
One distinction worth making for reviewers reading both this paper and the broader research writeup. The self-reported confidence number measured here is not the same thing as the agent’s broader epistemic discipline. HLE’s stated_confidence is a single integer the solver emits at the end of a one-shot answer to a question whose reasoning trace will be discarded by the grader; it is a benchmark-specific output, partly shaped by the answer-format preamble that instructs the agent to commit even when uncertain. The agent’s research-mode epistemic behavior — claims tagged E0/E1/E2 by evidence tier, contradiction detection across the claim ledger, refusal to upgrade confidence without registered sources, explicit “I have no evidence for this” annotations during multi-turn work — is a different mechanism on a different surface, running per-claim rather than per-answer and revisable across turns rather than committed in one shot. The 28% ECE is a critique of the single-number self-confidence on a frozen benchmark; it is not a critique of the agent’s underlying evidence handling, which is auditable separately. Both are real, both are honest about what they measure, and both have improvement work ahead — but they should not be conflated.
Cost
- Real cost from Anthropic + OpenAI billing dashboards for the run window: ~$3,500 + ~$25 judge.
- Per-Q (real): ~$1.60.
- The internal cost-tracker reports approximately 3× this due to a known accumulation bug. The corrected per-run summary line is the source for the dashboard-reconciled number above.
For comparison, typical published agent-with-tools submissions appear to spend $5-$15 per question based on described multi-stage architectures (specialized vision routing, OCR pre-passes, multi-model ensembles per question type). FF-STACK v8 uses a single uniform pipeline with two to three solver legs and one arbiter per question, resulting in materially lower marginal cost.
Limitations
- Image questions excluded. This submission is text-only.
- Single-run validation. The full-HLE result is from one run. Same-sample aggregate variance is essentially zero on the calibration replay; a multi-run mean on the full set was not pursued. Calibration samples suggest the single-run score sits in the lower portion of the expected variance band; additional runs would better estimate variance and central tendency. A 3-run full-HLE mean is candidate work for the next submission cycle.
- Calibration over-confidence. Ranking is correct; absolute confidence values are inflated by ~26-30pp across the top three buckets.
- Safety-refusal pattern on biological/chemical content. Approximately 3% of Bio/Chem questions trigger a content-filter refusal at the architecture level. The direct-commit bypass mechanism rescues most of these (validated on multiple Bio refusal-cascade cases). The bypass mechanism’s success rate is bounded by the cross-architectural leg’s knowledge on the same content.
- Five unrecovered questions. Out of 2,158 text-only questions, 2,153 saved successfully. Five hit deterministic content-filter refusal patterns that the bypass mechanism could not rescue. These five are scored as incorrect — they remain in the 2,158-question denominator and contribute zero to the 1,119 numerator. Maximum upside if all five had been recoverable and correct: +0.23pp.
- Opus 4.7 tuning is comparatively immature. The majority of the stack tuning occurred on Opus 4.6; the 4.7-specific work was compressed into a shorter window after 4.7 launched mid-project. The single-model 4.7 configs are correspondingly less mature than the 4.6-era gold, and per-mode tuning and further study on 4.7 remains an underexplored lever that could close or invert the 4.6/4.7 single-model gap (see the Cross-Generation Ablation). Additional 4.7-specific tuning is candidate work for the next submission cycle.
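The denominator handling behind the five unrecovered questions, and the worst-case leak adjustment from the filtering audit, reduce to the arithmetic below. The sketch uses only counts stated in this paper; the names are illustrative.

```python
TEXT_ONLY_QUESTIONS = 2158   # denominator used for the headline score
SAVED = 2153                 # questions with a saved prediction
CORRECT = 1119
UNRECOVERED = TEXT_ONLY_QUESTIONS - SAVED   # scored as incorrect


def pct(numerator: int, denominator: int = TEXT_ONLY_QUESTIONS) -> float:
    return 100.0 * numerator / denominator


headline = pct(CORRECT)
# Upper bound if every unrecovered question had been answered correctly.
max_upside = pct(CORRECT + UNRECOVERED) - headline
# Lower bound if the 2 correct answers that followed an "HLE" search are voided.
leak_worst_case = pct(CORRECT - 2) - headline

print(round(headline, 2), round(max_upside, 2), round(leak_worst_case, 2))
```

Keeping the unrecovered questions in the denominator is the conservative choice: dropping them would score 1,119 against 2,153 and raise the headline slightly.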
What Can Be Independently Verified
The proprietary boundary is deliberate. Everything that backs a claim in this document is independently checkable:
- The headline score. The scrubbed public predictions file (github.com/FieldframeLabs/HLE-Text-Run) plus its verify.py reproduces 1,119/2,158 = 51.85%, the per-domain breakdown, and the calibration buckets. Run it; it should match this paper line for line.
- Judge parity. Scoring used o3-mini in canonical mode, bit-identical to CAIS’s run_judge_results.py. The verification file (full model responses, telemetry stripped) is available to leaderboard maintainers, so the judge can be re-run independently and every verdict checked.
- Filtering policy. The 9-host blacklist is listed in full, not summarized. The audit basis is stated quantitatively: 0 blacklisted hits across 522 web_search + 95 web_fetch calls on the calibration runs; 2 attempted fetches on the full run to channels not on the run-time list, both questions independently judged wrong. All checkable against the verification file.
- Third-party baseline cross-check. The vanilla Opus 4.6 baseline measured here (18%) sits within sampling noise of ScaleAI’s independently published 19.37%. That is an external calibration point this submission does not control.
- Cost. Reconciled against Anthropic + OpenAI billing dashboards for the run window, with the ~3× internal cost-tracker overcount explicitly noted and corrected.
- Cross-generation ablation. Every cell — vanilla 4.6, the vanilla 4.7 modes, vanilla GPT-5.4, and the governed configs — ran on the same sample under the same canonical judge. The comparison is internally consistent, and the vanilla baselines can be checked against published numbers.
What cannot be independently reproduced is the parameter-level tuning — see What Remains Unpublished below. The architecture is reproducible at the box level; the score is verifiable end to end.
What Remains Unpublished
The following are deliberately not specified in this document and remain closed:
- Orchestration logic — per-domain pair selection, third-leg triggers, self-review pass conditions, broken-leg detection thresholds, refusal-bypass priority order.
- Trigger policies — spiral-detector thresholds, nudge content, self-review gating conditions.
- Governance content — the FF-LATTICE prompt text itself.
- Full runner configuration — runtime invocation, environment variables, worker-pool tuning, rate-limit settings.
The box-level architecture, the mechanism descriptions, the model stack, the filtering policy, the calibration data, and all empirical results above are publicly disclosed. The methodology is reproducible at the architectural level; the parameter-level tuning is not.
Predictions Files
Three artifacts, three levels of access:
- Scrubbed public file: question_id, domain, stated confidence, and per-question judge verdict for the full run. Published on GitHub: github.com/FieldframeLabs/HLE-Text-Run. Reproduces the headline score, the per-domain breakdown, and the calibration table. No model responses, no gold answers, no orchestration telemetry. The repo includes a verify.py that reproduces all three. Scrubbed files for the calibration and holdout sample runs are being added there as well.
- Verification file: adds full model responses, token usage, and the per-question judge_response, so the canonical judge can be re-run independently. Internal pipeline telemetry (routing config, mechanism trigger logs) is stripped. Available to leaderboard maintainers for submission verification.
- Raw predictions file: full model responses plus complete internal orchestration telemetry. Available on request under reasonable terms.
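For readers who want to see what re-scoring a scrubbed file involves, here is a rough sketch. It is not the repository's verify.py, which remains the authoritative script. The JSON Lines layout and the field names `domain` and `correct` are assumptions made for illustration; the published file may use a different format.

```python
import json
from collections import Counter


def score(lines, denominator=None):
    """Recompute accuracy from scrubbed records.

    `denominator` lets questions with no saved prediction count as
    incorrect (for this run: 2,153 saved rows scored against 2,158).
    """
    records = [json.loads(line) for line in lines if line.strip()]
    correct = sum(1 for record in records if record["correct"])
    total = denominator if denominator is not None else len(records)
    seen, right = Counter(), Counter()
    for record in records:
        seen[record["domain"]] += 1
        if record["correct"]:
            right[record["domain"]] += 1
    per_domain = {domain: right[domain] / seen[domain] for domain in seen}
    return correct, total, 100.0 * correct / total, per_domain


# Toy records in the assumed layout, not real run data.
toy = [
    '{"question_id": "q1", "domain": "Math", "stated_confidence": 90, "correct": true}',
    '{"question_id": "q2", "domain": "Math", "stated_confidence": 70, "correct": false}',
    '{"question_id": "q3", "domain": "Biology", "stated_confidence": 80, "correct": true}',
]
print(score(toy, denominator=4))
```

The explicit denominator mirrors how the headline score treats the five unrecovered questions: absent from the file, present in the total.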
Repository and Acknowledgments
The FF-STACK framework, the FF-LATTICE governance text, the pipeline orchestration logic, the per-domain routing configuration, and the codified agent (Cade) are closed source. The methodology described in this document, the failure-mode taxonomy, the evidence-tier framework, and the empirical results above are publicly disclosed.
This submission builds on Anthropic’s Claude API, OpenAI’s Chat Completions API, Brave Search, and a number of free domain APIs (Wikipedia, PubChem, UniProt, NCBI, Wolfram Alpha, Wikidata). The HLE benchmark is a collaboration of CAIS and Scale AI. The judge methodology is bit-identical to CAIS’s centerforaisafety/hle/hle_eval/run_judge_results.py.
Listed on the HLE Leaderboard for Agents with Tools (zoom-ai HuggingFace Space, added 2026-05-19). Scrubbed predictions and a verifier: github.com/FieldframeLabs/HLE-Text-Run.
Contact: edvorochkin@gmail.com