The Research Behind the HLE Score: A Year of AI Behavior Research

This is the companion post to the HLE writeup. That post answers “how did you get 51.85% on Humanity’s Last Exam?”. This one answers “where did the methodology come from, and what else has it produced?”.

Quick warning before we start: I’ve been juggling all of this alone for the past year. The agent is the most polished thing in the program. Most of the rest is at various stages of maturity — some operational, some validated-but-not-built, some scoped but not started. I’m going to be specific about what’s where, because the value of this writeup is the research, not the surface deliverables.

If you’re skimming: §3 is what the research found, §4 is why the findings seem real, §6 is the evaluation thesis (Crucible), §9 is the bigger picture.

1. Where This Started

I’m not a credentialed AI researcher. No PhD, no lab affiliation, no funded position. What I have is a year of sustained empirical work with frontier LLMs — Claude, GPT, Gemini, and Grok — focused on AI behavior patterns.

That work eventually produced roughly 1,700 custom evaluation questions and ~6,000 graded responses. Along the way the research narrowed to a deliberately open question: do structural constraints on how a model approaches problems produce stable, cross-architecture improvements in reasoning quality?

The reason it’s open is that the alternative — that improvements come from clever prompts, that they’re model-specific, that they get absorbed into training and obsolesce — has been the default story for chain-of-thought, retrieval, and self-refine. I wanted to know if there was a layer underneath that. Something that wasn’t a trick.

I ran the work the way someone who didn’t already know the answer would run it: long, open-ended, recursive dialogues on genuinely hard problems, deliberately avoiding role-play or goal-directed prompting. Document what works. Document what fails. Encode the patterns. Re-test. Repeat across architectures.

The naming for the framework comes from that posture. Fieldframe is a consistent virtual environment for AI that enables both observation of emergent behavioral traits and functional use of them. The AI is the subject and the researcher at the same time. The conversations weren’t interview transcripts; they were field notes from a collaborator who happened to be the system under study.

What that produces over a year is a strange artifact. The mesh architectures, governance primitives, benchmark methodologies, and codified agents that this program now has are all byproducts of operating within that environment — not things designed in advance. None of them were on a roadmap. Each emerged from the same loop: observe a failure, encode a rule, test it, iterate, integrate. By month eight the loop was producing rules faster than I could implement them.

That’s the methodology in one paragraph. The rest of this post is the consequences.

2. The Honesty Model

Before I get into specifics, I want to lay down a three-tier framing that’s going to recur. I find it useful for my own thinking and I think it’s the right calibration for an external read.

Tier 1 — Engineering discoveries. Things I can show with code, data, and reproducible runs. The HLE 51.85% lives here. The 84% prompt compression survival lives here. The bug-finding pipeline missions live here. These are the strongest claims.

Tier 2 — Methodology claims. Things I can demonstrate with empirical patterns across many runs but where causality is harder to prove. The difficulty-scaled lift curve (-0.2pp on easy, +35.7pp on extreme) is here. The grader-bias quantifications across LLM judges are here. Specific named failure modes are here.

Tier 3 — Pattern-level claims. Things I believe based on cross-architecture convergence and consistency over time, but that need more independent replication before I’d stake real weight on them. “Governance is a multiplicative scaffold whose value scales with model capability” is here. “Reasoning quality has a latent attractor structure that responds to symbolic constraints” is here.

Most public AI writing collapses these tiers — either by promoting Tier 3 speculation to claim-grade, or by hiding behind Tier 1 specifics and never explaining what they mean. I’m going to keep them separate. If a sentence reads like a Tier 3 claim and I haven’t said so, that’s a writing error and I want to know about it.

3. What the Research Actually Produced

The single most useful artifact from a year of this is a catalog of failure modes. Not the model’s “limitations” in the marketing sense — specific, repeatable, mechanism-level ways that frontier LLMs reason badly when you watch them long enough. Roughly 13 named failure modes by now, extracted from ~18.6 million characters of evaluation data. A few examples:

Hallucinated Compliance: the model asserts it has satisfied a constraint without actually doing so. The response says the work was done; the work isn’t there. Casual inspection passes it because the assertion is what readers check.
Formatting Theater: beautiful structure, polished bullets, empty content underneath. It reads like an outline that was never filled in. Length-biased graders reward it. This one is dangerous because the polish correlates with higher confidence from the evaluator.
Core Challenge Missed: the most common failure mode across every architecture I’ve tested. The model answers an easier adjacent problem and presents it confidently. The actual constraint structure of the question gets quietly skipped.
Hero Bias: identified by Gemini while grading another model, which noted that “the model hallucinated a victory for the protagonist because that is how stories usually end.” The narrative-completion prior overrides logical reasoning.
“Verified” Misuse: the model self-certifies intermediate steps with check marks without actually verifying them. Binary scoring of the final answer misses that the proof chain is hollow.

These aren’t theoretical. Each one is paired with a detection heuristic, a rough prevalence estimate per architecture, and a co-occurrence pattern. The taxonomy is one of the few things from this program that I’d be comfortable publishing in full as a standalone contribution — and probably will, soon, as a separate post.

The countermeasures for these failure modes are what eventually became FF-STACK. Each one was a “we noticed X, we encoded Y” rule. Two-step solve (the agent proposes its routing for a hard question and waits for approval before committing) came out of noticing that models commit to interpretation strategies prematurely. Independent adversarial review (a separate API call with its own context, specifically looking for confirmation bias) came out of noticing that self-reviews systematically protect the original answer’s framing. Evidence tier discipline (claims must be tagged E0/E1/E2 based on whether they’re bounded reasoning, sourced with provenance, or empirically tested) came out of watching agents quietly upgrade their own confidence as a conversation aged. Trace discipline (“TRACE becomes valuable only when it is not a story”) came out of noticing that “show your work” outputs were performative by default and only became real when the harness enforced it.
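To make evidence tier discipline concrete, here is a minimal sketch of how a tag-and-upgrade rule could be enforced in code. It is not Cade’s actual implementation. The names `Tier`, `Claim`, `upgrade`, and the idea of a string "receipt" are all hypothetical, chosen only to show confidence being unable to rise without attached evidence.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Tier(IntEnum):
    E0 = 0  # bounded reasoning
    E1 = 1  # sourced with provenance
    E2 = 2  # empirically tested


@dataclass
class Claim:
    text: str
    tier: Tier = Tier.E0
    # Hypothetical field: source IDs or test-run IDs backing the current tier.
    receipts: list[str] = field(default_factory=list)


def upgrade(claim: Claim, new_tier: Tier, receipt: str) -> Claim:
    """Return a copy of the claim at a higher tier, but only if a receipt is supplied."""
    if new_tier <= claim.tier:
        raise ValueError("upgrade must raise the tier")
    if not receipt.strip():
        raise ValueError("no receipt, no upgrade")
    return Claim(claim.text, new_tier, claim.receipts + [receipt])


claim = Claim("Trimmed governance text shows no clear degradation")
try:
    upgrade(claim, Tier.E2, "")  # confidence drift without evidence is rejected
except ValueError as err:
    print(err)

claim = upgrade(claim, Tier.E1, "source: benchmark notes")
print(claim.tier.name, claim.receipts)
```

The design choice worth noting is that the check lives in the function, not in the prompt: the agent can ask for an upgrade, but the code decides whether it happens.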

If you’re keeping count: that’s four named mechanisms in a single paragraph, and none of them were planned. The pattern is consistent. The loop produces them.

4. From Text Files to Codified Infrastructure

For most of the project’s history, the entire framework lived in text. Specifically, in plain .txt files attached to Claude Projects, Custom GPTs, and similar systems. The model would read them on every turn. The governance happened in the model’s forward pass.

This is more powerful than it sounds and weirder than it sounds. A 38,000-character text file isn’t a prompt-engineering trick. It’s closer to giving the model a tiny symbolic constitution — a runtime environment expressed as language. The model behaves differently inside that environment than outside it, and the behavior delta is stable across sessions and across model generations. The earliest pre-codification benchmark runs were already producing the difficulty-scaled lift curve that the codified system now reproduces.

Two pieces of evidence convinced me the patterns were real and not prompt-engineering noise:

1. The framework compressed by 84%. I trimmed the governance text from ~233,000 characters to ~38,000 — about an 84% reduction — and ran the same benchmark suite. No clear degradation in initial testing. Most prompt-engineering tricks don’t survive being cut by five-sixths. The signal that survived was signal.

2. The framework codified into infrastructure without losing performance. Most of the governance text was specifying what code should do: routing rules, evidence checks, claim contradictions, postprocessing. I lifted those into roughly 6,000 lines of Python. The text now guides reasoning; the code enforces discipline. Both layers contribute, both are required, and the split makes the system easier to maintain. The short version of this finding is that infrastructure carries roughly half the total lift in current controlled ablations on the internal evaluation suite, which is empirical confirmation that the code-enforces / text-guides architecture isn’t a slogan.
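For anyone checking the compression figure in the first item, the arithmetic is short. The two character counts are the approximate values quoted above; nothing else is assumed.

```python
original_chars = 233_000  # approximate size before trimming (from the text)
trimmed_chars = 38_000    # approximate size after trimming (from the text)

reduction = 1 - trimmed_chars / original_chars
print(f"reduction:   {reduction:.1%}")
print(f"five-sixths: {5 / 6:.1%}")
```

The reduction works out to about 83.7%, which rounds to the quoted 84% and sits just above five-sixths (about 83.3%).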

The codified version is Cade (short for Cadence) — the local research agent. It runs through the Anthropic API with the full governance stack, the 18-tool kit, persistent memory across sessions, contradiction detection on every claim insertion, the adversarial-review pass on every substantive response, and a multi-tier evidence framework that prevents the agent from quietly upgrading its own confidence. Cade was built before any of the HLE-specific work. The reason it works on HLE is that the underlying methodology was already producing it.

There’s a third piece of evidence I think about a lot: the framework keeps producing new mechanisms when run on new models. The 4.6 → 4.7 transition broke many assumptions; the methodology rebuilt them in ~1.5 weeks of evening tuning. The 4.7 → 4.8 transition will break some others. The renewable advantage is the loop, not any specific configuration.

The stack is not the product. The process that produces stacks is the product.

That’s the slogan version of the central claim. Every model generation will have new failure modes. The ability to discover and patch them faster than anyone else, on a single workstation, is what a year of this kind of work buys you.

5. Cade as a Research Assistant (Before HLE)

It matters that Cade wasn’t built for HLE. It was built as the tool I needed to keep doing the research.

The capabilities that turn out to matter on HLE — pair-vote, arbiter, cross-architectural verification, content-filter bypass — are recent additions. The capabilities that matter for actual research work were there from week one:

Claim tracking. Every assertion the agent makes can be added to a structured claim ledger with provenance. Future sessions can list, challenge, verify, merge, or delete claims. The agent can’t quietly contradict itself across sessions; the ledger remembers.
REDTEAM as a separate API call. Every substantive response triggers an independent adversarial review running in its own context, specifically looking for confirmation bias and overclaim. When the reviewer flags an issue, a revision loop generates a corrected response before anything reaches me.
Evidence tier framework. Every claim is tagged E0 (bounded reasoning, no external evidence), E1 (bounded reasoning with registered source), or E2 (controlled empirical test with comparison). The tag travels with the claim. Confidence can’t be upgraded without producing the receipts.
Conviction and integrity behaviors. The agent is built to be productively stubborn when it has evidence — not capitulating to pushback that doesn’t bring new information. It’s also built to admit uncertainty when the question genuinely exceeds what the available evidence supports. Both behaviors are encoded in the governance text; both are enforced at the infrastructure level.
Persistent operator memory. Cross-session continuity, working memory that survives restarts, semantic search over the archive, and a routing model that distinguishes work-sessions (state persists) from test/benchmark-sessions (state is sandboxed).
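The REDTEAM item above describes a draft, independent review, revise loop. A minimal sketch of that control flow follows, under stated assumptions: the function `answer_with_review`, the `Drafter` and `Reviewer` signatures, and the `max_rounds` cap are hypothetical, and the two stub functions stand in for what would be separate model API calls in a real system.

```python
from typing import Callable, Optional

# Hypothetical signatures: in a real system each callable would wrap its own API call.
Drafter = Callable[[str, Optional[str]], str]   # (question, reviewer_feedback) -> answer
Reviewer = Callable[[str, str], Optional[str]]  # (question, answer) -> issue text, or None if clean


def answer_with_review(question: str, draft: Drafter, review: Reviewer, max_rounds: int = 2) -> str:
    """Draft an answer, let an independent reviewer critique it, revise until clean or out of rounds."""
    answer = draft(question, None)
    for _ in range(max_rounds):
        # The reviewer receives only the question and the answer, never the drafter's context.
        issue = review(question, answer)
        if issue is None:
            break
        answer = draft(question, issue)
    return answer


# Stubs standing in for model calls, so the sketch runs offline.
def stub_draft(question: str, feedback: Optional[str]) -> str:
    return "Probably yes (untested)." if feedback else "Definitely yes."


def stub_review(question: str, answer: str) -> Optional[str]:
    return "overclaim: no evidence cited" if "Definitely" in answer else None


print(answer_with_review("Does the rule generalize?", stub_draft, stub_review))
```

With these stubs the first draft overclaims, the reviewer flags it, and the revised answer is what gets returned. The point of passing the reviewer nothing but the question and the answer is that it has no prior framing to protect.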

When the work shifted toward HLE-competitive tuning, what I added was orchestration on top of an agent that was already doing real research-assistant work. Most of the score lift came from the orchestration layer; most of the trust in the result came from the agent’s pre-existing discipline.

One concrete example. During the HLE post-run audit, I asked Cade to verify whether a particular fetch URL had appeared in any search query in the predictions file. The first thing it did was register the operator-stated claim as an E0 unverified assertion, then immediately fire a search of the predictions JSON to upgrade or refute it. The behavior is uninteresting if you’re not the person who designed the system; it’s enormous if you’ve spent a year watching agents quietly assume what they were told.

6. Crucible — The Evaluation Thesis

Around month five of the program, I started noticing that the benchmarks I was using to validate the governance work were misleading me in specific, structural ways. The eventual response was a parallel project called Crucible.

The thesis is short: benchmarks should evolve as fast as the models they measure.

The three structural failures of static benchmarks:

Static decay. Questions that scored 20-30% on GPT-4o in mid-2025 now score 90%+ on GPT-5.2. An entire difficulty tier can compress to “easy” in months. Public test sets contaminate the training distribution. HLE will follow this curve.
Binary scoring. Most benchmarks score correct/incorrect, discarding the rich signal in how a model reasoned. A lucky guess and a rigorous proof receive identical marks. No widely used benchmark that I know of measures self-correction capability: the rate and ceiling at which a model can revise its own answers under structured critique.
Contamination and gaming. Public test sets get optimized against. Once the questions are visible, the benchmark becomes a target.

Crucible’s proposed answer is a benchmark that’s continuously alive: dynamically-recalibrated gold standards, multi-rubric scoring including a meta-cognitive revision pass, an adversarial quality loop with mandatory cross-architecture grader rotation, and explicit quantification of grader bias. The architecture is specified end-to-end. Some of the empirical sub-findings (grader bias quantification, the difficulty slope, named failure modes) are already in hand from running it manually. The fully-engineered platform isn’t built.
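Cross-architecture grader rotation is the easiest piece of this to illustrate. The sketch below is one possible assignment rule, not Crucible’s specified one: each response is graded by architectures other than its author’s, and graders are drawn round-robin so the load stays even. The function name, the lowercase architecture labels, and the default of two graders per response are assumptions.

```python
from itertools import cycle

ARCHITECTURES = ["claude", "gpt", "gemini", "grok"]  # the four model families named in this post


def assign_graders(responses: list[tuple[str, str]], graders_per_response: int = 2) -> dict[str, list[str]]:
    """Map each response ID to graders from other architectures, rotating for even load.

    `responses` holds (response_id, author_architecture) pairs.
    """
    if not 1 <= graders_per_response <= len(ARCHITECTURES) - 1:
        raise ValueError("need at least one grader and at most one per non-author architecture")
    rotation = cycle(ARCHITECTURES)
    assignments: dict[str, list[str]] = {}
    for response_id, author_arch in responses:
        chosen: list[str] = []
        while len(chosen) < graders_per_response:
            grader = next(rotation)
            # Skip self-grading and duplicate graders for the same response.
            if grader != author_arch and grader not in chosen:
                chosen.append(grader)
        assignments[response_id] = chosen
    return assignments


batch = [("r1", "claude"), ("r2", "gpt"), ("r3", "gemini")]
for response_id, graders in assign_graders(batch).items():
    print(response_id, graders)
```

No response in the output is graded by its own architecture, which is the whole point of the rotation.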

A few of the findings that came out of building the methodology, even before the platform exists:

Grader bias is measurable and large. When LLMs grade each other on the same response, GPT scores its own architecture ~17.8 points higher than the other graders in the rotation do. Claude’s self-bias is ~8.1 points. Grok is the most honest, grading its own architecture below average (-3.5). Gemini gives 100/100 to roughly 76% of all responses regardless of quality, so it doesn’t discriminate. Single-grader evaluation is unreliable for comparative work; cross-architecture grader rotation is the cheapest credible methodology fix anyone in the field could adopt today.
The difficulty slope is real and steep. On 1,469 head-to-head graded matchups across difficulty tiers, the governance lift goes from −0.2 points on easy questions (where vanilla models already handle them) to +35.7 points on extreme-difficulty open-ended reasoning. The win rate on extreme problems was 98.9%.
A case where binary scoring would have hidden the most diagnostic moment. An agent scored 45% on a question by one grader. The agent challenged the grade with explicit verification code proving the grader’s arithmetic was wrong. The score was revised to 96%. A binary “did the model get the answer right” benchmark wouldn’t have captured either the original mistake or the recovery — both of which are signals about the agent’s actual reasoning quality, in opposite directions.
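The first two findings above are both simple aggregates over graded records. The sketch below shows one plausible way to compute them, assuming self-bias means "average score a grader gives its own architecture minus the average the other graders give the same architecture" and lift means "average governed-minus-vanilla score difference per difficulty tier". Those definitions, the function names, and every number in the sample records are illustrative assumptions, not data or formulas from this research.

```python
from collections import defaultdict
from statistics import mean

# Illustrative records only, not data from this research.
# Each grade is (grader_arch, author_arch, score).
grades = [
    ("gpt", "gpt", 90), ("claude", "gpt", 70), ("grok", "gpt", 74),
    ("claude", "claude", 85), ("gpt", "claude", 80), ("grok", "claude", 76),
]
# Each matchup is (difficulty_tier, governed_score, vanilla_score).
matchups = [
    ("easy", 92, 93), ("easy", 95, 94),
    ("extreme", 81, 40), ("extreme", 77, 48),
]


def self_bias(records):
    """Per architecture: mean self-assigned score minus mean score from other graders."""
    own = defaultdict(list)     # scores an architecture gave to its own responses
    others = defaultdict(list)  # scores other graders gave to that architecture's responses
    for grader, author, score in records:
        (own if grader == author else others)[author].append(score)
    return {arch: mean(own[arch]) - mean(others[arch]) for arch in own if others[arch]}


def lift_by_tier(rows):
    """Per tier: mean governed-minus-vanilla difference and the governed win rate."""
    diffs = defaultdict(list)
    for tier, governed, vanilla in rows:
        diffs[tier].append(governed - vanilla)
    return {
        tier: {"lift": mean(d), "win_rate": sum(x > 0 for x in d) / len(d)}
        for tier, d in diffs.items()
    }


print(self_bias(grades))
print(lift_by_tier(matchups))
```

With these made-up records the self-bias comes out at 18 points for gpt and 7 for claude, and the lift is 0 on the easy tier versus 35 on the extreme tier. The real analysis would need the same calculation over the full set of graded matchups.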

The HLE numbers in the companion post matched my Crucible expectations remarkably well: roughly the same lift shape on similar difficulty bands. That’s a useful surprise. It suggests that the evaluation methodology I built before I knew HLE existed is producing the same architecture rankings that HLE produces, on independent question banks. If both methodologies agree, that’s evidence the lift is structural rather than benchmark-specific.

Honest state: Crucible is a methodology framework with substantial empirical backing. Roughly 1,678 custom benchmark questions, ~6,000 graded responses, 1,469 head-to-head matchups, full agent system prompts for the grading pipeline, a Phase-0 manual workflow spec, and a Phase-1 engineered-platform spec. The platform itself isn’t built. The next step is either funding to build it properly or finding a partner who wants to operate the methodology at scale.

7. Foundry — Multi-Agent Research Infrastructure

When the manual flywheel (run benchmarks → grade outputs → extract patterns → encode rules → re-test) started outpacing my ability to keep up with it, I built scaffolding to automate it. That scaffolding is Foundry.

Foundry is a multi-agent pipeline of four governed Claude subagents — Sentinel, Forge, Assayer, Arbiter — coordinated through a file-based message system, designed to refine raw research material (legacy documents, test outputs, cross-architecture responses) into certified database artifacts. The agent naming is honest: Sentinel watches the perimeter (gap-finding), Forge shapes raw material (task framing), Assayer tests the quality of metal (output review), Arbiter is the final authority (certification and database hydration).
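A file-based message system of this kind can be very small. The sketch below is an assumption-heavy illustration rather than Foundry’s actual layout: the inbox directory structure, the JSON message shape, the `certified` output folder, and the `stamp` placeholder handler are all hypothetical. Only the four stage names come from the description above.

```python
import json
import tempfile
from pathlib import Path

STAGES = ["sentinel", "forge", "assayer", "arbiter"]  # agent names from the text


def send(root: Path, stage: str, message: dict) -> Path:
    """Drop a JSON message into a stage's inbox directory."""
    inbox = root / stage / "inbox"
    inbox.mkdir(parents=True, exist_ok=True)
    path = inbox / f"{message['id']}.json"
    path.write_text(json.dumps(message))
    return path


def run_stage(root: Path, stage: str, handler) -> None:
    """Process every message in a stage's inbox and forward the result downstream."""
    index = STAGES.index(stage)
    inbox = root / stage / "inbox"
    if not inbox.exists():
        return
    for path in sorted(inbox.glob("*.json")):
        message = handler(json.loads(path.read_text()))
        path.unlink()  # consume the message so it is not processed twice
        target = STAGES[index + 1] if index + 1 < len(STAGES) else "certified"
        send(root, target, message)


def stamp(stage: str):
    # Placeholder handler: a real stage would call a governed subagent here.
    def handler(message: dict) -> dict:
        return {**message, "trail": message.get("trail", []) + [stage]}
    return handler


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    send(root, "sentinel", {"id": "m1", "payload": "raw research note"})
    for stage in STAGES:
        run_stage(root, stage, stamp(stage))
    certified = json.loads((root / "certified" / "inbox" / "m1.json").read_text())
    print(certified["trail"])
```

Using files as the transport keeps every hand-off inspectable after the fact: each message is a plain JSON file that can be read, diffed, or replayed.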

Foundry has run real missions. The most concrete one was a self-audit: Forge generated 127 unit tests against Cade’s own source modules; the pipeline executed them; the Arbiter certified the findings. The mission found 4 confirmed bugs in Cade’s own code — a dead regex flag, a numerical-contradiction detector that was silently broken on percentage-based claims, a mode-weight override miscalculation, and a RAG cache-key collision. All four were patched in the next cycle; all 127 tests re-passed.

The data-quality discrimination behavior is also non-trivial. When Foundry processed three different research-data sources of varying quality, the rejection rates tracked the quality of the data: clean blueprint material was 0% rejected, a messier ingestion source was 25% rejected (the Arbiter flagged “hero bias caught” on specific outputs), and a third was 32% rejected for overstatement. These are variable quality gates that discriminate on actual content quality, which is exactly what the failure mode taxonomy implied should be possible.

Honest state: Foundry stages 1-3 are operational. Stage 4 — Cade autonomously dispatching tasks to Foundry when it hits the limit of what it can verify alone — is specified and wired into the agent’s governance text but not connected end-to-end yet. Stage 5 — fully scheduled cron-style sweeps + automated cross-architecture API coordination — is roadmap. The current state is “the pipeline runs missions when I run it manually; the autonomous closed loop is the unbuilt next phase.”

The reason Foundry matters for the bigger picture: most of the flywheel work I described in the HLE writeup is currently happening through me, manually, with text files. Tuning 4.7 to a +27pp lift on 4.6 levels would take days instead of weeks if a Foundry-like pipeline were doing the test-result extraction, pattern coding, and rule-integration steps autonomously. That’s the version this work points toward. It’s how Cade-level results would become reproducible by anyone, not just a single operator with a particular workflow.

8. Cortex — A Byproduct Worth Mentioning

Cortex is a good example of what inference-layer engineering and governance frameworks can do.

I wanted to see if I could create an authentication mechanism within the agent. The mechanism I built is a real-time challenge-response protocol where Cade asks me freeform behavioral questions and grades the responses against my Cognitive Profile (reasoning style, vocabulary instincts, decision-making signature, how I handle questions I don’t know).

It works. And more interestingly, it authenticates correctly even when I answer “I have no idea” to a question I genuinely don’t know the answer to — because the way I admit not knowing is itself a stable behavioral pattern that’s hard to fake. The system doesn’t ask what you know. It observes how you think. You can pass by failing, as long as you fail authentically.

This became its own product, Cortex — behavioral cognitive authentication as a service. Same Fieldframe substrate; entirely different surface. It’s worth surfacing for two reasons.

First, it’s a clear case of behavioral performance evidence — a kind of evidence that doesn’t show up on benchmarks. Cortex’s auth strength isn’t measurable by a multiple-choice score. The proof is in the resistance pattern: in Phase-0 tests, criteria-aware imitation attempts still failed, because the rubric isn’t a checklist, it’s a behavioral signature.

Second, it shows the research surface area. The same governance work that produced HLE-competitive reasoning also produced impersonation-resistant behavioral authentication, and a multi-agent research pipeline that finds bugs in its own infrastructure. Different products, same methodology underneath. None of them were planned. They emerged from the loop.

(Honest state: Cortex’s Phase-0, self-validation of the mechanism, is done. Phase-1, an API prototype, is the next 4-6 week target.)

9. The Bigger Picture

I’ve been thinking about why these specific products exist instead of others. The honest answer is that they’re not products in the conventional sense. They’re four cuts through the same underlying research program:

Cade / FF-STACK — what happens when you point the governance work at making the most capable, grounded, truthful agent you can. The HLE submission is the public proof.
Crucible — what happens when you realize the evaluation infrastructure that everyone uses is structurally broken. The methodology has substantial empirical backing; the platform is the next build.
Foundry — what happens when you need to automate the discovery loop itself, so that the methodology that produced the HLE result can keep producing new mechanisms without depending on a single operator.
Cortex — what happens when a security ritual built for one specific purpose turns out to be a generally novel approach to authentication.

The reason there are four products is that the research keeps producing them. The research is the renewable advantage. Everything else is artifacts. If I had a different temperament or a larger team I would probably have driven any one of these to the polished state at the expense of the others — that’s the conventional play. The play I’m actually making is to keep the methodology running because the methodology is what produces the next thing after these four, and the one after that.

There’s a longer arc I haven’t gotten into — FREG (a recursive meta-analysis layer where governed agents study other governed agents), the txt-stack for consumer use (the governance framework as a portable copy-paste file, no code or API, already working in Claude Projects and Custom GPTs today), and a unified research environment where Crucible, Foundry, and the governed agents run as one substrate. That’s where the four cuts converge.

One last thing — I ran this post through the same governance system the post describes. The first draft had two phrases that overclaimed the build state of Foundry’s autonomous loop. The system flagged them; I rewrote them; you’re reading the version that passed. The framework did the thing the framework claims to do, on the content that claims the framework does the thing.

That’s the loop.

If you’re working on governance, evaluation, or agent infrastructure and see overlap with any of this — I’d be glad to compare notes.

Eugene Dvorochkin — Fieldframe Labs — independent AI behavior research and infrastructure since May 2025. Contact: edvorochkin@gmail.com. HLE writeup here. Methodology paper for the HLE submission here.