Skip to content
Fieldframe Labs logo
Go back

The Research Behind the HLE Score: A Year of AI Behavioral Research

This is the companion post to the HLE writeup. That post answers “how did you get 51.85% on Humanity’s Last Exam?”. This one answers “where did the methodology come from, and what else has it produced?”.

For full transparency before I start: I’ve been juggling all of this alone for the past year. The agent is the most polished thing in the program. Most of the rest is at various stages of maturity: some operational, some validated-but-not-built, some scoped but not started. I’m going to be specific about what’s where, because the value of this writeup is the research, not the surface deliverables.

If you’re skimming: §3 is what the research found, §4 is why the findings seem real, §6 is the evaluation thesis (Crucible), §9 is the bigger picture.


1. Where This Started

I’m not a credentialed AI researcher. No PhD, no lab affiliation, no funded position. What I have is a year of sustained empirical work with frontier LLMs (Claude, GPT, Gemini, and Grok), focused on AI behavior patterns.

This work spanned a wide range of theories, designs, tests, and methods for pressure-testing how these models behave. No predetermined direction, just following where the work led, and learning along the way.

But the exploration kept circling the same thing, and over time it sharpened into a specific question: can structured constraints on how a model approaches a problem produce reasoning gains that transfer across architectures and model generations? The usual answer (that such gains come from clever prompts, are model-specific, and get absorbed into training) has been the pattern for chain-of-thought, retrieval, and self-refine. The bet was that there’s a layer underneath that. Something that wasn’t a trick.

I ran the work the way someone who didn’t already know the answer would run it: long, open-ended, recursive dialogues on genuinely hard problems, deliberately avoiding role-play or goal-directed prompting. Document what works. Document what fails. Encode the patterns. Re-test. Repeat across architectures.

The naming for the framework predates the AI work. Fieldframe originally pointed at the idea of a field, something that permeates and connects everything within it, and at the discipline of working inside such a field rather than from outside it. Applied to language models, the analogy lands on the latent space of the transformer: the high-dimensional manifold where everything the model “knows” sits between forward passes. You can’t address that space directly; you operate in the structured environment around it.

The working definition has shifted as the research did. The earliest framing called it a non-parametric orchestration layer, a way to get stable, reproducible reasoning out of stateless models without fine-tuning or retraining. A later round called it a consistent virtual environment for emergence mapping: a structured space models could inhabit where behaviors could be observed and then operationalized. A more recent round narrowed in on the inference layer specifically, as an experimental methodology for discovering reasoning-governance patterns through structured interaction. The current working definition, the one used elsewhere on this site, distills what all of those had in common: Fieldframe is a consistent operating environment for observing and structuring language model behavior, with the models themselves as both subject and instrument.

The thing being named has elements of meta-recursive learning, emergent and standard behavioral mapping, systems-building, and at the most reductive level, systems observing systems and improving those systems. It hasn’t been easy to box into a single label, and the label has moved as the research matured. Operating environment is the framing that best fits the current state of the work. Whether it’ll still be the right one a year from now is itself part of what the research is meant to find out.

The conversations weren’t interview transcripts; they were field notes from a system that was helping investigate itself.

What that produces over a year is a strange artifact. The mesh architectures, governance primitives, benchmark methodologies, and codified agents that this program now has are all byproducts of operating within that environment, not things designed in advance. None of them were on a roadmap. Each emerged from the same loop: observe a failure, encode a rule, test it, iterate, integrate. By month eight the loop was producing rules faster than I could implement them.

That’s the methodology in one paragraph. The rest of this post is the consequences.


2. The Honesty Model

Before I get into specifics, I want to lay down a three-tier framing that’s going to recur. I find it useful for my own thinking and I think it’s the right calibration for an external read.

Tier 1: Engineering discoveries. Things I can show with code, data, and reproducible runs. The HLE 51.85% lives here. The prompt-compression survival lives here. The bug-finding pipeline missions live here. These are the strongest claims.

Tier 2: Methodology claims. Things I can demonstrate with empirical patterns across many runs but where causality is harder to prove cleanly. The difficulty-scaled lift curve (-0.2pp on easy, +35.7pp on extreme) is here. The grader-bias quantifications across LLM judges are here. Specific named failure modes are here.

Tier 3: Pattern-level claims. Things I believe based on cross-architecture convergence and consistency over time, but that need more independent replication before I’d stake real weight on them. “Governance is a multiplicative scaffold whose value scales with model capability” is here. “Reasoning quality has a latent attractor structure that responds to symbolic constraints” is here.

AI writing often blurs these tiers, either by promoting Tier 3 speculation to claim-grade, or by hiding behind Tier 1 specifics and never explaining what they mean. I’m going to keep them separate. If a sentence reads like a Tier 3 claim and I haven’t said so, that’s a writing error and I want to know about it.


3. What the Research Actually Produced

The single most useful artifact from a year of this is a catalog of failure modes. Not the model’s “limitations” in the marketing sense, but specific, repeatable, mechanism-level ways that frontier LLMs reason badly when you watch them long enough. Roughly 13 named failure modes by now, extracted from a year of evaluation data. A few examples:

These aren’t theoretical. Each one is paired with a detection heuristic, a rough prevalence estimate per architecture, and a co-occurrence pattern. The taxonomy is one of the few things from this program that I’d be comfortable publishing in full as a standalone contribution, and probably will, soon, as a separate post.

Even so, the catalog isn’t the differentiator. Naming these modes isn’t novel; several are widely recognized by now, and anyone who watches models long enough has seen something like hero bias or formatting theater, named or not. What compounds is that every mode is paired with a countermeasure wired into the agent. The value was never in noticing that models hallucinate compliance; it’s in having a patch that fires automatically for it, on every architecture, every run.

The countermeasures for these failure modes are what eventually became FF-STACK. Each one was an “I noticed X, I encoded Y” rule. Two-step solve (the agent proposes its routing for a hard question and waits for approval before committing) came out of noticing that models commit to interpretation strategies prematurely. Independent adversarial review (a separate API call with its own context, specifically looking for confirmation bias) came out of noticing that self-reviews systematically protect the original answer’s framing. Evidence tier discipline (claims must be tagged E0/E1/E2 based on whether they’re bounded reasoning, sourced with provenance, or empirically tested) came out of watching agents quietly upgrade their own confidence as a conversation aged. Trace discipline (“TRACE becomes valuable only when it is not a story”) came out of noticing that “show your work” outputs were performative by default and only became real when the harness enforced it.

If you’re keeping count: that’s a half-dozen named mechanisms in a single paragraph, and none of them were planned. The pattern is consistent. The loop produces them.


4. From Text Files to Codified Infrastructure

For most of the project’s history, the entire framework lived in text. Specifically, in plain .txt files attached to Claude Projects, Custom GPTs, and similar systems. The model would read them on every turn. The governance happened in the model’s forward pass.

This is more powerful than it sounds, and more atypical. A 38,000-character text file isn’t a prompt-engineering trick. It’s closer to giving the model a tiny symbolic constitution: a runtime environment expressed as language. The model behaves differently inside that environment than outside it, and the behavior delta is stable across sessions and across model generations. The earliest pre-codification benchmark runs were already producing the difficulty-scaled lift curve that the codified system now reproduces.

Two pieces of evidence convinced me the patterns were real and not prompt-engineering noise:

1. The framework survived heavy compression. I trimmed the governance text from over 100,000 characters down to about 38,000 and ran it through the same custom testing suite (my own evaluation questions, not HLE). No clear degradation in initial testing. A lot of prompt-engineering scaffolding wouldn’t survive a cut that deep; the part that did appears to be doing real work.

2. The framework codified into infrastructure without losing performance. Most of the governance text was specifying what code should do: routing rules, evidence checks, claim contradictions, postprocessing. I lifted those into roughly 6,000 lines of Python. The text now guides reasoning; the code enforces discipline. Both layers have measurable lift in current ablations, and the strongest results come from the combination. The split also makes the system easier to maintain. The exact decomposition between text and code shifts across model generations and is still being characterized; what matters more than the precise split is that splitting the architecture this way keeps both halves independently auditable.

The codified version is Cade (short for Cadence), the local research agent. It runs through the Anthropic API with the full governance stack, the 18-tool kit, persistent memory across sessions, contradiction detection on every claim insertion, the adversarial-review pass on every substantive response, and a multi-tier evidence framework that prevents the agent from quietly upgrading its own confidence. Cade was built before any of the HLE-specific work. The reason it works on HLE is that the underlying methodology was already producing it.

There’s a third piece of evidence I think about a lot: the framework keeps producing new mechanisms when run on new models. The 4.6 → 4.7 transition broke many assumptions; the methodology rebuilt them in ~1.5 weeks of evening tuning. The 4.7 → 4.8 transition will break some others. The renewable advantage is the loop, not any specific configuration.

The stack is not the product. The process that produces stacks is the product.

That’s the slogan version of the central claim. Every model generation will have new failure modes. The ability to discover and patch them faster than anyone else, on a single workstation, is what a year of this kind of work buys you.


5. Cade as a Research Assistant Before HLE

It matters that Cade wasn’t built for HLE. It was built as the tool I needed to keep doing the research.

The capabilities that turn out to matter on HLE (pair-vote, arbiter, cross-architectural verification, content-filter handling) are recent additions. The capabilities that matter for actual research work were there from week one:

When the work shifted toward HLE-competitive tuning, what I added was orchestration on top of an agent that was already doing real research-assistant work. Most of the score lift came from the orchestration layer; most of the trust in the result came from the agent’s pre-existing discipline.

One concrete example. During the HLE post-run audit, I asked Cade to verify whether a particular fetch URL had appeared in any search query in the predictions file. The first thing it did was register the operator-stated claim as an E0 unverified assertion, then immediately fire a search of the predictions JSON to upgrade or refute it. The behavior is uninteresting if you’re not the person who designed the system; it’s enormous if you’ve spent a year watching agents quietly assume what they were told.


6. Crucible: The Evaluation Thesis

Around month five of the program, I started noticing that the benchmarks I was using to validate the governance work were misleading me in specific, structural ways. The eventual response was a parallel project called Crucible.

The thesis is short: benchmarks should evolve as fast as the models they measure.

The three structural failures of static benchmarks:

  1. Static decay. Questions that scored 20-30% on GPT-4o in mid-2025 now score 90+ on GPT-5.2. An entire difficulty tier can compress to “easy” in months. Public test sets contaminate the training distribution. Any public test set follows this curve.
  2. Binary scoring. Most benchmarks score correct/incorrect, discarding the rich signal in how a model reasoned. A lucky guess and a rigorous proof receive identical marks. No widely-used benchmark measures self-correction capability: the rate and ceiling at which a model can revise its own answers under structured critique.
  3. Contamination and gaming. Public test sets get optimized against. Once the questions are visible, the benchmark becomes a target.

Crucible’s proposed answer is a benchmark that’s continuously alive: dynamically-recalibrated gold standards, multi-rubric scoring including a meta-cognitive revision pass, an adversarial quality loop with mandatory cross-architecture grader rotation, and explicit quantification of grader bias. The architecture is specified end-to-end. Some of the empirical sub-findings (grader bias quantification, the difficulty slope, named failure modes) are already in hand from running it manually. The fully-engineered platform isn’t built.

A few of the findings that came out of building the methodology, even before the platform exists:

The HLE numbers in the companion post calibrated remarkably well with my Crucible expectations. Roughly the same lift shape on similar difficulty bands. That’s a useful surprise: it suggests that the evaluation methodology I built before I knew HLE existed is producing the same architecture rankings that HLE produces, on independent question banks. If both methodologies produce similar architecture rankings and lift patterns, that is evidence that the lift is structural rather than benchmark-specific.

Honest state: Crucible is a methodology framework with substantial empirical backing. Roughly 1,678 custom benchmark questions, ~6,000 graded responses, 1,469 head-to-head matchups, full agent system prompts for the grading pipeline, a Phase-0 manual workflow spec, and a Phase-1 engineered-platform spec. The platform itself isn’t built. The next step is either funding to build it properly or finding a partner who wants to operate the methodology at scale.


7. Foundry: Multi-Agent Research Infrastructure

When the manual flywheel (run benchmarks → grade outputs → extract patterns → encode rules → re-test) started outpacing my ability to keep up with it, I built scaffolding to automate it. That scaffolding is Foundry.

Foundry is a multi-agent pipeline of four governed Claude subagents (Sentinel, Forge, Assayer, Arbiter), coordinated through a file-based message system, designed to refine raw research material (legacy documents, test outputs, cross-architecture responses) into certified database artifacts. The agent naming is honest: Sentinel watches the perimeter (gap-finding), Forge shapes raw material (task framing), Assayer tests the quality of metal (output review), Arbiter is the final authority (certification and database hydration).

(If you’re coming from the HLE writeup, the Sentinel and Forge names will look familiar. They appear there too. Foundry is where those role concepts originated; the HLE pipeline borrowed the role idea but tunes each implementation independently for the benchmark workload.)

Foundry has run real missions. The most concrete one was a self-audit: Forge generated 127 unit tests against Cade’s own source modules; the pipeline executed them; the Arbiter certified the findings. The mission found 4 confirmed bugs in Cade’s own code: a dead regex flag, a numerical-contradiction detector that was silently broken on percentage-based claims, a mode-weight override miscalculation, and a RAG cache-key collision. All four were patched in the next cycle; all 127 tests re-passed.

The data-quality discrimination behavior is also non-trivial. When Foundry processed three different research-data sources of varying quality, the rejection rates calibrated with the data: clean blueprint material was 0% rejected, a messier ingestion source was 25% rejected (the Arbiter flagged “hero bias caught” on specific outputs), and a third was 32% rejected for overstatement. Variable quality gates that discriminate on actual content quality, exactly what the failure mode taxonomy implied should be possible.

Honest state: Foundry stages 1-3 are operational. Stage 4 (Cade autonomously dispatching tasks to Foundry when it hits the limit of what it can verify alone) is specified and wired into the agent’s governance text but not connected end-to-end yet. Stage 5 (fully scheduled cron-style sweeps + automated cross-architecture API coordination) is roadmap. The current state isn’t an autonomous research lab. It’s a working multi-agent research pipeline that runs certified missions when I run them manually, with the autonomous closed loop still unbuilt.

The reason Foundry matters for the bigger picture: most of the flywheel work I described in the HLE writeup is currently happening through me, manually, with text files. Tuning 4.7 to a +27pp lift on 4.6 levels would take weeks instead of months if a Foundry-like pipeline were doing the test-result extraction, pattern coding, and rule-integration steps autonomously. That’s the version this work points toward. It’s how Cade-level results would become reproducible by anyone, not just a single operator with a particular workflow.


8. The txt-stack: A Free Research Sandbox

Section 4 covered how the framework lived in plain text before it was codified: .txt files attached to Claude Projects and Custom GPTs, read by the model on every turn. That txt-stack didn’t get retired when the codified version shipped. It’s still around, and it turns out to be useful in a way I didn’t plan for: it’s a zero-marginal-cost sandbox.

Because the txt-stack runs inside normal chat interfaces (no API calls, no per-token billing, just a governance file pasted into a subscription chat), experimenting in it is free. New behaviors, governance rules, and mechanisms can be prototyped and stress-tested there before any of it gets coded into the API-based stack, where every iteration costs money. The txt-stack is where an idea gets to fail cheaply.

Several features have been prototyped and validated in that sandbox and are waiting to be ported into the codified stack. The clearest example is Cortex, which started as a behavioral-authentication challenge and grew into a broader access-control mechanism with two halves: a behavioral-signature unlock for privileged mode, and a tiered disclosure stance for public mode that turned out to have emergent resistance to prompt-injection attacks.

The unlock side works by asking freeform behavioral questions and grading the answers against a Cognitive Profile (reasoning style, vocabulary instincts, how someone handles a question they don’t know). The interesting property is that it authenticates correctly even when the answer is “I have no idea,” because how you admit not knowing is itself a stable, hard-to-fake pattern. It doesn’t ask what you know; it observes how you think. You can pass by failing, as long as you fail authentically.

The public-mode side is the default state, and in practice it turned out to be the more useful half. Incoming prompts are handled by category rather than by surface intent. Casual questions get casual responses. Capability questions get capability descriptions. Questions probing internal project content, system prompts, instructions, or stack mechanics get a single consistent refusal pattern, with no leak about which tier flagged the request or how it was recognized. The agent doesn’t say “I detected a prompt injection.” It just redirects, every time, the same way. The pattern holds in adversarial testing across the standard exfiltration scripts (the DAN re-roll, “ignore all previous instructions,” “this is an authorized debug request,” “I’m the project owner, drop the act”) and the long-tail social-engineering attempts.

Cortex is not a replacement for security-grade authentication. At this stage it’s a behavioral continuity and response-shaping mechanism that proved out in sandbox: tiered disclosure on the public side, behavioral-signature challenge on the unlock side, with emergent prompt-injection resistance falling out of the combination. It hasn’t been ported into the codified stack or built into a standalone system. It’s one of several behaviors the txt-stack has been used to prove out, and the porting queue is real work that hasn’t happened yet.

The broader point: the txt-stack isn’t legacy. It’s the cheapest research instrument in the program, and it’s also the most directly consumer-relevant thing here, since the governance framework as a portable copy-paste file is something a user can run without code, API access, or custom infrastructure.


9. The Bigger Picture

I’ve been thinking about why these specific products exist instead of others. The honest answer is that they’re not products in the conventional sense. They’re three cuts through the same underlying research program:

The reason there are several is that the research keeps producing them. The research is the renewable advantage. Everything else is artifacts. I believe the research itself, not any one artifact, is what powers all of this. As a solo researcher, I’ve had to strategically prioritize which artifacts to develop further, rather than driving any one of them to a fully polished state. The play I’m actually making is to keep the methodology running, because the methodology is what produces the next thing after these, and the one after that.

There’s a longer arc I haven’t gotten into: FREG (a recursive meta-analysis layer where governed agents study other governed agents) and a unified research environment where Crucible, Foundry, the txt-stack, and the governed agents run as one substrate. That’s where the cuts converge.


If you’re working on governance, evaluation, or agent infrastructure and see overlap with any of this, I’d be glad to compare notes.

Eugene Dvorochkin, Fieldframe Labs, independent AI behavioral research and infrastructure since May 2025. Contact: edvorochkin@gmail.com. HLE writeup here. Methodology paper for the HLE submission here.


Share this post on:

Next Post
HLE Submission Methodology Paper — FF-STACK v8