Fieldframe Labs
Latest result: 51.85% on Humanity's Last Exam, across the full 2,158-question text-only set.
Independent AI behavioral research and reasoning-governance infrastructure since May 2025.
What this is
Fieldframe Labs is an independent AI behavioral research program. Since May 2025, it has studied how large language models reason across architectures, including Claude, GPT, Gemini, and Grok, through long, open-ended dialogue rather than fixed prompting. The work began as a broad exploration, but narrowed around a central question: can structured constraints on how a model approaches a problem produce reasoning gains that transfer across architectures and model generations?
Over time, the research produced more than that question alone called for: a governance framework (FF-STACK), a codified research agent (Cade, short for Cadence), a custom evaluation methodology (Crucible), and a multi-agent research pipeline (Foundry). None of these were pre-planned. Each emerged from the same research loop as the work became more systematic and repeatable.
The HLE submission is the most public-facing result so far. Beneath it is a broader research program that has produced cross-architecture data, codified patterns, and a methodology for improving reasoning at the inference layer.
Why it matters
Most frontier AI progress has been driven by scale: larger models, more compute, and more data. The success of that approach is undeniable, as the rapid advance of LLMs has shown. Fieldframe poses a different question: how much reasoning capability remains untapped at the inference layer? The working answer is that structuring how a model approaches a problem, including how it weighs evidence, verifies claims, manages uncertainty, and reconciles disagreement, can recover reasoning gains that transfer across architectures and model generations without modifying the underlying weights.
The HLE submission is the proof point: 51.85% on the full text-only set, at roughly $1.60 per question. Cost is flagged because reasoning gains that require a GPU cluster or proprietary fine-tuning are difficult to generalize; gains that show up at near-baseline API cost are portable. If a method is real and architecture-agnostic, it should produce lift without making the economics impractical.
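For a rough sense of what those figures imply, the arithmetic can be sketched in a few lines of Python. This is a back-of-envelope sketch, not data from the submission: the only inputs are the three numbers quoted on this page (51.85%, 2,158 questions, roughly $1.60 per question), the function names are hypothetical, and the count of correct answers is back-computed from the percentage rather than taken from the leaderboard.

```python
# Figures quoted on this page. The per-question cost is "roughly $1.60",
# so everything derived from it is approximate.
TOTAL_QUESTIONS = 2158
ACCURACY = 0.5185
COST_PER_QUESTION_USD = 1.60


def implied_correct(total: int, accuracy: float) -> int:
    """Back-compute the number of correct answers (an inference, not reported data)."""
    return round(total * accuracy)


def run_cost(total: int, cost_per_question: float) -> float:
    """Approximate cost of one full pass over the question set."""
    return total * cost_per_question


def cost_per_correct(total: int, accuracy: float, cost_per_question: float) -> float:
    """Approximate spend per correctly answered question."""
    return run_cost(total, cost_per_question) / implied_correct(total, accuracy)


if __name__ == "__main__":
    correct = implied_correct(TOTAL_QUESTIONS, ACCURACY)
    total_usd = run_cost(TOTAL_QUESTIONS, COST_PER_QUESTION_USD)
    per_correct = cost_per_correct(TOTAL_QUESTIONS, ACCURACY, COST_PER_QUESTION_USD)
    print(f"implied correct answers: {correct}")
    print(f"approximate full-run cost: ${total_usd:,.2f}")
    print(f"approximate cost per correct answer: ${per_correct:.2f}")
```

Under those assumptions, the quoted score corresponds to about 1,119 correct answers, and a full pass over the set costs on the order of $3,450 at API prices.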
Fieldframe is a consistent operating environment for observing and structuring language model behavior, with the models themselves as both subject and instrument. The models being evaluated help identify their own failure modes, catalog recurring patterns, and propose countermeasures. That makes the process compounding rather than static. Each new architecture and model generation expands the evidence base and improves the methodology for the next iteration. Crucible (the evaluation methodology) and Foundry (the multi-agent research pipeline) are the first steps toward automating a loop that is still mostly run by hand.
Current status
- Cade / FF-STACK: production local agent
- Leaderboards: Zoom AI HLE, text-only
- Methodology + writeups: published here
- Framework source: closed
- Predictions: scrubbed public file on GitHub; raw available on request
Published research
Three writeups so far. More to come.
- 51.85% on Humanity's Last Exam
  A narrative writeup of the FF-STACK v8 submission: the architecture at a box level, where the lift came from across model generations, and what the score says about governance as scaffolding. Best entry point if you came from the leaderboard.
- HLE Submission Methodology Paper
  Formal methodology document supporting the leaderboard submission. Includes the cross-generation ablation table, filtering policy with audit basis, calibration disclosure, and full reproducibility data. The reviewer-facing version of the HLE writeup.
- The Research Behind the HLE Score
  Deep dive into the year-long cross-architecture behavioral research program. How the patterns were discovered, what failure modes the work catalogues, the rest of the Fieldframe ecosystem (Crucible, Foundry, and the txt-stack sandbox), and where the program goes next.
About
Eugene Dvorochkin is an independent AI researcher. The work runs on evening time, public APIs, and a year of hands-on cross-architecture work with frontier models. Contact: edvorochkin@gmail.com. More on the about page.
More to come
This is a year's worth of research that is finally being published, so it will take time to get it all out. Publication starts with the latest concrete result, the HLE submission, with some highlights from the broader work pulled into the research post. The other products (Crucible, Foundry) and several behavioral-research findings that haven't surfaced yet will get their own writeups over time.
If you're interested, follow along here or on Medium at @edvorochkin. New work will show up in both places.