Fieldframe Labs
Independent AI behavior research and reasoning-governance infrastructure since May 2025.
Latest result: 51.85% on Humanity's Last Exam — full 2,158-question text-only set, single workstation, limited budget.
What this is
Fieldframe Labs is an independent AI behavior research program. Since May 2025, the work has studied how large language models reason across architectures — Claude, GPT, Gemini, Grok — through long, open-ended empirical dialogue rather than predetermined prompting, with the models serving as both experimental subjects and constrained reasoning partners.
The methodology is empirical and recursive: observe a failure pattern, encode the countermeasure, test it across architectures, iterate, and integrate. Over time the loop produced more artifacts than the original research question called for: a governance framework (FF-STACK), a codified research agent (Cade, short for Cadence), a custom evaluation methodology (Crucible), a multi-agent research pipeline (Foundry), and a behavioral authentication system (Cortex). None of these were planned at the outset; each emerged from the same loop.
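The loop is concrete enough to sketch. The Python below is a minimal illustration only; every name in it (Countermeasure, run_probe, ARCHITECTURES) is hypothetical and invented for this sketch, since the actual framework source is closed (see status below).

```python
# Minimal sketch of the observe -> encode -> test -> integrate loop.
# All names here are hypothetical; this is not Fieldframe's actual code.
from dataclasses import dataclass, field

ARCHITECTURES = ["claude", "gpt", "gemini", "grok"]

@dataclass
class Countermeasure:
    failure_pattern: str   # the observed reasoning failure
    rule: str              # the governance rule encoded against it
    passes: dict[str, bool] = field(default_factory=dict)  # per-architecture results

def research_loop(observed_failures: list[str], run_probe) -> list[Countermeasure]:
    """One pass: encode each observed failure, test everywhere, integrate what holds."""
    integrated = []
    for pattern in observed_failures:
        cm = Countermeasure(failure_pattern=pattern,
                            rule=f"guard against: {pattern}")
        # Test the countermeasure on every architecture, not just the one
        # that surfaced the failure.
        cm.passes = {arch: run_probe(arch, cm.rule) for arch in ARCHITECTURES}
        if all(cm.passes.values()):
            integrated.append(cm)  # holds cross-architecture: fold into the stack
        # otherwise: iterate, refine the rule, re-test on the next pass
    return integrated
```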
The HLE submission above is the most public-facing result of that work. The research program underneath it has produced a body of cross-architecture data and codified patterns.
Current status
- Cade / FF-STACK — production local agent
- HLE v8 — submitted, text-only run
- Methodology + writeups — published here
- Framework source — closed
- Predictions JSON — available for reviewer verification
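The predictions JSON in the last item is the artifact a reviewer would actually open. Its schema isn't documented on this page, so the shape assumed below, one record per question id with an answer and a confidence on a 0-1 scale, is purely illustrative; a reviewer-side sanity check under that assumption might look like:

```python
# Sanity-check a predictions file. The {question_id: {"answer", "confidence"}}
# shape and the 0-1 confidence scale are assumptions for illustration,
# not the documented schema.
import json

EXPECTED_QUESTIONS = 2158  # full text-only HLE set, per the headline result

def check_predictions(path: str) -> None:
    with open(path) as f:
        preds = json.load(f)
    assert len(preds) == EXPECTED_QUESTIONS, (
        f"expected {EXPECTED_QUESTIONS} records, got {len(preds)}")
    for qid, record in preds.items():
        assert "answer" in record, f"{qid}: missing answer"
        conf = record.get("confidence")
        assert conf is None or 0.0 <= conf <= 1.0, f"{qid}: confidence out of range"
    print(f"{len(preds)} predictions, shape OK")

# check_predictions("predictions.json")  # hypothetical filename
```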
Published research
Three writeups available right now. More to come.
- 51.85% on Humanity's Last Exam
  Narrative writeup of the FF-STACK v8 submission. What the architecture looks like at the box level, where the lift came from across model generations, what the score reveals about governance-as-scaffolding, and where the next version goes. Best entry point if you came from the leaderboard.
- HLE Submission Methodology Paper
  Formal methodology document supporting the leaderboard submission. Includes the cross-generation ablation table, filtering policy with audit basis, calibration disclosure, and full reproducibility data. The reviewer-facing version of the HLE writeup.
- The Research Behind the HLE Score
  Deep dive into the year-long cross-architecture behavior research program. How the patterns were discovered, what failure modes the work catalogues, the four other products in the Fieldframe ecosystem (Crucible, Foundry, Cortex, and the txt-stack lineage), and where the program goes next.
About Eugene
Eugene Dvorochkin — independent AI researcher. No PhD, no lab affiliation, no funded position. The work runs on evening time, public APIs, and a year of empirical cross-architecture engagement with frontier LLMs. Contact: edvorochkin@gmail.com. More on the about page.
More to come
This is a year's worth of research that I'm finally starting to publish, and it will take time to get all of it out. I started with the latest concrete result, the HLE submission, and pulled some highlights from the broader work into the research post. The other products (Crucible, Foundry, Cortex), plus several behavior-research findings that haven't surfaced yet, will get their own writeups as I work through them.
If you're interested, follow along — either on this site or on Medium at @edvorochkin. New work will appear in both places as I get to it.