Fieldframe Labs

Independent AI behavior research and reasoning-governance infrastructure since May 2025.

Latest result: 51.85% on Humanity's Last Exam — full 2,158-question text-only set, single workstation, limited budget.

What this is

Fieldframe Labs is an independent AI behavior research program. Since May 2025, the work has studied how large language models reason across architectures — Claude, GPT, Gemini, Grok — through long, open-ended empirical dialogue rather than predetermined prompting, with the models serving as both experimental subjects and constrained reasoning partners.

The methodology is empirical and recursive: observe a failure pattern, encode the countermeasure, test it across architectures, iterate, and integrate. Over time the loop produced more artifacts than the original research question called for: a governance framework (FF-STACK), a codified research agent (Cade, short for Cadence), a custom evaluation methodology (Crucible), a multi-agent research pipeline (Foundry), and a behavioral authentication system (Cortex). None of these were planned in advance; each emerged from the same loop.

The HLE submission above is the most public-facing result of that work. The research program underneath it has produced a body of cross-architecture data and codified patterns.

Current status

Cade / FF-STACK — production local agent
HLE v8 — submitted, text-only run
Methodology + writeups — published here
Framework source — closed
Predictions JSON — available for reviewer verification
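
For reviewers working from the predictions JSON, the check below is a minimal sketch of how a headline accuracy could be recomputed. The file layout is an assumption, not the actual schema: it supposes a JSON object mapping each question ID to a record with a boolean `correct` field, and the function names are hypothetical.

```python
import json


def load_records(path):
    """Load a predictions file (assumed layout: {question_id: {"correct": bool}})."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def accuracy_percent(records):
    """Percentage of records whose (hypothetical) 'correct' flag is True."""
    if not records:
        raise ValueError("no records to score")
    n_correct = sum(1 for rec in records.values() if rec.get("correct") is True)
    return 100.0 * n_correct / len(records)


def matches_reported(records, reported=51.85, expected_total=2158):
    """True if the record count and the two-decimal accuracy match the stated figures."""
    if len(records) != expected_total:
        return False
    return abs(accuracy_percent(records) - reported) < 0.005


# Tiny in-memory example standing in for a real file.
sample = {
    "q1": {"correct": True},
    "q2": {"correct": False},
    "q3": {"correct": True},
    "q4": {"correct": False},
}
print(accuracy_percent(sample))  # 50.0
```

With a real file, `matches_reported(load_records(path))` would confirm that both the question count and the rounded percentage agree with the published numbers. The field names would need adjusting to whatever the released JSON actually contains.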

Published research

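Three writeups are available now, with more to come:

51.85% on Humanity's Last Exam
HLE Submission Methodology Paper
The Research Behind the HLE Score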

About Eugene

Eugene Dvorochkin — independent AI researcher. No PhD, no lab affiliation, no funded position. The work runs on evening time, public APIs, and a year of empirical cross-architecture engagement with frontier LLMs. Contact: edvorochkin@gmail.com. More on the about page.

More to come

This is a year's worth of research that I'm finally starting to publish, and it will take time to get all of it out. I started with the latest concrete result — the HLE submission — and pulled some highlights from the broader work into the research post. The other products (Crucible, Foundry, Cortex, plus several behavior-research findings that haven't surfaced yet) will get their own writeups as I work through them.

If you're interested, follow along — either on this site or on Medium at @edvorochkin. New work will appear in both places as I get to it.
