2026

ARC-AGI agent workbench

Independent agent research project

Built a modular experimentation framework for ARC-style agents, with pluggable bots, benchmark tooling, and prediction-based evaluation.

Public research repo
Python · uv · pytest · ARC toolkit · requests

Created a shared framework for running and comparing multiple agent architectures on the same tasks.
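A shared framework like this usually hinges on one small agent interface plus a registry, so new architectures can be dropped in and benchmarked side by side. The sketch below is illustrative; the class and registry names are assumptions, not the repo's actual API.

```python
import random
from typing import Protocol

class Agent(Protocol):
    """Minimal interface every pluggable bot implements (illustrative)."""
    name: str
    def act(self, board: list[list[int]]) -> str: ...

# Registry keyed by agent name so the benchmark runner can iterate over
# every registered architecture on the same tasks.
AGENTS: dict[str, Agent] = {}

def register(agent: Agent) -> None:
    AGENTS[agent.name] = agent

class RandomAgent:
    """Baseline bot: picks a uniformly random move each step."""
    name = "random"

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)

    def act(self, board: list[list[int]]) -> str:
        return self._rng.choice(["up", "down", "left", "right"])

register(RandomAgent())
```

With this shape, comparing architectures reduces to looping over `AGENTS.values()` and running each one against the same task set.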

Added structured board and transition summaries so agents could reason over interpretable state descriptions.

Measured strong short-horizon prediction accuracy while documenting the gap to actual task completion.

Context

This project is a research workbench for experimenting with ARC-style game-playing agents. It is not a finished solver. The goal is to make it easy to compare agent designs, inspect behavior, and produce reproducible artifacts while iterating.

What I built

I set up the repo as a shared framework rather than a one-off bot.

A captured ARC board state from the evaluation loop, snapshotted during agent debugging before the agent acts.

Technical ideas

One part I particularly liked was the evaluation design. Instead of only asking whether a level was solved, I added a prediction-first loop that scores agents on exact next-state matches and per-cell accuracy.
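A prediction-first scorer of this kind can be very small: compare the predicted next state to the observed one, and report both an exact-match flag and a per-cell accuracy. This is a minimal sketch of the idea; the function name and return shape are assumptions, not the repo's actual code.

```python
def score_prediction(predicted: list[list[int]],
                     actual: list[list[int]]) -> dict:
    """Score a next-state prediction against the observed state.

    Returns an exact-match flag and the fraction of cells predicted
    correctly, assuming both grids share the same dimensions.
    """
    total = sum(len(row) for row in actual)
    correct = sum(
        p == a
        for p_row, a_row in zip(predicted, actual)
        for p, a in zip(p_row, a_row)
    )
    return {"exact": predicted == actual, "cell_acc": correct / total}
```

Scoring per cell is what surfaces the gap the project documents: an agent can sit at high `cell_acc` while `exact` matches, and actual task completion, remain rare.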

The repo also builds structured board summaries using connected components, dominant colors, changed regions, and transition diffs so agents can reason about richer descriptions than raw grids alone.
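Two of those summary ingredients are straightforward to sketch: connected components via flood fill, and a transition diff listing the cells that changed between states. The helpers below are illustrative sketches, not the repo's actual implementation.

```python
from collections import deque

def connected_components(grid: list[list[int]]) -> list[dict]:
    """Find 4-connected same-color regions, treating color 0 as background."""
    seen: set[tuple[int, int]] = set()
    comps = []
    h, w = len(grid), len(grid[0])
    for r in range(h):
        for c in range(w):
            if grid[r][c] == 0 or (r, c) in seen:
                continue
            color, size = grid[r][c], 0
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:  # BFS flood fill over same-colored neighbors
                y, x = queue.popleft()
                size += 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and (ny, nx) not in seen
                            and grid[ny][nx] == color):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            comps.append({"color": color, "size": size})
    return comps

def transition_diff(before: list[list[int]],
                    after: list[list[int]]) -> list[tuple]:
    """List (row, col, old_color, new_color) for every changed cell."""
    return [
        (r, c, b, a)
        for r, (b_row, a_row) in enumerate(zip(before, after))
        for c, (b, a) in enumerate(zip(b_row, a_row))
        if b != a
    ]
```

Summaries built from outputs like these give an agent named regions and explicit deltas to reason over, rather than a flat grid of integers.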

Result

The current state is intentionally honest: it is a research prototype, not a solved benchmark. Some agents already make strong short-horizon predictions, but turning that into reliable task completion is still the hard part.

That gap is part of the project's value: it captures the infrastructure and evaluation work needed before stronger agent behavior can emerge.

A model-generated prediction frame from the same debugging workflow.

The framework records predicted states as artifacts so agent reasoning can be evaluated directly instead of only looking at success or failure.
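Recording predictions as artifacts can be as simple as writing one JSON file per step, pairing the predicted state with the observed one so runs can be re-scored or inspected later. This is a minimal sketch of that idea; the directory layout and field names are assumptions.

```python
import json
import time
from pathlib import Path

def record_prediction(run_dir: str, step: int,
                      predicted: list[list[int]],
                      actual: list[list[int]]) -> Path:
    """Persist one step's prediction next to the observed state.

    Keeping both in the artifact means the agent's reasoning can be
    evaluated directly, not just its final success or failure.
    """
    out = Path(run_dir)
    out.mkdir(parents=True, exist_ok=True)
    artifact = {
        "step": step,
        "timestamp": time.time(),
        "predicted": predicted,
        "actual": actual,
        "exact_match": predicted == actual,
    }
    path = out / f"step_{step:04d}.json"
    path.write_text(json.dumps(artifact, indent=2))
    return path
```

Because each artifact is plain JSON, a later pass can reload a whole run and compute aggregate prediction metrics without re-executing the agent.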

Why this framing matters

I built the evaluation loop this way because success or failure alone hides too much information. Prediction quality, transition summaries, and saved artifacts make it much easier to see whether an agent is learning the environment or just stumbling through it.

Reach out

Want more detail than I can share publicly?

I'm happy to walk through the architecture, tradeoffs, and implementation details in a private conversation.