autoresearch v1.1: closing the propose-apply-measure loop

Categories: LLM, agents, Python, optimization, opensource

What changed in autoresearch v1.1: the loop now drives a headless coding agent on every iteration, removing the human step that used to sit between propose and measure. The post covers an illustrative TSP run, the two permission postures, and the plug-in critic backends shipped in the same release.

Author: Pedro Carvalho Brom

Published: June 4, 2026

Code repository: pcbrom/autoresearch

The original autoresearch pattern proposed by Andrej Karpathy is a small loop with a big claim: hand an LLM critic a single mutable file, a metric, and a time budget, and the critic can propose useful edits one iteration at a time. The catch, in the form the pattern shipped, was that a person still had to apply the proposed edit before the loop could measure the next result. The cycle had a hole in the middle.

autoresearch v1.1 closes that hole. The loop now calls a headless coding agent on every iteration, hands it the critic’s proposal, and waits for the agent to rewrite the mutable file before measuring. The post walks through what changed, what the loop owns, an illustrative run on a small traveling salesman problem, and the plug-in critic backends that ship in the same release. It does not benchmark coders; it shows the shape of the loop and what it produces.

What v1.1 changes

The contract of the loop is the same as before. The runner still needs:

A single metric the runner parses from stdout.
A time budget per iteration so runs are comparable.
A single mutable file the agent edits, plus read-only files that define data, evaluation, and constants.
A git branch dedicated to one session.
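The first two items of the contract can be sketched in a few lines. This is a minimal illustration, not the runner's actual code: the `metric: <number>` stdout format, the function names, and the choice to treat a timeout as a crash are all assumptions made for the example.

```python
import re
import subprocess
import sys
from typing import Optional

# Hypothetical stdout convention: the experiment prints a line like "metric: 5.63".
METRIC_RE = re.compile(
    r"^metric:[ \t]*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)[ \t]*$",
    re.MULTILINE,
)


def parse_metric(stdout: str) -> Optional[float]:
    """Return the last metric value printed to stdout, or None if none was found."""
    matches = METRIC_RE.findall(stdout)
    return float(matches[-1]) if matches else None


def run_once(cmd: list, budget_s: float) -> tuple:
    """Run one experiment under a time budget and return (status, metric)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=budget_s)
    except subprocess.TimeoutExpired:
        # Assumption: exceeding the budget counts as a crash.
        return "crash", None
    if proc.returncode != 0:
        return "crash", None
    metric = parse_metric(proc.stdout)
    return ("ok", metric) if metric is not None else ("crash", None)
```

Fixing the budget per iteration is what makes two runs comparable: a slower edit cannot win simply by running longer.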

What v1.1 adds is one configuration flag, coder.enabled: true. With that flag set, the loop owns the cycle end to end. The critic, an LLM exposed through an OpenAI-compatible endpoint, reads the previous best version and returns a structured JSON object with the fields thought_process, alternatives, hypothesis, code_pseudocode, and risk_level. The loop then prompts the configured coder (Claude Code, Codex, or OpenCode) to apply the proposal to the mutable file. The coder writes the edit, the loop runs the experiment under the time budget, the result is classified as keep, discard, or crash, and the git branch advances or resets accordingly.
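To make the critic's output concrete, here is a hedged sketch of validating a proposal before it reaches the coder. The five field names come from the description above; the `Proposal` class, the `parse_proposal` helper, and the field types are assumptions for illustration and may not match the project's schema.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = (
    "thought_process",
    "alternatives",
    "hypothesis",
    "code_pseudocode",
    "risk_level",
)


@dataclass(frozen=True)
class Proposal:
    # Field types are assumed; only the names are taken from the post.
    thought_process: str
    alternatives: list
    hypothesis: str
    code_pseudocode: str
    risk_level: str


def parse_proposal(raw: str) -> Proposal:
    """Parse the critic's JSON reply and reject replies with missing fields."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("proposal must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"proposal is missing fields: {missing}")
    return Proposal(**{name: data[name] for name in REQUIRED_FIELDS})
```

Rejecting malformed replies at this boundary means the coder is never prompted with a half-formed proposal.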

Ownership stays with the loop. The agent only ever edits the mutable file. Commits, branch state, evaluation, and the keep-or-discard decision are the loop’s. If the agent fails to produce a valid edit, the loop records the failure and falls back to passive mode, waiting for a person to step in.
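The keep-or-discard decision the loop owns can be illustrated as a pure function. This is a sketch under assumptions: the `minimize` parameter, the function names, and the specific git commands are hypothetical choices for the example, not the loop's confirmed behavior.

```python
from typing import Optional


def classify(metric: Optional[float], best: float, minimize: bool = True) -> str:
    """Classify one iteration as keep, discard, or crash."""
    if metric is None:
        # No metric could be parsed, so the experiment is treated as crashed.
        return "crash"
    improved = metric < best if minimize else metric > best
    return "keep" if improved else "discard"


def git_action(outcome: str) -> list:
    """Map an outcome to a git command (assumed mapping, for illustration)."""
    if outcome == "keep":
        # Advance the session branch with the agent's edit.
        return ["git", "commit", "-am", "autoresearch: keep iteration"]
    # Discard or crash: drop the edit and return to the previous best.
    return ["git", "reset", "--hard", "HEAD"]
```

Because the agent never runs these commands itself, a bad edit can be undone by the loop without the agent's cooperation.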

Two permission postures

The configuration exposes two postures for the coding agent. The first restricts the agent to edits of the mutable file only; no filesystem writes outside that path, no shell commands. This is the default, and it is enough to drive the loop with a small blast radius: the worst case of an agent mistake is a single file the loop already version-controls. The second posture allows broader autonomy, including the use of arbitrary tools and helper files. It is useful when the optimization needs the agent to introspect the environment, but it should only be selected in trusted environments and by explicit configuration.
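One way to picture the restrictive posture is as a path check applied to every write the agent attempts. The posture names `restricted` and `autonomous` and the `write_allowed` helper are hypothetical; in practice the enforcement presumably lives in the coding agent's own permission settings.

```python
from pathlib import Path


def write_allowed(target: str, mutable_file: str, posture: str = "restricted") -> bool:
    """Return True if a write to `target` is permitted under the given posture."""
    if posture == "autonomous":
        # Broader posture: helper files and other paths are allowed.
        return True
    if posture != "restricted":
        raise ValueError(f"unknown posture: {posture!r}")
    # Restricted posture: only the mutable file itself may be written.
    # resolve() normalizes "./" and "../" so path tricks do not slip through.
    return Path(target).resolve() == Path(mutable_file).resolve()
```

Comparing resolved paths rather than raw strings is the point of the sketch: `./solver.py` and `solver.py` are the same file, while `../solver.py` is not.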

Both postures share the same fallback: if the agent crashes, times out, or returns a proposal that cannot be applied, the loop logs the failure and goes back to passive waiting. A failed agent call does not compromise the experiment. The worst case is a lost iteration.

An illustrative run, 50-city TSP

To show the loop in motion, v1.1 ships an example that runs a small traveling salesman problem. The starting state is a random tour of 50 cities with a length of 25.25. The mutable file is a Python function that returns a tour. The metric is tour length, and lower is better.

In a three-iteration run with the default backend (a Gemma 3 family model served by a local Ollama instance), the critic proposed 2-opt in the first iteration, and the coder applied it. In the second iteration the critic proposed simulated annealing on top of 2-opt, and the coder applied that as well. The third iteration tightened the annealing schedule. The best tour landed at 5.63, about 78% shorter than the random starting tour, with no human intervention between the call that started the loop and the final number.
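For readers unfamiliar with 2-opt, the following is a textbook sketch of the technique. It is not the code the coder wrote during the run; `tour_length` and `two_opt` are names chosen for this example, and cities are assumed to be (x, y) tuples.

```python
import math


def tour_length(tour, cities):
    """Length of the closed tour that visits cities in the given order."""
    return sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def two_opt(tour, cities):
    """Reverse segments of the tour for as long as a reversal shortens it."""
    best = list(tour)
    best_len = tour_length(best, cities)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                # Reversing best[i..j] swaps two edges and can remove a crossing.
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                cand_len = tour_length(candidate, cities)
                if cand_len < best_len - 1e-12:
                    best, best_len = candidate, cand_len
                    improved = True
    return best
```

Calling `two_opt(list(range(len(cities))), cities)` returns a permutation of the same cities that is never longer than the input tour. This version recomputes the full tour length for every candidate, which keeps it simple and is affordable at 50 cities; a faster variant would compute only the change in the two swapped edges.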

The TSP example is illustrative, not a benchmark. The same tour length is reachable by a careful person running the same prompts manually. What the run shows is that the cycle no longer has the human-in-the-middle step. Whether the loop is useful on a given problem is decided by the contract (single metric, mutable file, read-only constants, time budget) more than by the LLM at the wheel.

Plug-in critic backends

The other deliverable of v1.1 is decoupling the critic from any single provider. A provider field in problem.yaml selects the backend without touching loop code. Five endpoints are supported in the release:

ollama, for any local model that honors JSON Schema responses (Gemma 3 and 4, Qwen 2.5, Llama 3, and so on).
vllm, for high-throughput local inference.
llama.cpp, for portable CPU or hybrid CPU/GPU serving.
openrouter, for routing across multiple hosted providers behind one API key.
openai, for the canonical OpenAI Chat Completions endpoint.

All five share an OpenAI-compatible interface. Provider credentials resolve from environment variables. The same loop runs locally, in the cloud, or in a hybrid setup without any code change.
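Because every backend speaks the same OpenAI-compatible protocol, selecting one reduces to picking a base URL and, for hosted providers, a credential. The sketch below illustrates that idea only. The base URLs are the usual defaults for each server, the environment variable names are assumptions, and `resolve_backend` is a hypothetical helper, not the project's API.

```python
import os
from typing import Mapping

# provider -> (base URL, environment variable holding the API key or None)
# URLs are common defaults; a real deployment may use other hosts or ports.
PROVIDERS = {
    "ollama": ("http://localhost:11434/v1", None),
    "vllm": ("http://localhost:8000/v1", None),
    "llama.cpp": ("http://localhost:8080/v1", None),
    "openrouter": ("https://openrouter.ai/api/v1", "OPENROUTER_API_KEY"),
    "openai": ("https://api.openai.com/v1", "OPENAI_API_KEY"),
}


def resolve_backend(provider: str, env: Mapping[str, str] = os.environ) -> dict:
    """Return the base URL and API key for a provider name."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    base_url, key_var = PROVIDERS[provider]
    if key_var is None:
        # Local servers typically accept any placeholder key.
        return {"base_url": base_url, "api_key": "not-needed"}
    api_key = env.get(key_var)
    if not api_key:
        raise RuntimeError(f"set {key_var} to use provider {provider!r}")
    return {"base_url": base_url, "api_key": api_key}
```

Keeping the mapping in data rather than in branches of loop code is what lets a new backend be added without touching the loop.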

Trade-offs to read with eyes open

Closing the loop has costs. With the coder in the cycle, an iteration costs more wall-clock time than a pure LLM call, because the agent reads, plans, and writes before the experiment runs. The token cost is also non-trivial when the agent is a paid endpoint: a 10-iteration session can use several thousand tokens of agent context per iteration on top of the critic’s. Budget accordingly.

A second cost is observability. With the agent applying changes, the diff between iterations is the agent’s interpretation of the critic’s proposal, not the critic’s text directly. The loop logs both, but a careful operator should read both, especially in early iterations of a new problem, before trusting the loop to run overnight.

How to try it

git clone https://github.com/pcbrom/autoresearch
pip install -e ./autoresearch
ollama pull gemma3:12b
cd autoresearch/examples/tsp
autoresearch run --problem problem.yaml

The repository ships three examples: TSP, XGBoost hyperparameter tuning, and a multi-metric scalarization problem. Each one has a sample_run/results.tsv that records what a real session produced.

Citation

Brom, P. C. (2026). autoresearch v1.1. Zenodo. DOI 10.5281/zenodo.20544957. License MIT.