Autobrowse: Solving Browser Agents' Amnesia Problem

Kyle Jeong/x.com/@kylejeong/2026-05

Browser agents have an amnesia problem. They re-discover every site from scratch on every run, paying the full discovery tax forever. Autobrowse fixes that by letting an agent iterate on a real task until it converges, then graduating the winning approach into a durable, reusable skill.

The genius without a hippocampus

If you've shipped a browser agent into production, you already know the shape of this problem. The first run on a new site is exciting. The agent wanders around, figures out the page, eventually completes the task. The second run looks almost identical. The hundredth run is depressing. By then you've paid for the same exploration a hundred times, the cost graph is a straight line going up, and you still don't have a clean artifact you can hand to a teammate. Real sites are messy. They render differently for different user agents, gate content behind JavaScript, hide the data you actually want behind an undocumented JSON endpoint. A generic agent loop copes with all of that fluently in the moment, then forgets everything once the session closes.

What is Autobrowse?

Autobrowse is a workflow that uses AI to improve AI. You give an agent a real task on a real site. It runs the task end to end, studies the trace it produced, iterates on its strategy, and keeps going until the workflow becomes reliable rather than lucky. Once it converges, it graduates the winning approach into a reusable skill: a markdown file plus the deterministic glue (CLI calls, fetches, selectors, helper scripts) needed to repeat the job.

The learning loop

The core loop:

(1) Objective: hand the agent a real task on a real site.
(2) Run: let the agent attempt the task end to end against a live browser.
(3) Study: the agent reads its own trace, noting where it stalled, guessed, or spent tokens it didn't need to.
(4) Strategy: the agent maintains a strategy.md scratchpad where observations compound across iterations.
(5) Iterate: refine the strategy, drop steps that didn't pull weight, lean on deterministic helpers.
(6) Converge: once consecutive iterations stop yielding meaningful improvements, short-circuit.
(7) Graduate: write out a SKILL.md plus helper files.
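The loop can be sketched in a few lines of Python. This is a minimal illustration rather than the actual Autobrowse harness: RunResult, LoopState, converged, learning_loop, and the patience and min_gain thresholds are all hypothetical names and values, and the in-memory strategy list merely stands in for strategy.md.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunResult:
    """Outcome of one end-to-end attempt (hypothetical shape)."""
    success: bool
    cost_usd: float
    notes: str  # what the agent observed when studying its own trace


@dataclass
class LoopState:
    strategy: List[str] = field(default_factory=list)  # stands in for strategy.md
    history: List[RunResult] = field(default_factory=list)


def converged(history: List[RunResult], patience: int = 2, min_gain: float = 0.05) -> bool:
    """True once the last `patience` transitions each cut cost by less than min_gain."""
    if len(history) < patience + 1:
        return False
    recent = history[-(patience + 1):]
    if not all(r.success for r in recent):
        return False  # a lucky-then-failing workflow is not reliable yet
    for prev, cur in zip(recent, recent[1:]):
        if prev.cost_usd <= 0:
            continue  # nothing left to save on a free run
        gain = (prev.cost_usd - cur.cost_usd) / prev.cost_usd
        if gain >= min_gain:
            return False  # still improving meaningfully
    return True


def learning_loop(run_task: Callable[[List[str]], RunResult], max_iters: int = 10) -> LoopState:
    """Run, study, update the strategy, and stop early on convergence."""
    state = LoopState()
    for _ in range(max_iters):
        result = run_task(state.strategy)    # Run: attempt the task with the current strategy
        state.history.append(result)         # Study: keep the outcome for comparison
        state.strategy.append(result.notes)  # Strategy: observations compound
        if converged(state.history):         # Converge: short-circuit
            break
    return state
```

Here run_task would wrap whatever actually drives the browser; graduating the final strategy into a SKILL.md is left out of the sketch. Requiring several consecutive low-gain, successful runs is one simple way to separate "reliable" from "lucky".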

The output

What comes out the other side is a small, readable markdown file. No transcript, no vector of embeddings, no screenshot reel. Just markdown with frontmatter describing the skill, plus the deterministic glue needed to repeat the job. If the agent discovered an undocumented JSON endpoint, that endpoint is in there. If a particular form needs a small wait before submission, that's in there too.
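To make the shape of such a file concrete, here is a small sketch that renders one. The frontmatter fields (name, description), the section titles, and the render_skill helper are illustrative assumptions, not the actual Autobrowse schema.

```python
from typing import Dict, List


def render_skill(name: str, description: str, steps: List[str], helpers: Dict[str, str]) -> str:
    """Render a skill as markdown with a frontmatter header.

    Field names and section titles are assumptions made for illustration.
    """
    lines = ["---", f"name: {name}", f"description: {description}", "---", "", "## Steps", ""]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    if helpers:
        lines += ["", "## Helpers", ""]
        lines += [f"- `{path}`: {purpose}" for path, purpose in helpers.items()]
    return "\n".join(lines) + "\n"


# Hypothetical skill content, echoing the kinds of findings described above.
skill_text = render_skill(
    name="example-listing-search",
    description="Search listings on an example site and return structured rows.",
    steps=[
        "Call the JSON endpoint observed in network traffic instead of scraping the page.",
        "Wait briefly before submitting the search form.",
    ],
    helpers={"parse_rows.py": "turns the JSON response into rows"},
)
```

The point of the sketch is that everything a later run needs fits in plain text that a person can read, diff, and commit.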

What is it good at?

Autobrowse shines on sites that genuinely require exploration: hidden or undocumented APIs that aren't visible from the rendered page but show up in network traffic; heavy client-side rendering where content only appears after a sequence of interactions; multi-step login or wizard flows where the right path isn't obvious from the first screen; any UI where the shortest reliable path is non-trivial enough that a human reverse-engineering it would take a couple of hours; token-saving opportunities where parts of the loop are redundant.

A concrete benchmark: Craigslist

Traditional Claude Code loop: ~$0.22, ~71s. Graduated Autobrowse skill: ~$0.12, 27s. The shape matters more than the absolute numbers. The first run costs what you'd expect from a generic agent loop. The graduated skill changes the unit economics of every subsequent run (here roughly 45% cheaper and about 2.6x faster), because it encodes the shortest reliable path the agent could find and reuses it instead of re-deriving it. On an early form-fill experiment, cost dropped from $1.40/run to $0.24/run in four iterations, a reduction of about 83%.
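The quoted figures can be turned into relative savings with a few lines of arithmetic. The helper names below are made up for this sketch, and the learning spend passed to the break-even calculation is a hypothetical value, since the article does not report one for Craigslist.

```python
def reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


def break_even_runs(learning_cents: int, before_cents: int, after_cents: int) -> int:
    """Runs needed before per-run savings repay a one-off learning spend.

    Works in integer cents to avoid floating-point rounding surprises.
    """
    saving = before_cents - after_cents
    if saving <= 0:
        raise ValueError("the graduated skill must be cheaper per run")
    return -(-learning_cents // saving)  # ceiling division


craigslist_cost = reduction(0.22, 0.12)   # about 0.45
craigslist_time = reduction(71, 27)       # about 0.62
form_fill_cost = reduction(1.40, 0.24)    # about 0.83

# Hypothetical $5.00 learning spend against the Craigslist per-run costs.
runs_to_repay = break_even_runs(500, 22, 12)  # 50
```

Whatever the learning loop costs up front, it is paid once, while the per-run saving recurs on every later run.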

Where Autobrowse breaks

Autobrowse is genuinely the wrong tool when the task is deterministic parsing. We learned this against a 167-row static HTML state catalog. Four iterations and ~$24 later, the loop still hadn't returned all 167 rows. The model's per-turn output cap kept truncating its reasoning. Once we recognized the regime mismatch, the agent pivoted to ~200 lines of deterministic Python with browse fetch and BeautifulSoup. Sub-second runtime, zero inference cost, all 167 rows surfaced. The lesson: probe with fetch first. If the data comes back cleanly, write the parser. If it's dynamic or gated, escalate to Autobrowse.
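The "probe with fetch first" rule can be sketched as a small check on the raw HTML. This is an illustration under stated assumptions: it uses Python's standard-library html.parser rather than the browse fetch and BeautifulSoup combination mentioned above, it takes already-fetched HTML as input, and RowCounter and choose_approach are hypothetical names.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Counts <tr> elements that contain at least one <td> cell."""

    def __init__(self) -> None:
        super().__init__()
        self.rows = 0
        self._in_row = False
        self._row_has_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._in_row, self._row_has_cell = True, False
        elif tag == "td" and self._in_row:
            self._row_has_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._in_row:
            if self._row_has_cell:
                self.rows += 1
            self._in_row = False


def choose_approach(html: str, expected_rows: int) -> str:
    """Return 'parser' if the static HTML already holds the data, else 'autobrowse'."""
    counter = RowCounter()
    counter.feed(html)
    counter.close()
    return "parser" if counter.rows >= expected_rows else "autobrowse"


# A static table with all rows present versus an empty client-rendered shell.
static_page = "<table>" + "".join(f"<tr><td>{i}</td></tr>" for i in range(167)) + "</table>"
js_shell = '<div id="root"></div><script src="app.js"></script>'
```

With these inputs, choose_approach(static_page, 167) picks the parser and choose_approach(js_shell, 167) escalates. A real probe would likely check more than a row count, but even a check this cheap would flag a static catalog before any inference spend.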

Why this changes workflows

A skill is legible, durable, debuggable, human-auditable, and ownable. An engineer can read it, edit it, and commit it. A non-engineer can also read it and roughly understand what the agent is doing without ever touching code. We go from 'just trust the agent's output' to 'read the agent's playbook.' The compounding effect matters too. Each new site an agent encounters yields one more durable skill. The library grows. The agent gets cheaper and faster on the long tail of repetitive workflows because it stops paying the discovery tax.

What we're working on next

Smarter stopping: letting the agent reason about its own convergence more explicitly, comparing not just cost and turns but the structure of its trace across runs. Better priors about how to explore: making sure the agent reaches for fetch and search primitives before spawning a full browser session. Recursive Autobrowse: using Autobrowse to graduate improvements to its own harness.

The bigger picture

A dominant story about browser agents right now is that they'll get good when the underlying models get good. We don't entirely buy that. Even a perfect model still has to discover (on every new site) what a perfect model would already know if it had been there before. Without a place to put what the agent learns, every run is a fresh start. The real bottleneck is memory, in a form humans and agents can both understand and trust.