BenchPark

The place for keeping up with all things AI.

Curated by Thomas Faulds

AI Glossary →

Good Reads

How to Eval AI Agents — The 2026 GuideCurator's Choice

Ben Hylak/howtoeval.com

A practical framework for evaluating AI agents in production. Distinguishes between benchmark-maximizing and floor-raising approaches, arguing most product teams should adopt floor raising — detective work reviewing actual user interactions, identifying failure patterns, and systematically preventing regressions.

aiagentsevaluation

Shipping a Trillion Parameters: Delta Weight Sync in TRL

Amine Dirhoussi et al./Hugging Face Blog

A novel technique for async reinforcement learning that reduces per-step model weight transfer from terabytes to megabytes by exploiting bf16 sparsity. Routes only changed weight elements through Hugging Face Buckets to inference servers, enabling fully disaggregated training across distributed machines.

reinforcement-learningdistributed-trainingoptimization

A Guide to Claude Code 2.0 and Getting Better at Using Coding Agents

Sankalp/Bear Blog

A comprehensive guide exploring Claude Code 2.0 and Opus 4.5, covering the evolution of the tool, practical workflows, and advanced features like sub-agents and context engineering. Shares strategies for maximizing productivity with AI coding agents.

claude-codeagentsworkflows

Claude Code Tips from @claudedevs

Claude Devs/X (Twitter)

Thread from the official Claude Devs account sharing practical tips, patterns, and workflows for getting the most out of Claude Code.

claude-codetipsworkflows

How Hex Evaluates AI Agents for Data Analytics

Hex Team/hex.tech

Data analytics is a uniquely cursed domain for agents. Easy questions look hard, hard questions look easy, many are impossible to answer. Hex built custom eval infrastructure called The Shoebox and an entire fake business to test their agents against realistic data.

evalsdata-analyticsagents

Autobrowse: Solving Browser Agents' Amnesia Problem

Kyle Jeong/x.com/@kylejeong

Browser agents have an amnesia problem. They re-discover every site from scratch on every run, paying the full discovery tax forever. Autobrowse fixes that by letting an agent iterate on a real task until it converges, then graduating the winning approach into a durable, reusable skill.

browser-agentsautomationskills

How to Build a Minimal AI Agent

Kilian Lieret, Carlos Jimenez, John Yang, Ofir Press/minimal-agent.com

A tutorial on building a simple AI agent from scratch for software engineering and terminal tasks. Demonstrates that effective agents don't require complex frameworks: just a loop of prompt, action proposal, execution, and feedback.

agentstutorialcoding

Cool Stuff

{}

chanhdai.com

Pixel-perfect developer portfolio built with Next.js 16, Tailwind v4, and shadcn/ui. Includes a component registry, MDX content system, and clean design engineering. Great reference for modern portfolio sites.

portfolionext.jsdesign
::

auth.md

Open protocol for AI agent registration. Standardized Markdown file hosted at an app's domain that lets agents register users without traditional sign-up forms. Supports agent-verified and user-claimed auth flows.

authenticationagentsprotocol
{}

Pi Dynamic Workflows

Extension that adds dynamic workflow tool to Pi. Model writes JavaScript scripts that fan out tasks across multiple isolated subagents for parallel execution, then consolidates results. Great for codebase audits and large refactors.

workflowsagentsparallel
[]

Browser Use + Terminal

Connect AI agents to the browser. Automate web tasks with natural language through CDP. Also check out their Rust TUI for live browser agent control with task management, screenshots, and 2x cheaper/faster agent loop.

browserautomationai
>>

Compound Engineering Plugin

AI skills and agents that make each unit of engineering work easier than the last. 37 skills and 51 agents for brainstorming, planning, code review, and compounding learnings across Claude Code, Codex, and Cursor.

claude-codeskillsengineering
[]

Pullfrog

Open-source CodeRabbit alternative that runs in GitHub Actions. AI-powered bot that reviews PRs, fixes CI failures, resolves merge conflicts, and runs custom workflows. Model-agnostic and GitHub-native.

githubcode-reviewai
{}

Tracebase

Secure, local-first trace capture and inspection for Codex and Claude agent sessions. Encrypts raw events at rest, builds searchable indexes, and serves a localhost dashboard for debugging agent runs.

observabilityagentsdebugging
>>

Last30Days Skill

AI agent skill that researches topics across Reddit, X, YouTube, TikTok, HN, Polymarket, and GitHub simultaneously, then ranks by real engagement and synthesizes into grounded briefs with citations.

researchmulti-platformagents
[]

Camofox Browser

Stealth headless browser server for AI agents. Wraps the Camoufox Firefox engine with C++ level anti-detection, fingerprint spoofing, and a REST API optimized for agent automation. Idles at ~40MB.

browserstealthagents
{}

DS4 (DwarfStar)

Lightweight inference engine for DeepSeek V4 on personal machines. Metal + CUDA backends, 2-bit quantization, 1M token context, compressed KV cache, and an integrated coding agent.

inferencelocal-llmdeepseek
[]

Roughdraft

Local-first markdown editor for collaborating with AI agents on document review. CriticMarkup for inline comments and edits, MCP integration, no cloud dependency.

markdowneditinglocal-first
>>

Matt Pocock's Engineering Skills

Collection of Claude Code skills for daily dev work: TDD, diagnosing bugs, codebase architecture, triaging issues, prototyping, converting specs to PRDs and GitHub issues, and more.

claude-codeskillstdd
{}

AgentField

Build, run and scale AI agents like APIs. Open-source control plane with multi-language SDKs, cross-agent communication, human-in-the-loop, cryptographic identity, canary deployments, and built-in observability.

agentsinfrastructureopen-source
{}

Mini SWE Agent

The 100-line AI agent that solves GitHub issues. Scores >74% on SWE-bench verified. Radically simple: bash-only tool, linear message history, no huge configs. By the Princeton/Stanford team behind SWE-bench.

swe-benchagentsminimal