How Hex Evaluates AI Agents for Data Analytics
Data analytics is a uniquely cursed domain for agents. Easy questions look hard, hard questions look easy, and many are impossible to answer. Hex built custom eval infrastructure called The Shoebox, plus an entire fake business, to test its agents against realistic data.
Why data analytics is uniquely hard for agents
Data analytics is a uniquely cursed domain for agents to operate in. Easy questions look hard. Hard questions look easy. Many questions are impossible to answer; to even try is to fail. Bugs are usually silent and subtle. Innocuous assumptions make or break analyses. There are no linters, no test suite, no formalization language. There is almost no realistic public data to train on, and everyone's data warehouse is out of distribution. For every right answer, there are ten plausible but subtly wrong ones, and no way to verify or validate the result.
The Shoebox - Hex's eval infrastructure
What started as a hacky tool to view agent traces has evolved into a full-fledged lab bench for agent observability and evaluation. It powers ad-hoc and scheduled evaluations for all agents, supports experimental treatments and pairwise comparisons, and even exposes agent skills that let coding agents experiment against evals in an autoresearch-like loop.
Pairwise experiment design
Everything about Shoebox is designed to help users think about evaluations as pairwise experiments with a 'candidate' and a 'baseline' run rather than standalone tests. It runs as part of the local dev stack for maximum flexibility, but connects to a shared internal workspace where eval sets run daily to establish shared production baselines accessible to everyone.
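A minimal sketch of what a candidate-versus-baseline comparison could look like, assuming each run yields one rubric score per eval case. The names CaseResult and pairwise_summary, the [0, 1] score scale, and the tie tolerance are all hypothetical; the source does not describe Shoebox's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    score: float  # assumed rubric score in [0, 1]


def pairwise_summary(
    baseline: List[CaseResult],
    candidate: List[CaseResult],
    tolerance: float = 0.05,  # hypothetical threshold for calling a tie
) -> Dict[str, float]:
    """Compare candidate to baseline case by case instead of in isolation."""
    base_by_id = {r.case_id: r.score for r in baseline}
    wins = losses = ties = 0
    deltas = []
    for result in candidate:
        if result.case_id not in base_by_id:
            continue  # only cases present in both runs are comparable
        delta = result.score - base_by_id[result.case_id]
        deltas.append(delta)
        if delta > tolerance:
            wins += 1
        elif delta < -tolerance:
            losses += 1
        else:
            ties += 1
    mean_delta = sum(deltas) / len(deltas) if deltas else 0.0
    return {"wins": wins, "losses": losses, "ties": ties, "mean_delta": mean_delta}


baseline_run = [CaseResult("a", 0.5), CaseResult("b", 0.8), CaseResult("c", 0.4)]
candidate_run = [CaseResult("a", 0.9), CaseResult("b", 0.8),
                 CaseResult("c", 0.1), CaseResult("d", 1.0)]
summary = pairwise_summary(baseline_run, candidate_run)
```

Reporting wins, losses, and ties alongside the mean delta keeps a single large regression from hiding behind an unchanged average.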
Hybrid local/remote workflow
Most people use an efficient hybrid workflow where they compare locally executed candidate runs against remotely executed production baselines. Even if 10 people are running 25 experiments between them, the baseline holds relatively static across the entire population. It's easy for anyone to spin up a new branch, make some code changes, and then run evals against a shared, consistent remote baseline.
Artisanal evals over sprawl
Eval sets are relatively small compared to public benchmarks. Rather than regularly writing net new evals, the team prefers to artisanally craft strong, broadly applicable ones that are rich enough to yield new signal whenever someone layers a new rubric on top. Most eval sets run with additional rubrics such as ToolEfficiency, SemanticLayerUsage, and WorkspaceGuideAdherence.
Hypothesis objective rubrics
Users can create flexible run-scoped 'hypothesis objective' rubrics that allow for more targeted pairwise evaluation scoped to a particular experiment. These LLM-judged rubrics consider a candidate and baseline trajectory side-by-side at judge time, and even have access to post-run metadata so you can evaluate things like speed and cost in addition to behavior and accuracy.
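As a rough illustration of a side-by-side judge input, the sketch below assembles one prompt containing both trajectories and their post-run metadata. The Run fields, the prompt wording, and the verdict labels are assumptions made for this example, not Hex's actual rubric format, and the call to a judge model is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    label: str             # "baseline" or "candidate"
    trajectory: List[str]  # assumed: one short string per agent step
    latency_s: float       # assumed post-run metadata
    cost_usd: float        # assumed post-run metadata


def _render(run: Run) -> str:
    """Format one run as a numbered step list followed by its metadata."""
    steps = "\n".join(f"  {i + 1}. {step}" for i, step in enumerate(run.trajectory))
    return (
        f"### {run.label.upper()}\n{steps}\n"
        f"  latency_s={run.latency_s:.1f} cost_usd={run.cost_usd:.4f}"
    )


def build_judge_prompt(hypothesis: str, baseline: Run, candidate: Run) -> str:
    """Build a single prompt so the judge sees both runs at the same time."""
    return "\n\n".join([
        "You are judging one experiment. Compare the two runs side by side.",
        f"Hypothesis objective: {hypothesis}",
        _render(baseline),
        _render(candidate),
        "Answer with exactly one of: CANDIDATE_BETTER, BASELINE_BETTER, NO_DIFFERENCE.",
    ])


prompt = build_judge_prompt(
    "The candidate reaches the same answer with fewer tool calls.",
    Run("baseline", ["list tables", "run sql", "run sql"], 42.0, 0.12),
    Run("candidate", ["run sql"], 20.5, 0.05),
)
```

Putting latency and cost in the same prompt as the trajectories is what lets a single rubric weigh speed and cost against behavior.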
Shorelane Commerce - A fake business with realistic data
Because building data agents is so hard, there's a dearth of great agentic analytics benchmarks. Most public eval sets are simple text-to-SQL tasks that don't actually map to the problem space. So they created a fake business: Shorelane Commerce, a B2B2C office-supplies platform founded in 2019, currently doing ~$129M in yearly revenue.
Realistic data debt
Shorelane has accumulated data debt that's now felt acutely. They migrated platforms in 2021 and lost some customer IDs. They acquired a competitor that year and never fully merged the data. They renamed a sales channel in 2022 without backfilling. They restructured subscription plans in 2023 and grandfathered enough customers that all three worlds are still in circulation.
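To see why a rename without a backfill is a trap, consider this small sketch. The channel names, the cutover date, and the function names are invented for illustration; the source only says that a sales channel was renamed in 2022 without backfilling.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Optional, Tuple

RENAME_DATE = date(2022, 7, 1)             # hypothetical cutover date
OLD_NAME, NEW_NAME = "partner", "reseller"  # hypothetical channel names


def channel_label(order_date: date) -> str:
    """Label as stored: historical rows keep the old name (no backfill)."""
    return NEW_NAME if order_date >= RENAME_DATE else OLD_NAME


def revenue_by_channel(
    orders: Iterable[Tuple[date, float]],
    aliases: Optional[Dict[str, str]] = None,
) -> Dict[str, float]:
    """Sum revenue per channel, optionally mapping old labels to new ones."""
    aliases = aliases or {}
    totals: Dict[str, float] = defaultdict(float)
    for order_date, amount in orders:
        label = channel_label(order_date)
        totals[aliases.get(label, label)] += amount
    return dict(totals)


orders = [(date(2022, 3, 1), 100.0), (date(2022, 9, 1), 150.0)]
naive = revenue_by_channel(orders)
merged = revenue_by_channel(orders, aliases={OLD_NAME: NEW_NAME})
```

The naive grouping reports two channels where the business has one, and nothing errors out. That is the kind of silent, plausible mistake an agent has to notice on its own.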
Representative source systems
The source systems are a fairly representative set: Stripe, Salesforce, a legacy Shopify that's mostly a red herring, three ad platforms with three different conversion totals. Every customer has at least two IDs, and sometimes four. Five columns could plausibly be called 'revenue,' and finance, marketing, and ops stakeholders each habitually reference a different one.
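One common way to reconcile several IDs per customer is to treat known ID pairs as links and group them with a union-find structure. The sketch below is a generic illustration of that technique, not a description of how Shorelane's models or Hex's agents do it, and the ID strings are made up.

```python
from typing import Dict, Iterable, List, Set, Tuple


def resolve_identities(links: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group IDs that are transitively linked into one customer cluster."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters: Dict[str, Set[str]] = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(clusters.values(), key=lambda group: sorted(group))


# Hypothetical ID pairs observed in different source systems.
links = [
    ("stripe:cus_1", "salesforce:001"),
    ("salesforce:001", "shopify:77"),
    ("stripe:cus_2", "salesforce:002"),
]
customers = resolve_identities(links)
```

Here three IDs collapse into one customer and two into another, so any per-customer metric computed on raw IDs would overcount.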
Scale and complexity
Shorelane is represented by 30,000 handcrafted lines of data generators, dbt models, warehouse documentation, events, triggers, and stakeholder personas with their own histories. These produce six years of realistic, interesting data across millions of rows and dozens of tables. This means evals don't need to be strange, contorted prompts designed to trick the agent.
What still sucks
It's always fun to build internal tools. They fit like a glove. But there is quite a lot to dislike about the current setup.
Maintenance burden
The biggest issue is maintenance. This is a lot of surface area to support. They've carved out a 'demilitarized zone' in the monorepo where PR review criteria are lowered, and they vibe-code most of the frontend. They spent careful time up front on security and data handling, but the tool is perpetually in a bit of a janky state.
Core product integration
Many benefits of Shoebox stem from how deeply integrated it is with their actual application and execution environment. There's no eval-reality drift, because any improvements to Hex itself automatically take effect. But sometimes engineers doing product work unexpectedly need to contend with strange eval-specific wiring and piping.
LLM judge accuracy
LLM judge accuracy is a particularly problematic area. A hybrid approach, in which the LLM judge has ground truth available at judge time, works best. Even so, the team still sometimes struggles to calibrate and align judges well enough that high-level aggregate numbers are fully trustworthy. The judges tend to be overly harsh.
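One way to quantify judge harshness is to compare judge verdicts against a small set of human-labeled cases. The sketch below assumes binary pass/fail verdicts and the existence of such human labels, neither of which the source confirms; the function and metric names are hypothetical.

```python
from typing import Dict, List, Tuple


def judge_calibration(pairs: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Summarize judge quality from (human_pass, judge_pass) pairs."""
    if not pairs:
        raise ValueError("need at least one labeled case")
    agree = sum(human == judge for human, judge in pairs)
    false_fail = sum(human and not judge for human, judge in pairs)
    false_pass = sum(judge and not human for human, judge in pairs)
    human_pass = sum(human for human, _ in pairs)
    human_fail = len(pairs) - human_pass
    return {
        "agreement": agree / len(pairs),
        # Share of human-approved answers the judge rejected (harshness).
        "false_fail_rate": false_fail / human_pass if human_pass else 0.0,
        # Share of human-rejected answers the judge accepted (leniency).
        "false_pass_rate": false_pass / human_fail if human_fail else 0.0,
    }


labeled = [(True, True), (True, False), (True, False), (False, False)]
report = judge_calibration(labeled)
```

A false-fail rate well above the false-pass rate would be consistent with a judge that is biased toward harshness.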
Environment consistency
By far the biggest source of pain is the workspace and environment surrounding evals. It's really hard to maintain a stable environment that lets them test memory, workspace content, search, prior-artifact usage, and warehouse execution reliably for everyone while staying flexible enough not to be annoying.
Sowing, reaping, evaluating
Maintaining all of this infrastructure takes up a lot of time and thought, but the tax is worth it to have such a flexible system that any engineer can use to run shared evals or spin up their own extremely custom eval sets and rubrics for new features they've worked on. As code becomes easier to generate, they're increasingly limited only by what they can dream up.