I built an agent that survived 97,000+ synthetic documents
What competing in America’s Next Top Modeler taught me about building eval-driven agents for enterprise
I spent my Saturday at a hackathon where demos didn’t matter.
No judges’ panel. No pitch deck. No “most inspiring use of AI” trophy.
Instead, America's Next Top Modeler (ANTM), hosted by Theory Ventures and their Head of AI, did something much more interesting: it turned agent building into an obstacle course. You don't get to be the player. You're the coach. Your job is to build an agent that can run the course for you.
The course, in this case, was Retail Universe: a totally fake, totally cursed enterprise simulation constructed by Bischof, full of messes and twists:
90k+ files of parquet, logs, PDFs, CSVs, and stray text
Disparate databases of data about an e-commerce company and its 3PL vendors
A physical paper binder with critical context and no clean digital twin
A human analyst competed alongside us as the human baseline
At least one training question turned out to be flat-out wrong
It was less “toy RAG example” and more “this is what happens when 10 departments ship systems for five years and nobody cleans up.” In other words: reality.
At the end of the day, my agent placed in the top 20 out of ~100 teams. More importantly, it gave me a clearer mental model for how to build agents that can survive enterprise chaos, not just look good in a demo.
This is my latest entry in a series of devlogs on how I'm building agentic systems. I came away from this hackathon even more convinced that evals are the only sane way to build AI systems.
Planning the agent before seeing the dataset
Going into ANTM, I didn’t know the exact dataset, but I knew the shape of the challenge: build a context agent that can handle messy data, and do it in a single day.
So I started by writing a spec.
I spent the first half hour on BRAINSTORM.md, diligently splitting the premise into parts:
Ingest a tangled enterprise dataset
Untangle it into a usable schema that connects all the parts
Use the schema to answer questions against a hidden eval set
The questions we had to answer ranged in complexity. Here’s a question that was looking for a single string (representing the most successful quarter) as a response:
## the quarter that succeeds most

**Question:** What was our most successful quarter by far (by total net profit)?

**Observations:**

```json
{
  "question": "string",
  "successful_quarter": "string",
  "difficulty": "int"
}
```
However, some questions involved synthesizing multiple answers that all depended on each other, making it impossible to guess:
## the funnel that needs optimization

**Question:** Our search funnel (Landing → Search → Item → Cart → Checkout) needs optimization. We notice mobile converts far worse, can you find the specific browser and funnel-stage combination where the dropoff is significantly worse than average for that funnel-stage. What's the percentage dropoff for that funnel-stage and browser combination?

**Observations:**

```json
{
  "question": "string",
  "browser": "string",
  "worst_stage": "string",
  "conversion_percent": "float",
  "percentage_points_worse": "float",
  "difficulty": "int"
}
```
Everyone in the room took a different approach. Some ran multiple Claude Code instances in parallel, indexed them on parts of the data, and asked them questions without wiring them together. That worked surprisingly well. But I decided to build an agent using LangGraph, with entry points to both ingest the 97,000+ documents and also answer questions about the data. In effect, I designed a harness for Cursor to use:
LangGraph as the orchestrator
LanceDB + RAG for unstructured retrieval
MotherDuck / DuckDB for SQL over structured data
DSPy for optimization and critic loops
DSPy was the glue layer that made this more than “just call the model again and hope.” Instead of raw prompts everywhere, I defined a small set of DSPy modules that wrapped the LLM in clear roles:
PlannerModule takes a schema summary and a natural-language question and turns it into DuckDB SQL. It teaches the model what "correct" looks like: joining store_sales to date_dim properly, using d_moy instead of d_month, and not free-styling column names.
CriticModule is a single-pass repair loop for broken SQL. It reads the failed query, the error message, and the schema summary, then proposes a fixed version. It's basically a focused "fix this, don't reinvent it" CoT wrapper.
ExtractionModule turns semi-structured text (summaries, logs, RAG hits) into strict JSON. It knows it must emit valid JSON that matches a schema hint.
SchemaMatcherModule kicks in when SQL columns aren't found: it reads the MCP error (including candidate bindings), the attempted query, and the schema, then returns a JSON mapping like {"d_month": "d_moy"}.
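To make that concrete, here's a hedged sketch of the planner in DSPy form; the signature fields, docstring, and names are my reconstruction, not the exact code:

```python
import dspy

# Reconstruction of the planner module; field names are illustrative.
class PlanSQL(dspy.Signature):
    """Translate a natural-language question into a single DuckDB SELECT statement."""
    schema_summary = dspy.InputField(desc="tables, columns, and known join paths")
    question = dspy.InputField(desc="the natural-language question to answer")
    sql = dspy.OutputField(desc="DuckDB SQL that answers the question")

class PlannerModule(dspy.Module):
    def __init__(self):
        super().__init__()
        # Chain-of-thought so the model reasons about joins before emitting SQL
        self.plan = dspy.ChainOfThought(PlanSQL)

    def forward(self, schema_summary: str, question: str):
        return self.plan(schema_summary=schema_summary, question=question)
```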
DSPy was the precision layer I used to teach the LLM how to plan, fix, and map things inside a very opinionated environment. All of that plugged into the LangGraph workflow like this:
Router decides: is this PDF, SQL, logs, or hybrid?
Retriever pulls relevant chunks from vectors
Extractor turns text into JSON aligned with a small schema
Normalizer hardens types and IDs
Planner turns questions into SQL
Critic repairs bad SQL or broken JSON
Report node assembles the final answer with citations
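Here's a minimal sketch of how those nodes wire together in LangGraph; the state fields, node bodies, and the toy routing rule are simplified stand-ins for the real graph:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

# Simplified state; the real one also carried chunks, citations, and errors.
class AgentState(TypedDict, total=False):
    question: str
    route: str
    sql: str
    answer: dict

def router(state: AgentState) -> AgentState:
    # Deterministic, boring routing -- no LLM call here
    state["route"] = "sql" if "net profit" in state["question"].lower() else "hybrid"
    return state

def planner(state: AgentState) -> AgentState:
    # In the real harness this calls the DSPy PlannerModule
    state["sql"] = "SELECT 1"
    return state

def report(state: AgentState) -> AgentState:
    state["answer"] = {"question": state["question"], "sql": state["sql"]}
    return state

graph = StateGraph(AgentState)
graph.add_node("router", router)
graph.add_node("planner", planner)
graph.add_node("report", report)
graph.set_entry_point("router")
graph.add_edge("router", "planner")
graph.add_edge("planner", "report")
graph.add_edge("report", END)
app = graph.compile()

print(app.invoke({"question": "What was our most successful quarter by net profit?"}))
```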
It was the same philosophy I used for our AI code review benchmark: a harness first, results second. You don’t start by building the end goal. You start by building a loop that can evaluate whether any of it works.
The fake enterprise that felt uncomfortably real
Then we all got smacked with the actual dataset. We knew it was going to be gnarly, but I don’t think anyone expected 97,283 documents.
Retail Universe was an e-commerce simulation dialed to 11. It even had lore (I hear there was an acquisition at some point). This wasn’t a tidy API, three tables, and a bundle of PDFs. This was a data swamp with a story.
On top of that, the organizers added a twist: They brought in a human analyst to compete without an LLM. No tools. No RAG. Just raw human pattern matching over the same mess of files the “old school” way.
The point wasn't to embarrass anyone's way of working, but to celebrate the individual strengths and weaknesses of each approach.
What I actually built in six hours
I didn’t implement every line of my dream spec. Time is real. Scope creep is my personal demon. But here’s how I survived:
1. Engineering the context for Cursor
The first 20-30 minutes were not spent coding.
I used that time to:
Pull a copy of the hackathon site as Markdown so my agent had the same brief I did
Semantically organize Retail Universe into sane folders so Cursor could see the terrain
Convert my BRAINSTORM.md document into SPEC.md
Let Cursor scaffold the project structure for my LangGraph + LanceDB + DuckDB stack
Wire in the MCP server that the organizers gave us, so my agent could use their recommended tool for querying the SQL
2. A LangGraph spine glued to LanceDB + DuckDB
The functional harness looked like this:
LangGraph workflow to coordinate routing, planning, execution, and synthesis
DuckDB to query parquet-based fact/dimension tables
LanceDB to store vectorized chunks of PDFs and docs
DSPy modules to:
Ingest docs for chunking and storing
Turn questions and schema summaries into SQL queries
Use the MCP server to run the SQL and fetch results, or
Scan the vector database for relevant unstructured data
Act as a critic to repair queries when they failed or returned nonsense
Evaluate results to synthesize correctly-formatted answers
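For a rough sense of the two storage layers, here's a sketch; the file paths, table name, and specific columns are assumptions in the spirit of the real dataset, and the LanceDB table is assumed to already have an embedding function configured:

```python
import duckdb
import lancedb

# Structured side: DuckDB reads the parquet fact/dimension tables directly.
con = duckdb.connect()
quarterly = con.sql("""
    SELECT d.d_year, d.d_qoy, SUM(ss.ss_net_profit) AS net_profit
    FROM read_parquet('retail_universe/store_sales/*.parquet') ss
    JOIN read_parquet('retail_universe/date_dim/*.parquet') d
      ON ss.ss_sold_date_sk = d.d_date_sk
    GROUP BY d.d_year, d.d_qoy
    ORDER BY net_profit DESC
""").df()

# Unstructured side: LanceDB holds the vectorized PDF/doc chunks.
db = lancedb.connect("lancedb/")
docs = db.open_table("doc_chunks")  # assumes an embedding function on the table
hits = docs.search("3PL vendor return policy").limit(5).to_list()
```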
Since Cursor had built and documented the harness, it was straightforward to feed it the questions and let it decide how to use the harness. That became our dance for the rest of the hackathon:
A question came in.
I categorized it manually alongside other questions we'd seen that asked for similar pieces of information. Why? Success for one commonly meant success for the others, and finding the patterns let me prioritize the easiest points.
I fed the question into Cursor, which used the harness to feed documents into DuckDB/LanceDB and query for data. If a query failed, the DSPy critic kicked in and patched the ingest or query.
The agent evaluated the data it had against the question and repeated the previous step until it had all the context it thought it needed to synthesize an answer.
I didn't go wild with parallelization or branching. I watched nearby teams do that by spinning up armies of agents, and they commonly lost track of how their own systems worked. I chose to build a harness so I could evaluate more empirically. It was slower than running lots of agents in parallel, but it stayed on rails.
3. DSPy and the WHAT_WE_DID_WRONG.md moment
At some point, frustrated with the agent's inability to perform joins across the SQL tables, I asked Cursor, very vulnerably:
“What are we doing wrong with our current approach?”
It responded the way only a slightly judgmental copilot can: by writing a document called WHAT_WE_DID_WRONG.md.
The first line dragged:
“We’ve been fighting the LLM instead of teaching it.”
But it was also completely right.
The doc basically roasted me for:
Adding too many rules to prompts
Building repair tools to patch bad SQL instead of fixing root causes
Debugging individual queries instead of optimizing the planner
Forgetting to use correct SQL examples as training signals for DSPy
It then proposed the obvious fix:
Take working SQL we’d already discovered
Turn those into dspy.Example objects
Use BootstrapFewShot to tune the planner
Focus on getting one question perfectly right, then generalize
In other words: Examples > Rules. Don’t tell the tool. Show it.
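Here's roughly what that fix looks like in DSPy; the training example, the metric, and the demo count below are stand-ins for what I actually captured:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Known-good SQL we had already verified by hand, packaged as a training example.
trainset = [
    dspy.Example(
        schema_summary="store_sales(ss_sold_date_sk, ss_net_profit, ...), date_dim(d_date_sk, d_moy, d_qoy, ...)",
        question="What was our most successful quarter by total net profit?",
        sql="SELECT d_year, d_qoy, SUM(ss_net_profit) FROM store_sales "
            "JOIN date_dim ON ss_sold_date_sk = d_date_sk "
            "GROUP BY 1, 2 ORDER BY 3 DESC LIMIT 1",
    ).with_inputs("schema_summary", "question"),
]

def uses_right_joins(example, prediction, trace=None):
    # Crude stand-in metric: the generated SQL should touch the same tables
    # as the known-good query instead of inventing its own.
    return all(t in prediction.sql for t in ("store_sales", "date_dim"))

optimizer = BootstrapFewShot(metric=uses_right_joins, max_bootstrapped_demos=4)
tuned_planner = optimizer.compile(PlannerModule(), trainset=trainset)  # PlannerModule from the sketch above
```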
I didn’t have time to run a full-on DSPy optimization campaign, but even leaning into example-based prompting inside the planner helped. The planner made fewer wild guesses and more “this looks like past successful queries” moves that let us start knocking out questions.
The binder, the wrong question, and loving the mess
Three curveballs from the challenge have stuck with me.
1. The physical binder
There was a paper manual with explanations that did not exist in clean, digital form. Because sometimes the policy you need to reference lives only in a dusty binder sitting on a warehouse shelf.
We had three choices:
Ignore it (and risk missing key details about the data)
Sit down and read the 30+ pages yourself (and risk wasting time on it)
Turn it into a new digital artifact that your agent could use
I went hybrid and skimmed the binder to find points that might relate to my schema. I wrote some bullet points manually for Cursor to reference. One of the questions I later answered came from this knowledge. Score!
2. The wrong training question
It turns out one of the training questions was simply wrong. Whether intentional or not, sometimes what you think is a rock-solid foundation actually isn't!
Competitors realized it when the data, the binder, and any sane SQL all agreed on one answer, and the "ground truth" insisted otherwise. Since these were our training questions, for many teams they served as the evals and the basis of everything else. If they were wrong, so were many other things. Yikes!
That's the dark side of eval-driven engineering: your evals can mislead you if they're not measuring the right thing. But the lesson for me wasn't "don't trust evals." It was to treat evals as instruments, not oracles. You still have to validate them, challenge them, and occasionally debug them.
3. Competing with a human analyst
Watching the human analyst attack the same problem set with no LLM was instructive.
The analyst's workflow was specific and leveraged the large context of a data-trained mind:
Skim the tables
Sketch a mental model of how data flowed
Anchor yourself on a handful of reliable reference points
Ignore irrelevant noise
I realized that a good agent needed to mimic this behavior. Namely:
Build a minimal schema of reality (skim)
Discover reliable join paths (sketch)
Learn which sources are trustworthy (anchor)
Ignore 80% of irrelevant context (ignore)
So how could my agent decide what to skim, what to sketch, what to anchor, and what to ignore? Well, that’s where evals came in!
4. Evals as the hill to climb
I came away from ANTM even more convinced that evals are the only sane way to make AI systems better over time. A secondary loop formed once the first one was operational:
From the training questions and the questions I got right, I worked backwards to build evals
Cursor used those evals to sanity-check new SQL templates and planner tweaks as DSPy optimized
The more I captured traces to reference for future runs, the more the agent’s behavior stopped being mysterious and started being steerable
Capturing when things went wrong, learning from those mistakes, and applying them to new problems allowed Cursor and me to reason through the edge cases in the data and improve the harness.
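In practice that loop was just a small script; the file path and the `answer_question` entry point below are placeholder names, not my exact harness:

```python
import json

def run_evals(answer_question, path="evals/training_questions.json"):
    """Run the harness entry point against every captured question and report pass/fail."""
    with open(path) as f:
        cases = json.load(f)
    passed = 0
    for case in cases:
        predicted = answer_question(case["question"])
        ok = all(predicted.get(k) == v for k, v in case["expected"].items())
        passed += ok
        if not ok:
            print(f"FAIL {case['id']}: expected {case['expected']}, got {predicted}")
    print(f"{passed}/{len(cases)} passing")
```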
What I’ll change in my own agent work
Emerging from ANTM, a few rules of thumb hardened for me:
1. No agent stack without evals from day zero.
I don’t want to ship another “agent” that doesn’t come with its own harness. Questions, scenarios, manifests. You need something that shows if it got better or worse.
2. Never let the router be an LLM.
Routing needs to be boring, deterministic, and debuggable. MIME types, file paths, simple rules. Save the model capacity for reasoning, not “guess what kind of thing this is.”
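Something as dumb as an extension map does the job; the route names below are illustrative, not my exact ones:

```python
from pathlib import Path

# Deterministic routing: file extensions and paths, never a model call.
ROUTES = {".parquet": "sql", ".csv": "sql", ".pdf": "rag", ".log": "logs", ".txt": "rag"}

def route_file(path: str) -> str:
    return ROUTES.get(Path(path).suffix.lower(), "hybrid")
```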
3. Design for humans-in-the-loop, not humans-on-the-sidelines.
Manually bucketing questions turned out to be a huge advantage. It gave me visibility into failure modes and let me target improvements. Agents shouldn't replace your own analysis; they should make it more actionable.
4. Artifacts are infrastructure.
Binders, specs, scratch notes, schema docs, evals… all of these are now first-class citizens in my mental model. They're not "supporting docs." They're the substrate where humans and agents build up a shared culture for each project.
5. Love the mess.
If your system only works on clean docs and tidy tables, it’s not ready. Retail Universe was a great reminder that the real job to be done is building agents that can see through the fog of war, not pretend it doesn’t exist.
What this taught me about the future of software engineering
For me, ANTM wasn’t just a weekend hack. It was another data point in a year-long thread of building systems that can actually survive real-world mess:
Vibe coding live with Goose, letting agents reset and reshape a messy repo
Building a benchmark harness for AI code review that turned into a product pattern
Using Warp agents in a multiplayer quiz game to test agent-to-agent handoffs
Filming with CodeTV to build a web app with agents that can’t be interacted with by mouse, touch, or keyboard
Now, stress-testing a context agent against Retail Universe under a hard eval
Retail Universe was fictional, but the chaos it simulated is exactly what engineering teams deal with every day: scattered tools, uneven processes, inconsistent data, and a constant pressure to prove that AI is helping, not hurting.
That’s the same problem space LinearB lives in: agentic development isn’t just about building smarter agents, it’s about seeing clearly how humans and AI actually work together.
At ANTM, that meant comparing my agent’s work to a human analyst’s workflow, and then using evals to separate “cool behavior” from “actually useful”. Inside a real engineering org, the questions are eerily similar for scenarios like code review:
Are developers actually using Copilot, Cursor, and other AI tools?
Is AI-generated code being trusted and merged, or second-guessed and rewritten?
Is AI speeding up delivery, or just creating more reviews and rework?
That's where LinearB comes in. We're giving engineering leaders an AI productivity platform that makes this stuff measurable: AI insights that show how tools like Copilot and Cursor are really being adopted, with daily active users, engagement, and where AI is helping or quietly getting in the way. We go one step further and correlate that with delivery metrics, so you can see how AI usage lines up with cycle time, PR size, review load, and team health instead of guessing about ROI. It's easy to sign up for a free account and start seeing AI's bigger picture for your organization.
And for me, ANTM just reinforced why that matters.