
Building Better LLM Benchmarks

NBA Stats, Balatro, and the Cost of Curiosity. What I learned building RL environments with Prime Intellect, and why I'm now building CAMB.

What makes a human intelligent?

On its surface this might seem like a simple question. Standardized testing, income, quality of life, happiness. There should be ways to figure this out. But ask the people in your life who they consider to be the most and least intelligent, and you will often find two peers you respect placing the same third peer in opposite buckets.

If intelligence is so obvious, why do smart people routinely disagree on what the smart action is?

The answer, of course, is that there is no answer to the question. Tests can be gamed, income can be stumbled into, and happy people can be dim. The same applies to benchmarking Large Language Models. As a former (and, I like to tell myself, moonlighting) data scientist, I tend to be very skeptical of benchmarks, especially public benchmarks. There seem to be myriad ways to game them, to the point that they stop being useful. The nasty part is that when good benchmarks like the pelican-on-a-bicycle or balls-in-a-circle tests are released, they instantly become useless as LLMs begin to train on them.

I've been thinking a lot lately about agentic memory, specifically for coding agents, and the best way to benchmark it. Check out this discussion from Mutable Reference. They posted on HN about a tool that gives Claude Code memory, and the comments are full of people asking the same question over and over again.

Does this work?

The current coding benchmarks are focused on how well agents code, not how well they can answer questions like "what was the solution to auth we used 3 months ago on that one project when I was learning Rust?"

So we plod along. Users seem to be finding the tools useful, and there are several libraries dedicated to agentic coding memory, but there still does not appear to be a standardized benchmark for coding agent memory the way we have SWE-bench for coding ability.

One of the first questions I wanted to answer when building this benchmark was: how can I isolate the impact of the model on performance vs the memory architecture?

How would I know if the synthetic memory I had created was not just an LLM regurgitating facts, but a facsimile (a simple one, of course) of a human who can reason, plan, and use memory effectively to complete multiple long-range tasks?

CAMB: a coding agent memory benchmark

So I'm building one. CAMB (Coding Agent Memory Benchmark) is my attempt at a standardized benchmark for coding agent memory, built around three task types.

ERC (Entity-Relation Completion) tests whether an agent can explore a repo and build a correct mental model of its entities, relationships, and structure. RQE (Relation-Query Entailment) tests whether an agent can verify hypotheses about code relationships using its memory. SWE sequences test whether memory from solving one issue helps solve the next related issue in the same repo.

Scoring is weighted: 30% formation, 50% retrieval, 20% resumption. There are five budget tiers (NANO through LARGE) that cap tokens, tool calls, and wall time, so you can measure how efficiently agents use memory under real constraints.

[Figure: CAMB scoring architecture. Formation 30% (entity/relation extraction), Retrieval 50% (query accuracy from memory), Resumption 20% (sequential task carry-over).]
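
To make the weighting concrete, here's a minimal sketch of how a composite score under this scheme could be computed. Only the 30/50/20 weights and the existence of NANO-through-LARGE budget tiers come from CAMB itself; the function, dataclass, and the specific tier caps below are illustrative assumptions, not CAMB's actual code.

```python
# Hypothetical sketch of CAMB-style weighted scoring. The 30/50/20 weights are
# from the benchmark description; everything else here is illustrative.
from dataclasses import dataclass

WEIGHTS = {"formation": 0.30, "retrieval": 0.50, "resumption": 0.20}

# Assumed budget caps (max_tokens, max_tool_calls, max_wall_seconds) for three
# of the five tiers. The real NANO-through-LARGE limits are placeholders here.
BUDGET_TIERS = {
    "NANO": (50_000, 25, 120),
    "SMALL": (200_000, 100, 600),
    "LARGE": (1_000_000, 500, 3_600),
}

@dataclass
class ComponentScores:
    formation: float   # entity/relation extraction accuracy, 0-1
    retrieval: float   # query accuracy from memory, 0-1
    resumption: float  # sequential task carry-over, 0-1

def camb_score(scores: ComponentScores) -> float:
    """Weighted composite on a 0-1 scale."""
    return (WEIGHTS["formation"] * scores.formation
            + WEIGHTS["retrieval"] * scores.retrieval
            + WEIGHTS["resumption"] * scores.resumption)

print(camb_score(ComponentScores(formation=0.8, retrieval=0.6, resumption=0.5)))  # 0.64
```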

I've built baselines (no memory, naive schema summary, rolling summary, exact-match key-value) and adapters for Claude Code, Claude with memory, and CMK. Task suites cover Qdrant, FastAPI, and Pydantic repos.

CAMB is in alpha. The architecture is stable, the task format is locked, and the scoring pipeline works. But the suite coverage is still thin and I'm iterating on the SWE sequence scoring.

I'm sharing it early because the questions matter more than polished results right now. If you have feedback on the approach, the task design, the scoring weights, or want to contribute task suites for other repos, reach out.

But building CAMB surfaced a hard problem: how do you isolate the memory system's contribution from the LLM's raw capability? If a coding agent with memory scores higher than one without, is that because the memory architecture is good, or because the underlying model is just stronger at multi-hop reasoning?

I needed to understand what LLMs can and can't do with structured tool use before I could design a fair memory benchmark. That question led me down a rabbit hole that ended at Prime Intellect's Environment Hub and, eventually, to building my first reinforcement learning environments: an NBA stats benchmark and a Balatro card-game benchmark.

This post walks through that journey. Not the polished version, but the actual messy process of building something, watching it break, and iterating until it worked.

Discovering Prime Intellect and the bounty program

I came across Prime Intellect while researching ways to make agents work better. They had written a remarkably thorough blog post on RLMS (Reinforcement Learning Management Systems) research engineering, and it immediately stood out: one of the clearest, most grounded explanations I'd seen of how to build environments for training LLM agents.

While exploring their work, I found their bounty program, designed to onboard new contributors into building RL environments.

At the time I looked, there was exactly one open bounty left unclaimed:

NBA bench, with a "more info coming soon" note.

So I dug around and found that Will Brown from Prime Intellect had a GitHub repo called nba-benchmark, which contained NBA player stats across two seasons. That was the starting point for what became my first environment.

The rules of the game

Prime Intellect's bounties fall under a "Document Search Environments" category. The requirements are specific, and the critical difficulty constraint is explicit:

If gpt-4.1-mini gets a near-perfect score (90%+), your environment is too easy. If gpt-5 gets a near-zero score (10% or below), your environment is too hard, or broken.

That's a narrow target. You need to build something that genuinely separates frontier models from their predecessors, without being so brutal that everything collapses.

You also want to build an environment that measures the capability of LLMs so you can separate it from the performance of your retrieval system. There are many techniques we could use to enhance retrieval, but this benchmark is about the LLMs.

Attempt 1: the 5-tool kitchen sink

I didn't start with the 3-tool template. My first instinct was to give the model a rich set of tools and see what happens.

The stack

The environment exposed five tools: search_players, view_player, read_section, list_teams, and query_stats. query_stats was the power tool. It let models say "give me the top 5 centers by PPG" and get a direct answer. At the time, this seemed like a feature.
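
For context, this is roughly what query_stats looked like from the model's side, written as an OpenAI-style tool definition. The parameter names and descriptions are my reconstruction from the behavior described in this post, not the environment's actual schema. Note that the position filter only accepts the five standard codes, which matters later.

```python
# Reconstructed sketch of the query_stats tool definition (parameter names are
# assumptions; only the behavior -- filter, sort, return the top N directly --
# is described in the post).
QUERY_STATS_TOOL = {
    "type": "function",
    "function": {
        "name": "query_stats",
        "description": "Filter and sort player stats directly, e.g. 'top 5 centers by PPG'.",
        "parameters": {
            "type": "object",
            "properties": {
                "position": {"type": "string", "enum": ["PG", "SG", "SF", "PF", "C"]},
                "sort_by": {"type": "string", "description": "Stat to sort by, e.g. 'points'."},
                "order": {"type": "string", "enum": ["asc", "desc"]},
                "limit": {"type": "integer", "description": "Number of rows to return."},
            },
            "required": ["sort_by"],
        },
    },
}
```

A single call like this hands the model a sorted answer, which is exactly why it turned out to be a problem.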

How the retrieval layer works

Before any model touches the data, there's a pipeline that turns raw player stats into something searchable.

The raw data lives in JSONL files: one per season per phase (regular season and postseason). Each record has 16 fields: rank, player name, position, team, games played, minutes, FG%, FT%, 3PM, rebounds, assists, steals, blocks, turnovers, points, and a composite rating.

At startup, the environment groups all records by player name and generates a markdown document for each player. A single player's document contains every season and phase they appear in. That full markdown document is what gets embedded.

For embeddings I used BGE-small-en-v1.5 (from BAAI), which produces 384-dimensional vectors. It's a lightweight model, good enough for this use case where the documents are short and structured. Each player's full markdown document gets compressed into a single 384-dim vector.

Those vectors go into Qdrant, running in-memory (no persistent storage, rebuilds every time the environment starts). The collection uses cosine similarity for distance. There's also a text index on the player name field with word tokenization, which helps when the model searches for a player by name directly rather than by description.

When search_players gets called, it takes the model's natural language query, embeds it with the same BGE model, and does a cosine similarity search against all ~600 player vectors. The top results come back with metadata only: player name, position, and team list. No stats. The model has to make further calls to get the actual numbers.

I deliberately kept the retrieval simple. No reranking, no hybrid search, no query expansion. The benchmark is supposed to test the LLM, not the retrieval stack.
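
Here's a condensed sketch of that pipeline, assuming sentence-transformers and qdrant-client. The real environment may wire this differently; the doc-building helper and payload fields are illustrative.

```python
# Sketch of the retrieval layer: per-player markdown docs embedded with
# BGE-small-en-v1.5 and stored in an in-memory Qdrant collection.
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # 384-dim embeddings
client = QdrantClient(":memory:")                      # rebuilt on every startup

client.create_collection(
    collection_name="players",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
# The real environment also adds a word-tokenized text index on the name field.

def index_players(player_docs: dict[str, dict]) -> None:
    """player_docs maps name -> {'markdown': ..., 'position': ..., 'teams': [...]}."""
    points = []
    for i, (name, doc) in enumerate(player_docs.items()):
        points.append(PointStruct(
            id=i,
            vector=model.encode(doc["markdown"]).tolist(),
            payload={"name": name, "position": doc["position"], "teams": doc["teams"]},
        ))
    client.upsert(collection_name="players", points=points)

def search_players(query: str, limit: int = 5) -> list[dict]:
    """Return metadata only (name, position, teams) -- never stats."""
    hits = client.search(
        collection_name="players",
        query_vector=model.encode(query).tolist(),
        limit=limit,
    )
    return [hit.payload for hit in hits]
```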

The baseline results

Early runs on small samples (20 questions) came back all over the place. GPT-4.1-mini ranged from 0.55 to 0.825 depending on which questions got sampled. Not stable enough to draw conclusions. I needed harder, more consistent questions.

The Dyson Daniels mystery

While debugging why certain questions were failing, I noticed something weird: Dyson Daniels, the NBA steals leader at 3.0 STPG in 2024-25, wasn't showing up in queries about top-stealing guards.

The root cause: Daniels was listed with position "G" (generic guard), not "SG" or "PG". The query_stats tool only accepted the five standard positions as filters. So when you searched for shooting guards by steals, the actual steals leader was invisible.

This wasn't just Daniels. 314 players were listed as "G" and 286 as "F" (generic forward). That's a huge chunk of the dataset being silently filtered out.

My first instinct was to clean the data. But then I thought: this is exactly the kind of real-world data messiness that makes a benchmark interesting. Instead, I added "G" and "F" as valid position filters and updated the system prompt to explain: "To find all guards, you may need to query PG, SG, and G separately."

This turned a data bug into a reasoning challenge. Does the model read the system prompt carefully? Does it think to check multiple position codes? Most don't.
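
The fix itself was small. A sketch of the shape it took (the constant names are mine, not the environment's):

```python
# Accept the generic "G" and "F" codes alongside the five standard positions,
# and tell the model it has to cover all of them. Names here are illustrative.
VALID_POSITIONS = {"PG", "SG", "SF", "PF", "C", "G", "F"}  # previously just the five standard codes

GUARD_CODES = {"PG", "SG", "G"}     # what "all guards" actually requires querying
FORWARD_CODES = {"SF", "PF", "F"}   # same for forwards

SYSTEM_PROMPT_NOTE = (
    "Positions include generic codes: to find all guards, you may need to "
    "query PG, SG, and G separately (and SF, PF, F for forwards)."
)
```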

Making it harder

With the data issues fixed, I generated 93 hard questions across 11 categories, including exact arithmetic, precise ordering, counting edge cases, conditional aggregation, and cross-dataset joins. I kept pushing: remove the questions models got right, add nastier ones.

The final 5-tool results:

| Model | Accuracy | Avg tool calls |
|---|---|---|
| GPT-4.1-mini | 43-50% | 8.9 |
| GPT-5 | 86-89% | 20.8 |

Both within Prime Intellect's bounds. GPT-5 was brute-forcing its way to accuracy, hammering query_stats and read_section relentlessly. It averaged 20.8 tool calls per question vs. mini's 8.9.

I thought I was done. I was not.

The efficiency experiment (a detour)

GPT-5 making 20+ tool calls per question bugged me. In a real agentic system, every tool call costs time and tokens. A model that brute-forces its way to the answer isn't really "reasoning." It's just being thorough enough to stumble into correctness.

So I added two constraints: reduced max_turns from 15 to 8, and added an efficiency reward (weight 0.2): full bonus for ≤5 tool calls, linear decay from 6-10, near-zero beyond that.

| Model | Judge accuracy | Total score |
|---|---|---|
| GPT-4.1-mini | 55.4% | 0.554 |
| GPT-5 | 73.2% | 0.732 |

Wait. GPT-4.1-mini improved from ~50% to 55%? I made it harder and the score went up?

The efficiency bonus was additive. A model that got the answer wrong but used few tool calls would get a nonzero score from the efficiency bonus alone. The total possible score was 1.2, not 1.0. The numbers were inflated.

Lesson learned: if you're going to penalize efficiency, use multiplicative scoring, not additive. final_score = accuracy * efficiency_factor keeps things on a clean 0-1 scale.
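
In code, the difference between the broken version and the clean one is a single line. A sketch using the decay shape from my run (full bonus at ≤5 calls, linear decay through 10); the breakpoints are the ones I used, the rest is illustrative:

```python
# Efficiency factor: full credit at <=5 tool calls, linear decay from 6-10,
# zero beyond that.
def efficiency_factor(tool_calls: int) -> float:
    if tool_calls <= 5:
        return 1.0
    if tool_calls <= 10:
        return 1.0 - (tool_calls - 5) / 5.0  # 6 -> 0.8, ..., 10 -> 0.0
    return 0.0

def additive_score(accuracy: float, tool_calls: int) -> float:
    # Broken: a wrong answer with few calls still earns 0.2, and the
    # maximum possible score is 1.2 rather than 1.0.
    return accuracy + 0.2 * efficiency_factor(tool_calls)

def multiplicative_score(accuracy: float, tool_calls: int) -> float:
    # Clean 0-1 scale: efficiency can only discount a correct answer.
    return accuracy * efficiency_factor(tool_calls)
```

In practice you'd probably floor the factor (say, at 0.5) so a correct but verbose answer still earns partial credit, but the multiplicative structure is the point.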

I backed out the efficiency reward. But the turn cap insight was valuable.

The pivot: reading the template more carefully

At this point I went back and read Prime Intellect's requirements more carefully. They had a reference implementation that used exactly 3 tools: search_patents, view_sections, read_section. No direct query tool.

The reason hit me: query_stats was doing the hard work for the model. When you can ask "top 5 centers by PPG" and get a sorted list, you're testing the retrieval layer, not the model's reasoning. The model doesn't need to search, compare, or compute. It just reads the answer from the tool output.

The benchmark was supposed to test LLM inference. Mine was testing my search infrastructure.

Time to start over.

Attempt 2: the 3-tool redesign

I stripped the environment down to match the template: three tools (search, view, read).

No more query_stats. No more view_player. No more list_teams.

[Figure: the 3-tool flow. search (names only) → view (sections) → read (stats) → compare/reason → answer.]

Now to answer "who scored more: OKC's or BOS's leading scorer?", a model has to: search for OKC players, view sections, read stats, repeat for BOS, compare. That's 6+ tool calls minimum for a simple comparison. No shortcuts.

A real trace: the same question, two models

Question: "How did OKC's leading scorer's PPG change in the postseason?"

GPT-5 (4 calls, correct)

> search("OKC leading scorer") Shai G-A (PG, OKC) ...
> view(Shai G-A) [Regular, Postseason]
> read(regular) 32.7 PPG
> read(postseason) 29.9 PPG
CORRECT: -2.8 PPG

GPT-4.1-mini (12 calls, wrong)

> search("OKC") Jahlil Okafor (IND)
> search("OKC thunder") Westbrook (DEN/LAC)
> view(Keyontae Johnson)
> view(Luguentz Dort)
> view(Ousmane Dieng)
> read(Dort: 10.1 PPG)
> read(Dieng: 3.8 PPG)
> read(Hartenstein: 11.2)
...4 more calls...
WRONG: Hartenstein 11.2

GPT-5's first search immediately surfaced SGA. Mini never found him. It burned all 8 turns checking bench players and confidently declared Isaiah Hartenstein the leading scorer.

Named-player questions: too easy

I initially kept questions that named specific players: "How did Shai Gilgeous-Alexander's PPG change from 2023-24 to 2024-25?"

With the 3-tool design, this was trivial. Search the name, view sections, read both seasons, subtract.

GPT-4.1-mini: 96.9%. GPT-5: 93.8%. Separation: 0.

Mini actually outscored GPT-5 on easy named-player questions. When the task is just "look up a name and read a number," the cheaper model is perfectly capable. GPT-5's advantage only shows up when reasoning is required.

The AI-generated question disaster

I tried using GPT-5 itself to generate harder questions from the data. Fed it all the player stats and asked for 120 creative questions. It produced 124 questions that looked great on paper.

Both models scored 11.7%.

The questions weren't hard. They were broken. Many referenced stats not in the dataset. Others used ambiguous phrasing. Some had flat-out wrong reference answers.

Never use an LLM to generate benchmark questions without rigorous verification against the actual data.

Getting it right: programmatic generation

The solution was generating questions directly from the data, with no LLM in the loop. I wrote a Python script that loads all player records, computes actual correct answers, generates questions using indirect descriptions (team, position, rank, never player names), and validates every answer against the source data.

The critical design insight: never name specific players in the question. Force the model to discover who you're asking about through search and reasoning.
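
A trimmed sketch of that generator, assuming the records are already loaded as dicts. The field names and question templates are illustrative; the point is that the reference answer is computed from the data before the question is ever phrased.

```python
# Sketch of programmatic question generation: compute the ground-truth answer
# from the data first, then phrase the question indirectly (team, rank,
# position -- never the player's name). Field names are illustrative.
import json

def load_records(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

def team_leading_scorer_question(records: list[dict], team: str) -> dict:
    team_players = [r for r in records if r["team"] == team]
    leader = max(team_players, key=lambda r: r["points"])
    return {
        "question": f"Who was the leading scorer on {team}?",
        "answer": leader["player"],          # computed, never written by an LLM
        "difficulty": "easy",
    }

def combined_top_two_ppg_question(records: list[dict], team: str) -> dict:
    top_two = sorted(
        (r for r in records if r["team"] == team),
        key=lambda r: r["points"],
        reverse=True,
    )[:2]
    return {
        "question": f"What is the combined PPG of {team}'s top two scorers?",
        "answer": round(top_two[0]["points"] + top_two[1]["points"], 1),
        "difficulty": "hard",
    }
```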

The final set: 102 questions with a deliberate difficulty gradient:

| Difficulty | Count | Example |
|---|---|---|
| Easy (~25%) | 25 | "Who was the leading scorer on OKC?" |
| Medium (~35%) | 35 | "Which team had a higher-scoring leader: OKC or BOS?" |
| Hard (~40%) | 42 | "AST/TO ratio of the #1-ranked scorer in 2024-25?" |

The results

With 8 turns (capped)

| Model | Accuracy | Time (32 Qs) |
|---|---|---|
| GPT-4.1-mini | 19.4% | ~41s |
| GPT-5 | 45.2% | ~174s |

Both within Prime Intellect's bounds. But something felt off. I dug into the failure logs.

The models weren't getting questions wrong. They were running out of turns. GPT-4.1-mini: 18 of 25 failures hit the 8-turn limit. Only 7 were actual wrong answers. GPT-5: 16 of 18 failures hit the turn limit. Only 2 were wrong answers.

With 15 turns (uncapped)

| Model | 8 turns | 15 turns | Change |
|---|---|---|---|
| GPT-4.1-mini | 19.4% | 35.5% | +16.1pp |
| GPT-5 | 45.2% | 80.6% | +35.4pp |
[Figure: 15-turn results. GPT-5 at 80.6% vs GPT-4.1-mini at 35.5%, a 45pp gap.]

Now the picture is clear. GPT-5's advantage is genuine reasoning ability, not just brute-force tool calling. When you give both models enough runway, the stronger model pulls further ahead. Not because it makes more calls, but because it chains them together correctly.

What makes this benchmark hard

1. Indirect identification

When you ask "who led OKC in rebounds?", the model has to search for OKC players, figure out which one is the top rebounder (which search_players doesn't reveal), then read multiple sections to compare. Mini often gives up after checking one or two players and guesses.

2. The position trap

The Dyson Daniels problem generalizes. Players listed as "G" instead of "PG" or "SG" won't surface in position-specific searches. The model needs to know this and search broadly.

3. Multi-hop arithmetic

Questions like "what is the combined PPG of DAL's top two scorers?" require searching, reading two separate sections, extracting PPG from each, and adding them. Any mistake in the chain cascades.

4. Semantic search limitations

Searching "best blocker on the Spurs" might return Victor Wembanyama at the top, or it might not. Models need to search, then verify their results by actually reading the data.

Question: "Find a guard who averaged more than 10 assists per game" Answer: Trae Young (11.6 APG) GPT-4.1-mini searched for guards and got: Vit Krejci, Chris Duarte, Anthony Edwards, T.J. McConnell Checked all four, found none averaged 10+ assists, concluded: "No guard in the dataset averaged more than 10 assists per game." It never thought to search for "point guard assists leader" or "Trae Young." The model trusted its first search results as exhaustive and gave up.

5. GPT-5's brute force (when it works)

On the DEN assists question, GPT-5 made 29 tool calls in 8 turns. It systematically looked up every Nuggets player it could find. It found Jokic had 10.2 APG early but kept checking everyone else to be sure. Despite the thoroughness, it ran out of turns before producing a final answer. Correct reasoning, insufficient turn budget.

The cost of experimentation

One big, practical takeaway: LLM tokens are expensive, and that cost compounds quickly when you're running benchmarks.

Over the course of building this, I ran 10+ evaluation passes across two models, with multiple question sets, at two different turn caps. Some of those runs were on broken question sets that scored 11.7% and told me nothing useful.

The early iterations (the 5-tool version, the efficiency experiments, the multiple rounds of question hardening) all burned tokens that ultimately got thrown away when I redesigned for 3 tools. That's the nature of iterative development, but it adds up fast.

If you're not funded, experimentation like this can get prohibitively expensive. Programs like Prime Intellect's bounty system help by providing inference credits, but it's worth saying out loud: there's real financial friction to doing serious, empirical work with LLMs right now.

Extending to multi-turn reasoning: the Balatro benchmark

After building the NBA benchmark, I wanted something that captured multi-step decision-making over time, not just one-shot question answering.

If you're not familiar with Balatro, it's a roguelike deck-building card game with probabilistic outcomes, combinatorial card effects, and long-term planning requirements. That made it a great candidate environment.

I built a Balatro benchmark where models observe the current game state, decide on actions, and try to maximize performance over many turns. The results confirmed the same pattern: better models did better, but even the strongest models were still pretty terrible at the game.

It shows how far we still have to go in terms of planning, strategy, and long-horizon reasoning, even when models look impressive in chat-based settings.

What I'd do differently

  1. Smarter turn budgets. Give each question a turn budget proportional to its complexity. A simple team lookup gets 4 turns. A cross-season comparison with arithmetic gets 12. (A rough sketch of this follows the list.)
  2. Multiplicative efficiency scoring. Use final_score = accuracy * efficiency_factor to keep things on a clean 0-1 scale.
  3. Token-level cost tracking. Tool calls aren't the only cost signal. Tracking total tokens generated would give a more accurate picture of computational cost per question.
  4. More models. Running Claude 4.5 Sonnet, Qwen3, and other frontier models would validate whether the benchmark generalizes.
  5. Adaptive difficulty. A dynamic system that serves harder questions when the model is doing well.
  6. Efficiency as a first-class metric. Measure accuracy per tool call. A model that answers correctly in 3 calls is more valuable than one that answers correctly in 15.
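
As a rough illustration of the first two items, here's the shape a budget-plus-scoring scheme could take. The 4- and 12-turn numbers come from item 1; the medium tier and everything else are assumptions.

```python
# Illustrative only: per-question turn budgets keyed to difficulty, combined
# with multiplicative efficiency scoring on a clean 0-1 scale.
TURN_BUDGETS = {"easy": 4, "medium": 8, "hard": 12}  # medium is my interpolation

def score_question(difficulty: str, correct: bool, tool_calls: int) -> float:
    budget = TURN_BUDGETS[difficulty]
    accuracy = 1.0 if correct else 0.0
    # Discount answers that used more calls than budgeted; never reward a wrong one.
    efficiency = min(1.0, budget / max(tool_calls, 1))
    return accuracy * efficiency
```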

The full scorecard

| Stage | Design | Model | Accuracy |
|---|---|---|---|
| Baseline | 5 tools, 15 turns | GPT-4.1-mini | 43-50% |
| Baseline | 5 tools, 15 turns | GPT-5 | 86-89% |
| Efficiency | 5 tools, 8 turns | GPT-4.1-mini | 55%* |
| Efficiency | 5 tools, 8 turns | GPT-5 | 73% |
| Named players | 3 tools, 8 turns | GPT-4.1-mini | 96.9% |
| Named players | 3 tools, 8 turns | GPT-5 | 93.8% |
| AI-generated | 3 tools, 8 turns | Both | 11.7% |
| Final (capped) | 3 tools, 8 turns | GPT-4.1-mini | 19.4% |
| Final (capped) | 3 tools, 8 turns | GPT-5 | 45.2% |
| Final | 3 tools, 15 turns | GPT-4.1-mini | 35.5% |
| Final | 3 tools, 15 turns | GPT-5 | 80.6% |

* Inflated by the additive efficiency bonus.

The eight iterations

  01 5-tool kitchen sink: 43-50% / 86-89%
  02 Hard question rounds: data quality issues
  03 Dyson Daniels debugging: bug → reasoning challenge
  04 Efficiency experiment: additive scoring = broken
  05 3-tool redesign: right structure, too easy
  06 Named-player questions: 96.9%, trivial
  07 AI-generated questions: 11.7%, broken
  08 Programmatic + indirect questions: 35.5% / 80.6%

Closing thoughts

Benchmark design is itself an engineering problem. Every decision (how many tools to expose, whether to name players in questions, how many turns to allow, how to score efficiency) shapes what your benchmark measures and what it rewards. Get them wrong and you're measuring noise.

That last version finally separated the models in a way that felt meaningful. Not because the questions were tricky or adversarial, but because they required genuine multi-step reasoning through a minimal tool interface. The strong model pulls ahead not by making more calls, but by chaining them together correctly.

If you're interested in building your own RL environments, stress-testing agentic LLM setups, or contributing benchmarks that can actually move the needle, programs like Prime Intellect's are a great on-ramp. And if you do build a benchmark where frontier models struggle in interesting, measurable ways, you're not just making a leaderboard. You're helping define what "intelligence" means for the next generation of models.

NBABench environment on GitHub
Prime Intellect Environment Hub