AI Model Evaluation

LLM Model Benchmarks

Which evals matter, what they measure, and how to map benchmark evidence to coding, law, finance, automation, security, speed, and cost per task.

Updated 2026-06-01

Benchmark map

All benchmark types

A compact map of the evals worth knowing before comparing Claude, ChatGPT, Gemini, open-weight models, or any new frontier release.

Software engineering and agents

Codebase repair, fresh coding, terminal work, tool selection, and browser or OS agents.

Workflow automation and business work

Professional tasks, research, RAG, documents, and long-context operating work.

Expert domains

Law, finance, healthcare, math, and science-heavy model selection.

Risk, preference, and operations

Security, safety, live general quality, human preference, speed, and cost.

Selector

Pick the work, then pick the benchmarks

Choose the outcomes that matter. The selector returns the benchmark families that should carry the most weight.

Select one or more priorities to rank models against the benchmark mix that matters for that work.

Most relevant benchmark families

Repository issue repair (weight 1.40)

Whether an AI agent can understand a real codebase, modify files, and pass tests for a real issue or enterprise-like software task.

SWE-bench, SWE-bench Verified, SWE-bench Pro, Multi-SWE-bench

Terminal and developer-environment agents (weight 1.35)

Whether a model can operate in a shell, inspect files, run tools, debug failures, and persist through multi-step technical tasks.

Terminal-Bench, terminal-based agent evals

Long-context retrieval and memory (weight 1.25)

Whether a model can use long inputs, retrieve details, connect distant facts, and avoid losing important context.

RULER, InfiniteBench, GraphWalks, needle-in-haystack variants

Function calling and tool selection (weight 1.20)

Whether the model can choose the right tool, format calls correctly, chain calls, handle parallel calls, and recover from tool outputs.

Berkeley Function-Calling Leaderboard, MCP Atlas, Toolathlon

Fresh coding and contest programming (weight 1.15)

Algorithmic problem solving and code generation on recently collected or contamination-controlled programming tasks.

LiveCodeBench, BigCodeBench, code-generation leaderboards

Professional workflow execution (weight 1.15)

Whether a model can complete economically meaningful tasks such as document analysis, spreadsheets, forms, customer workflows, financial tasks, or office-style work.

GDPval, OfficeQA, AutomationBench, tau-bench, Finance Agent

Human preference arenas (weight 1.10)

Blind pairwise preference from users or judges, usually converted into Elo-style ratings. Good at measuring perceived helpfulness, conversational polish, and broad user taste.

LMArena / Chatbot Arena, domain-specific arenas
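For readers unfamiliar with Elo-style ratings, here is a minimal sketch of the classic online Elo update applied to one blind pairwise vote. It is an illustration only: the starting rating of 1000 and the K-factor of 32 are assumptions, and a given arena may fit its ratings with a different statistical model rather than this update rule.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise vote.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k (assumed 32 here) controls how far one vote moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's change is the mirror image, so total rating is conserved.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two hypothetical models start level; model A wins one blind vote.
model_a, model_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

Because each vote only records which answer a person preferred, the resulting rating reflects perceived helpfulness and polish, not whether a task was completed correctly.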

Weight key

Weight method

Weights stay narrow on purpose: 1.00 is a supporting signal and 1.50 is a primary signal. Preference-style evals correlate best with broad user taste, execution-based repo and agent benchmarks matter more for finished work, and cost only changes model ranking when Cost Per Task is selected.

  • When multiple priorities are selected, matching benchmark weights are averaged rather than added, so combinations stay in the 1.00-1.50 range.
  • Cost Per Task requires public pricing data. If a model has no usable public price row, it can still appear in quality tables but cannot win the cost selector.
  • No public benchmark earns a giant multiplier because rankings drift by harness, scaffold, user preference, and deployment context.
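The averaging rule can be sketched as follows. This is an illustration under stated assumptions, not the selector's actual implementation: the priority names (`coding`, `automation`) and the mapping from priorities to benchmark families are hypothetical, and the weights are example values inside the 1.00-1.50 band.

```python
from collections import defaultdict

# Hypothetical table: priority -> {benchmark family: weight}.
# The priority names and this mapping are assumptions for illustration.
PRIORITY_WEIGHTS = {
    "coding": {
        "Repository issue repair": 1.40,
        "Terminal and developer-environment agents": 1.35,
        "Fresh coding and contest programming": 1.15,
    },
    "automation": {
        "Function calling and tool selection": 1.20,
        "Professional workflow execution": 1.15,
        "Repository issue repair": 1.10,
    },
}


def combine_weights(selected):
    """Average each family's weight over the selected priorities that include it."""
    buckets = defaultdict(list)
    for priority in selected:
        for family, weight in PRIORITY_WEIGHTS[priority].items():
            buckets[family].append(weight)
    # Averaging (not summing) keeps every combined weight inside the
    # same band as the individual weights.
    return {family: sum(ws) / len(ws) for family, ws in buckets.items()}


combined = combine_weights(["coding", "automation"])
for family, weight in sorted(combined.items(), key=lambda kv: -kv[1]):
    print(f"{weight:.2f}  {family}")
```

In this example "Repository issue repair" appears under both priorities, so its combined weight is the mean of 1.40 and 1.10. Summing instead would give 2.50 and let overlapping priorities dominate the ranking.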

  • Repository issue repair: 1.40
  • Terminal and developer-environment agents: 1.35
  • Long-context retrieval and memory: 1.25
  • Function calling and tool selection: 1.20
  • Fresh coding and contest programming: 1.15
  • Professional workflow execution: 1.15
  • Human preference arenas: 1.10

FAQ

Quick answers

What is the most trustworthy LLM benchmark?

There is no single most trustworthy benchmark. The best benchmark is the one that closely matches the task, uses a transparent and consistent harness, is fresh enough to reduce contamination, and reports enough metadata to compare models fairly.

Should I trust vendor-published benchmark results?

Trust them as useful claims, not final evidence. Provider system cards and launch posts often have the freshest model data, but the result should be checked against benchmark-owner leaderboards, third-party runs, and local evaluations before it drives a production decision.

Why do two pages show different scores for the same model and benchmark?

Usually because the benchmark version, subset, scaffold, tools, effort level, context budget, number of attempts, or scoring rule changed. The same benchmark name does not guarantee the same experiment.
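One way to make this check routine is to record run metadata next to every score and diff it before comparing numbers. The sketch below is a hypothetical record layout, not a format used by any leaderboard; the field names follow the list above and all example values are invented.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunMetadata:
    """Hypothetical record of the settings behind one reported score."""
    benchmark_version: str
    subset: str
    scaffold: str
    tools: tuple
    effort_level: str
    context_budget: int
    attempts: int
    scoring_rule: str


def metadata_differences(a: RunMetadata, b: RunMetadata) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every setting that differs."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(a)
        if getattr(a, f.name) != getattr(b, f.name)
    }


# Invented example: two pages reporting the "same" benchmark for one model.
page_one = RunMetadata("v1", "full", "scaffold-a", ("shell",), "high", 128_000, 1, "pass@1")
page_two = RunMetadata("v1", "full", "scaffold-b", ("shell",), "high", 128_000, 5, "pass@1")

diffs = metadata_differences(page_one, page_two)
comparable = not diffs  # only compare scores directly when nothing differs
```

Here the two runs differ in scaffold and number of attempts, so the scores describe different experiments and should not be ranked against each other.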

What benchmark matters most for coding agents?

Use several: SWE-bench Pro for repo-level repair, Terminal-Bench for shell/tool persistence, LiveCodeBench for fresh coding ability, and local repo tasks for the work that actually matters to your team.

How should token cost be evaluated?

Use cost per completed task, not price per million tokens alone. Count input, cached input, output, reasoning tokens when billed, tool costs, retries, long-context surcharges, latency, and the cost of human correction.
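A minimal sketch of the calculation, assuming you have logged token counts and outcomes for a batch of attempts. All prices, token counts, and success counts below are invented for illustration, and latency and long-context surcharges are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Hypothetical prices in USD per million tokens."""
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


@dataclass
class BatchStats:
    """Totals over every attempt in the batch, including retries."""
    completed: int               # tasks that actually finished correctly
    input_tokens: int            # uncached input tokens
    cached_input_tokens: int
    output_tokens: int           # include reasoning tokens when they are billed
    tool_cost_usd: float
    human_correction_usd: float


def cost_per_completed_task(stats: BatchStats, pricing: Pricing) -> float:
    """Total spend on the batch divided by the number of completed tasks."""
    if stats.completed == 0:
        return float("inf")  # nothing finished, so no finite cost per task
    token_cost = (
        stats.input_tokens * pricing.input_per_mtok
        + stats.cached_input_tokens * pricing.cached_input_per_mtok
        + stats.output_tokens * pricing.output_per_mtok
    ) / 1_000_000
    total = token_cost + stats.tool_cost_usd + stats.human_correction_usd
    return total / stats.completed


# Invented comparison: a cheaper model that fails more often
# versus a pricier model that needs less human correction.
cheap = cost_per_completed_task(
    BatchStats(40, 20_000_000, 10_000_000, 5_000_000, 2.0, 30.0),
    Pricing(1.0, 0.1, 4.0),
)
pricey = cost_per_completed_task(
    BatchStats(80, 20_000_000, 10_000_000, 5_000_000, 2.0, 5.0),
    Pricing(3.0, 0.3, 15.0),
)
```

With these invented numbers the pricier model comes out slightly cheaper per completed task, because failed attempts and human correction are charged against fewer successes on the cheaper model.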

Inspect first

Sources

Third-party data note: live rows come from public benchmark and pricing feeds, not internal Dreamers testing. Preview, alpha, beta, internal, research, and prototype rows are excluded before leaders are shown.