AI Model Evaluation
LLM Model Benchmarks
Which evals matter, what they measure, and how to map benchmark evidence to coding, law, finance, automation, security, speed, and cost per task.
Updated 2026-06-01
Benchmark map
All benchmark types
A compact map of the evals worth knowing before comparing Claude, ChatGPT, Gemini, open-weight models, or any new frontier release.
Software engineering and agents
Codebase repair, fresh coding, terminal work, tool selection, and browser or OS agents.
Workflow automation and business work
Professional tasks, research, RAG, documents, and long-context operating work.
Expert domains
Law, finance, healthcare, math, and science-heavy model selection.
Risk, preference, and operations
Security, safety, live general quality, human preference, speed, and cost.
Selector
Pick the work, then pick the benchmarks
Choose the outcomes that matter. The selector returns the benchmark families that should carry the most weight.
Most relevant benchmark families
Repository issue repair (weight 1.40)
Whether an AI agent can understand a real codebase, modify files, and pass tests for a real issue or enterprise-like software task.
SWE-bench, SWE-bench Verified, SWE-bench Pro, Multi-SWE-bench
Terminal and developer-environment agents (weight 1.35)
Whether a model can operate in a shell, inspect files, run tools, debug failures, and persist through multi-step technical tasks.
Terminal-Bench, terminal-based agent evals
Long-context retrieval and memory (weight 1.25)
Whether a model can use long inputs, retrieve details, connect distant facts, and avoid losing important context.
RULER, InfiniteBench, GraphWalks, needle-in-haystack variants
Function calling and tool selection (weight 1.20)
Whether the model can choose the right tool, format calls correctly, chain calls, handle parallel calls, and recover from tool outputs.
Berkeley Function-Calling Leaderboard, MCP Atlas, Toolathlon
Fresh coding and contest programming (weight 1.15)
Algorithmic problem solving and code generation on recently collected or contamination-controlled programming tasks.
LiveCodeBench, BigCodeBench, code-generation leaderboards
Professional workflow execution (weight 1.15)
Whether a model can complete economically meaningful tasks such as document analysis, spreadsheets, forms, customer workflows, financial tasks, or office-style work.
GDPval, OfficeQA, AutomationBench, tau-bench, Finance Agent
Human preference arenas (weight 1.10)
Blind pairwise preference from users or judges, usually converted into Elo-style ratings. Good at measuring perceived helpfulness, conversational polish, and broad user taste.
LMArena / Chatbot Arena, domain-specific arenas
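To make "Elo-style ratings" concrete, here is a minimal sketch of the classic online Elo update applied to blind pairwise votes. It is an illustration only: the model names, the starting rating of 1000, and the K-factor of 32 are assumptions, and real arenas may fit ratings with a different statistical model.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise vote.

    outcome_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    The K-factor of 32 is an illustrative assumption.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical models and votes: (model A, model B, outcome for A).
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 0.0),
]
for a, b, outcome in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], outcome)
```

Because every update moves the two ratings by equal and opposite amounts, the rating gap reflects only which answers voters preferred, not whether the preferred answer was correct or the task was finished.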
Weight key
Weight method
Weights stay in a narrow range on purpose: 1.00 marks a supporting signal and 1.50 marks a primary signal. Preference-style evals track broad user taste best, execution-based repository and agent benchmarks matter more for finished work, and cost changes the model ranking only when Cost Per Task is selected.
- When multiple priorities are selected, matching benchmark weights are averaged rather than added, so combinations stay in the 1.00-1.50 range.
- Cost Per Task requires public pricing data. If a model has no usable public price row, it can still appear in quality tables but cannot win the cost selector.
- No public benchmark earns a giant multiplier because rankings drift by harness, scaffold, user preference, and deployment context.
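The averaging rule can be sketched as follows. This is one possible reading of the method, not the selector's actual code: the priority names, the per-priority weight values, and the helper `combined_weights` are all hypothetical.

```python
from statistics import mean

# Hypothetical per-priority weights for benchmark families (values are made up).
PRIORITY_WEIGHTS = {
    "coding": {
        "Repository issue repair": 1.50,
        "Terminal and developer-environment agents": 1.40,
        "Fresh coding and contest programming": 1.30,
    },
    "automation": {
        "Professional workflow execution": 1.50,
        "Function calling and tool selection": 1.40,
        "Terminal and developer-environment agents": 1.20,
    },
}


def combined_weights(selected: list[str]) -> dict[str, float]:
    """Average each family's weight over the selected priorities that list it."""
    collected: dict[str, list[float]] = {}
    for priority in selected:
        for family, weight in PRIORITY_WEIGHTS[priority].items():
            collected.setdefault(family, []).append(weight)
    # Averaging (not summing) keeps every result inside the 1.00-1.50 band.
    return {family: round(mean(ws), 2) for family, ws in collected.items()}


weights = combined_weights(["coding", "automation"])
```

In this sketch a family matched by two priorities gets the mean of its two weights, while a family matched by only one keeps that weight unchanged, so no combination can exceed 1.50.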
Repository issue repair: 1.40
Function calling and tool selection: 1.20
Terminal and developer-environment agents: 1.35
Fresh coding and contest programming: 1.15
Human preference arenas: 1.10
Long-context retrieval and memory: 1.25
Professional workflow execution: 1.15
FAQ
Quick answers
What is the most trustworthy LLM benchmark?
There is no single most trustworthy benchmark. The best benchmark is the one that closely matches the task, uses a transparent and consistent harness, is fresh enough to reduce contamination, and reports enough metadata to compare models fairly.
Should I trust vendor-published benchmark results?
Trust them as useful claims, not final evidence. Provider system cards and launch posts often have the freshest model data, but the result should be checked against benchmark-owner leaderboards, third-party runs, and local evaluations before it drives a production decision.
Why do two pages show different scores for the same model and benchmark?
Usually because the benchmark version, subset, scaffold, tools, effort level, context budget, number of attempts, or scoring rule changed. Same benchmark name does not guarantee same experiment.
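One way to check whether two published scores are comparable is to record the run settings listed above and diff them. The sketch below assumes a hypothetical `RunMetadata` record; the field names and example values are illustrative and do not come from any leaderboard schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunMetadata:
    """Hypothetical record of the settings behind one reported score."""
    benchmark_version: str
    subset: str
    scaffold: str
    tools_enabled: bool
    effort_level: str
    context_budget_tokens: int
    attempts: int
    scoring_rule: str


def metadata_differences(a: RunMetadata, b: RunMetadata) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every setting that differs."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(a)
        if getattr(a, f.name) != getattr(b, f.name)
    }


# Two made-up runs that share a benchmark name but not an experiment.
run_a = RunMetadata("v1", "full", "agentless", True, "high", 200_000, 1, "pass@1")
run_b = RunMetadata("v1", "full", "tool-loop", True, "high", 200_000, 3, "pass@1")
diff = metadata_differences(run_a, run_b)
```

An empty result suggests the two scores come from the same experiment; any non-empty result names the settings to reconcile before comparing the numbers.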
What benchmark matters most for coding agents?
Use several: SWE-bench Pro for repo-level repair, Terminal-Bench for shell/tool persistence, LiveCodeBench for fresh coding ability, and local repo tasks for the work that actually matters to your team.
How should token cost be evaluated?
Use cost per completed task, not price per million tokens alone. Count input, cached input, output, reasoning tokens when billed, tool costs, retries, long-context surcharges, latency, and the cost of human correction.
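As a rough illustration, cost per completed task can be computed by summing the cost of every attempt, including failed retries, and dividing by the number of tasks that actually finished. All prices and token counts below are made-up assumptions, reasoning tokens are assumed to be billed at the output rate, and long-context surcharges and latency are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    """Hypothetical prices in USD per million tokens."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float  # assumption: reasoning tokens bill at this rate


@dataclass
class Attempt:
    input_tokens: int          # uncached input tokens
    cached_input_tokens: int
    output_tokens: int
    reasoning_tokens: int
    tool_cost_usd: float
    succeeded: bool


def attempt_cost(a: Attempt, p: Prices) -> float:
    """Token and tool cost of a single attempt in USD."""
    token_cost = (
        a.input_tokens * p.input_per_m
        + a.cached_input_tokens * p.cached_input_per_m
        + (a.output_tokens + a.reasoning_tokens) * p.output_per_m
    ) / 1_000_000
    return token_cost + a.tool_cost_usd


def cost_per_completed_task(attempts, prices, human_correction_usd=0.0) -> float:
    """Total spend (failed retries included) divided by completed tasks."""
    total = sum(attempt_cost(a, prices) for a in attempts) + human_correction_usd
    completed = sum(1 for a in attempts if a.succeeded)
    return float("inf") if completed == 0 else total / completed


prices = Prices(input_per_m=2.0, cached_input_per_m=0.5, output_per_m=10.0)
attempts = [
    Attempt(100_000, 0, 10_000, 0, 0.0, succeeded=False),       # failed first try
    Attempt(50_000, 50_000, 10_000, 5_000, 0.01, succeeded=True),  # retry
]
cost = cost_per_completed_task(attempts, prices)
```

In this made-up case the failed first attempt more than doubles the cost of the one completed task, which is why a model with a lower list price can still be the more expensive choice.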
Inspect first
Sources
- Artificial Analysis API Reference
- BenchLM public leaderboard endpoint
- BenchLM public pricing endpoint
- Models.dev model database
- LMArena Leaderboard
- Chatbot Arena human preference paper
- MT-Bench and Chatbot Arena judge paper
- AlpacaEval length-controlled correlation notes
- LiveBench paper
- LiveCodeBench repository
- SWE-bench repository
- SWE-bench Pro by Scale AI
- Terminal-Bench 2.1
- Saving SWE-Bench realistic agent evaluation
- Berkeley Function-Calling Leaderboard
- OSWorld computer-use agent paper
- GDPval real-world work benchmark
- Personalized benchmark preference divergence
- Leaderboard Illusion benchmark audit
- LegalBench
- LegalBench-RAG
- FinanceBench
- Open FinLLM Leaderboard paper
- HealthBench by OpenAI
- CVE-Bench
- Cybench by Epoch AI
- OpenAI note on SWE-bench Verified saturation
Third-party data note: live rows come from public benchmark and pricing feeds, not internal Dreamers testing. Preview, alpha, beta, internal, research, and prototype rows are excluded before leaders are shown.