LLM Model Benchmarks

Benchmark map

All benchmark types

A compact map of the evals worth knowing before comparing Claude, ChatGPT, Gemini, open-weight models, or any new frontier release.

Software engineering and agents

Codebase repair, fresh coding, terminal work, tool selection, and browser or OS agents.

Workflow automation and business work

Professional tasks, research, RAG, documents, and long-context operating work.

Expert domains

Law, finance, healthcare, math, and science-heavy model selection.

Risk, preference, and operations

Security, safety, live general quality, human preference, speed, and cost.

Selector

Pick the work, then pick the benchmarks

Choose the outcomes that matter. The selector returns the benchmark families that should carry the most weight.

Most relevant benchmark families

Repository issue repair

Weight: 1.40

Whether an AI agent can understand a real codebase, modify files, and pass tests for a real issue or enterprise-like software task.

SWE-bench, SWE-bench Verified, SWE-bench Pro, Multi-SWE-bench

Terminal and developer-environment agents

Weight: 1.35

Whether a model can operate in a shell, inspect files, run tools, debug failures, and persist through multi-step technical tasks.

Terminal-Bench, terminal-based agent evals

Long-context retrieval and memory

Weight: 1.25

Whether a model can use long inputs, retrieve details, connect distant facts, and avoid losing important context.

RULER, InfiniteBench, GraphWalks, needle-in-haystack variants

Function calling and tool selection

Weight: 1.20

Whether the model can choose the right tool, format calls correctly, chain calls, handle parallel calls, and recover from tool outputs.

Berkeley Function-Calling Leaderboard, MCP Atlas, Toolathlon

Fresh coding and contest programming

Weight: 1.15

Algorithmic problem solving and code generation on recently collected or contamination-controlled programming tasks.

LiveCodeBench, BigCodeBench, code-generation leaderboards

Professional workflow execution

Weight: 1.15

Whether a model can complete economically meaningful tasks such as document analysis, spreadsheets, forms, customer workflows, financial tasks, or office-style work.

GDPval, OfficeQA, AutomationBench, tau-bench, Finance Agent

Human preference arenas

Weight: 1.10

Blind pairwise preference from users or judges, usually converted into Elo-style ratings. Good at measuring perceived helpfulness, conversational polish, and broad user taste.

LMArena / Chatbot Arena, domain-specific arenas
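To make "Elo-style ratings" concrete, here is a minimal sketch of the classic online Elo update applied to blind pairwise votes. It illustrates the general idea only: the model names, starting rating, and K-factor are hypothetical, and any given arena may fit its ratings differently.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after a single vote.

    outcome_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k (the K-factor) is an assumed value controlling how fast ratings move.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical models and votes, for illustration only.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", 1.0),  # model_x preferred
    ("model_x", "model_y", 1.0),  # model_x preferred
    ("model_x", "model_y", 0.0),  # model_y preferred
]
for a, b, outcome in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], outcome)

print(ratings)
```

Each update moves both ratings by the same amount in opposite directions, so a rating only has meaning relative to the other models in the same pool.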

Weight key

Weight method

Weights stay in a narrow range on purpose: 1.00 marks a supporting signal and 1.50 marks a primary signal. Preference-style evals correlate best with broad user taste, while execution-based repo and agent benchmarks matter more for finished work. Cost changes the model ranking only when Cost Per Task is selected.

When multiple priorities are selected, matching benchmark weights are averaged rather than added, so combinations stay in the 1.00-1.50 range.
Cost Per Task requires public pricing data. If a model has no usable public price row, it can still appear in quality tables but cannot win the cost selector.
No public benchmark earns a giant multiplier, because rankings drift with the harness, scaffold, user preference, and deployment context.

Repository issue repair: 1.40

Function calling and tool selection: 1.20

Terminal and developer-environment agents: 1.35

Fresh coding and contest programming: 1.15

Human preference arenas: 1.10

Long-context retrieval and memory: 1.25

Professional workflow execution: 1.15
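The averaging rule can be sketched as follows. This is one plausible reading of "averaged rather than added", not the selector's actual implementation. The weights come from the key above; the function names are hypothetical.

```python
from statistics import mean

# Weights copied from the weight key; the dictionary itself is illustrative.
FAMILY_WEIGHTS = {
    "Repository issue repair": 1.40,
    "Terminal and developer-environment agents": 1.35,
    "Long-context retrieval and memory": 1.25,
    "Function calling and tool selection": 1.20,
    "Fresh coding and contest programming": 1.15,
    "Professional workflow execution": 1.15,
    "Human preference arenas": 1.10,
}


def combined_weight(selected: list[str]) -> float:
    """Average the weights of the selected priorities (assumed rule)."""
    if not selected:
        raise ValueError("select at least one priority")
    return mean(FAMILY_WEIGHTS[name] for name in selected)


def added_weight(selected: list[str]) -> float:
    """The rejected alternative: summing lets combinations escape the range."""
    return sum(FAMILY_WEIGHTS[name] for name in selected)


picked = ["Repository issue repair", "Terminal and developer-environment agents"]
print(round(combined_weight(picked), 3))  # about 1.375, inside 1.00-1.50
print(round(added_weight(picked), 3))     # about 2.75, outside the range
```

Averaging keeps any combination between the smallest and largest selected weight, so stacking priorities cannot inflate a benchmark family beyond 1.50.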

FAQ

Quick answers

What is the most trustworthy LLM benchmark?

There is no single most trustworthy benchmark. The best benchmark is the one that closely matches the task, uses a transparent and consistent harness, is fresh enough to reduce contamination, and reports enough metadata to compare models fairly.

Should I trust vendor-published benchmark results?

Trust them as useful claims, not final evidence. Provider system cards and launch posts often have the freshest model data, but the result should be checked against benchmark-owner leaderboards, third-party runs, and local evaluations before it drives a production decision.

Why do two pages show different scores for the same model and benchmark?

Usually because the benchmark version, subset, scaffold, tools, effort level, context budget, number of attempts, or scoring rule changed. The same benchmark name does not guarantee the same experiment.
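One way to check whether two reported scores are comparable is to record the run settings listed above and diff them. The sketch below assumes a hypothetical `RunConfig` record; the field names and example values are illustrative and do not come from any real leaderboard schema.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical record of the settings behind one reported score."""
    benchmark: str
    version: str
    subset: str
    scaffold: str
    tools: tuple
    effort_level: str
    context_budget_tokens: int
    attempts: int
    scoring_rule: str


def config_differences(a: RunConfig, b: RunConfig) -> dict:
    """Return {field_name: (value_in_a, value_in_b)} for every mismatch."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(a)
        if getattr(a, f.name) != getattr(b, f.name)
    }


# Placeholder values: two runs that share a benchmark name only.
run_a = RunConfig(
    benchmark="example-bench", version="v1", subset="full",
    scaffold="scaffold-a", tools=("shell",), effort_level="high",
    context_budget_tokens=200_000, attempts=1, scoring_rule="pass@1",
)
run_b = replace(run_a, scaffold="scaffold-b", attempts=5)

diff = config_differences(run_a, run_b)
print(diff if diff else "comparable")
```

An empty result means the two scores came from the same recorded setup. Any non-empty result is a reason to treat the scores as separate experiments.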

What benchmark matters most for coding agents?

Use several: SWE-bench Pro for repo-level repair, Terminal-Bench for shell/tool persistence, LiveCodeBench for fresh coding ability, and local repo tasks for the work that actually matters to your team.

How should token cost be evaluated?

Use cost per completed task, not price per million tokens alone. Count input, cached input, output, reasoning tokens when billed, tool costs, retries, long-context surcharges, latency, and the cost of human correction.
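The calculation can be sketched as below. All prices and token counts are made-up placeholders, reasoning tokens are assumed to be billed at the output rate, and latency and long-context surcharges are left out for brevity. Retries are covered by counting tokens across every attempt, not only the successful ones.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Hypothetical prices in USD per million tokens."""
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


@dataclass
class TaskBatch:
    """Totals across ALL attempts, including retries and failures."""
    completed: int
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int
    reasoning_tokens: int        # assumed billed at the output rate
    tool_cost_usd: float
    human_correction_usd: float


def cost_per_completed_task(batch: TaskBatch, price: Pricing) -> float:
    """Total spend divided by tasks actually completed."""
    if batch.completed == 0:
        return float("inf")  # nothing finished, so no finite unit cost
    token_cost = (
        batch.input_tokens * price.input_per_mtok
        + batch.cached_input_tokens * price.cached_input_per_mtok
        + (batch.output_tokens + batch.reasoning_tokens) * price.output_per_mtok
    ) / 1_000_000
    total = token_cost + batch.tool_cost_usd + batch.human_correction_usd
    return total / batch.completed


price = Pricing(input_per_mtok=3.0, cached_input_per_mtok=0.30, output_per_mtok=15.0)
batch = TaskBatch(
    completed=80, input_tokens=2_000_000, cached_input_tokens=1_000_000,
    output_tokens=400_000, reasoning_tokens=100_000,
    tool_cost_usd=1.20, human_correction_usd=25.0,
)
print(round(cost_per_completed_task(batch, price), 4))
```

In this made-up batch, human correction costs more than all token spend combined, which is why the price per million tokens alone can mislead.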

Inspect first

Sources

Third-party data note: live rows come from public benchmark and pricing feeds, not internal Dreamers testing. Preview, alpha, beta, internal, research, and prototype rows are excluded before leaders are shown.
