Only curated links

AI Coding Benchmark Directory

A handpicked collection of benchmark sites for comparing AI models, coding agents and real-world performance.

Benchmark

AI Coding Models

llm-stats.com

LLM Benchmarks 2026 - Compare AI Benchmarks and Tests

Explore LLM and AI benchmarks to compare models across reasoning, coding, math, and more, with independently verified results.

Visit llm-stats.com ->

openrouter.ai

LLM Rankings | OpenRouter

LLM rankings and AI leaderboard based on benchmarks and real usage data from millions of users. See which AI models developers actually use.

Visit openrouter.ai ->

artificialanalysis.ai

Coding Index | Artificial Analysis

Compare AI model performance on Coding Index. Evaluates models' ability to solve programming problems, including those requiring scientific and research domain knowledge.

Visit artificialanalysis.ai ->

vals.ai

Vals AI

Private, domain-specific benchmarks in legal, tax, and finance.

Visit vals.ai ->

labs.scale.com

AI Model Leaderboards & Benchmarks

Explore leaderboards with expert-driven LLM benchmarks and updated AI model rankings across coding, reasoning and more.

Visit labs.scale.com ->

lmcouncil.ai

AI Model Benchmarks May 2026 | Compare GPT-5, Claude 4.5, Gemini 2.5, Grok 4 | LM Council

Comprehensive AI model benchmarks from Epoch AI and Scale AI. Compare GPT-5, Claude Opus 4, Gemini 2.5 Pro, Grok 4, and 30+ frontier models across 20 benchmarks including Humanity's Last Exam, FrontierMath, GPQA, SWE-bench, and more. Interactive comparison tool with live results.

Visit lmcouncil.ai ->

petergpt.github.io

BullshitBench: V2 (New) Viewer

Viewer for BullshitBench V2 benchmark results and evaluation details, hosted at petergpt.github.io.

Visit petergpt.github.io ->

labs.scale.com

SWE-Bench Pro (Public Dataset)

Compare the resolve rates of GPT-5.4, Muse Spark, Claude Opus 4.6, and Gemini 3.1 Pro on SWE-Bench Pro. A rigorous AI software engineering benchmark for...

Visit labs.scale.com ->

swe-rebench.com

SWE-rebench Leaderboard

SWE-rebench: A Continuously Evolving and Decontaminated Benchmark for Software Engineering LLMs.

Visit swe-rebench.com ->

arena.ai

Arena Leaderboard | Compare & Benchmark the Best Frontier AI Models

See how leading AI models stack up across text, image, vision, and more. This page provides a high-level snapshot of each Arena. Explore dedicated tabs for deeper insights.

Visit arena.ai ->

Benchmark

AI Coding Agents