Only curated links

AI Coding Benchmark Directory

A handpicked collection of benchmark sites for comparing AI models, coding agents and real-world performance.

Benchmark

AI Coding Models

deepswe.datacurve.ai

DeepSWE

DeepSWE measures frontier coding agents on original, long-horizon software engineering tasks.

Visit deepswe.datacurve.ai ->

llm-stats.com

AI Benchmarks 2026: Compare 300+ LLM Benchmarks & Tests

Compare AI and LLM benchmarks across reasoning, coding, math, vision and tool use. Every benchmark has a live leaderboard ranking 300+ models by independently verified score.

Visit llm-stats.com ->

openrouter.ai

LLM Rankings | OpenRouter

LLM rankings and AI leaderboard based on benchmarks and real usage data from millions of users. See which AI models developers actually use.

Visit openrouter.ai ->

artificialanalysis.ai

Coding Index | Artificial Analysis

Compare AI model performance on the Coding Index, which evaluates models' ability to solve programming problems, including those requiring scientific and research domain knowledge.

Visit artificialanalysis.ai ->

vals.ai

Vals AI

Private, domain-specific benchmarks in legal, tax, and finance.

Visit vals.ai ->

labs.scale.com

AI Model Leaderboards & Benchmarks

Explore leaderboards with expert-driven LLM benchmarks and updated AI model rankings across coding, reasoning and more.

Visit labs.scale.com ->

lmcouncil.ai

AI Model Benchmarks Jun 2026 | Compare GPT-5.5, Claude Opus, Gemini 3, Grok 4 | LM Council

Comprehensive AI model benchmarks from Epoch AI and Scale AI. Compare GPT-5.5, Claude Opus, Gemini 3, Grok 4, and 30+ frontier models across curated benchmarks including Humanity's Last Exam, FrontierMath, GPQA, SWE-bench, and more. Interactive comparison tool with current results.

Visit lmcouncil.ai ->

petergpt.github.io

BullshitBench: V2 (New) Viewer

Explore benchmark and evaluation details from petergpt.github.io in a focused external resource.

Visit petergpt.github.io ->

labs.scale.com

SWE-Bench Pro (Public Dataset)

Compare the resolve rates of GPT-5.4, Muse Spark, Claude Opus 4.6, and Gemini 3.1 Pro on SWE-Bench Pro. A rigorous AI software engineering benchmark for...

Visit labs.scale.com ->

swe-rebench.com

SWE-rebench Leaderboard

SWE-rebench: A Continuously Evolving and Decontaminated Benchmark for Software Engineering LLMs.

Visit swe-rebench.com ->

arena.ai

Arena Leaderboard | Compare & Benchmark the Best Frontier AI Models

See how leading AI models stack up across text, image, vision, and more. This page provides a high-level snapshot of each Arena. Explore dedicated tabs for deeper insights.

Visit arena.ai ->

bridgebench.ai

Explore benchmark and evaluation details from bridgebench.ai in a focused external resource.

Visit bridgebench.ai ->

Benchmark

AI Coding Agents

artificialanalysis.ai

Coding Agents Comparison: Cursor, Claude Code, GitHub Copilot, and more

Comprehensive comparison of AI coding agents including Cursor, GitHub Copilot, Cline, Continue, and more. Compare IDE extensions, proprietary IDEs, CLI tools, and cloud platforms to find the best coding assistant for your development workflow.

Visit artificialanalysis.ai ->

artificialanalysis.ai

AI Coding Agent Benchmarks & Leaderboard | Artificial Analysis

We measure real-world performance of coding agents on software engineering tasks, including cost, token usage, and execution time. We compare how performance changes across agents, models, and execution settings.

Visit artificialanalysis.ai ->

prarena.ai

PR Arena - AI Coding Agent Leaderboard

Explore benchmark and evaluation details from prarena.ai in a focused external resource.

Visit prarena.ai ->

kilo.ai

Kilo Code Alternatives — Cursor, Copilot, Windsurf & More Compared

Looking for a Cursor, Copilot, or Windsurf alternative? See how Kilo Code compares to the top AI coding assistants — open source, 500+ models, zero markup, BYOK everywhere.

Visit kilo.ai ->

10xbench.ai

10xBench — LLM Coding Benchmark in Astro, React, Tailwind & Cloudflare

See how leading LLMs perform when each gets a single shot at building a real production website in Astro, React, Tailwind and Cloudflare. Compare scores, screenshots and generated code side by side.

Visit 10xbench.ai ->