LLM Benchmarks 2026 - Compare AI Benchmarks and Tests
Explore LLM and AI benchmarks to compare models across reasoning, coding, math, and more, independently verified.
Only curated links
A handpicked collection of benchmark sites for comparing AI models, coding agents and real-world performance.
Benchmark
Explore LLM and AI benchmarks to compare models across reasoning, coding, math, and more, independently verified.
LLM rankings and AI leaderboard based on benchmarks and real usage data from millions of users. See which AI models developers actually use.
Compare AI model performance on the Coding Index, which evaluates models' ability to solve programming problems, including those requiring scientific and research domain knowledge.
Private, domain-specific benchmarks in legal, tax, and finance.
Explore leaderboards with expert-driven LLM benchmarks and updated AI model rankings across coding, reasoning and more.
Comprehensive AI model benchmarks from Epoch AI and Scale AI. Compare GPT-5, Claude Opus 4, Gemini 2.5 Pro, Grok 4, and 30+ frontier models across 20 benchmarks including Humanity's Last Exam, FrontierMath, GPQA, SWE-bench, and more. Interactive comparison tool with live results.
Benchmark and evaluation details published at petergpt.github.io.
Compare the resolve rates of GPT-5.4, Muse Spark, Claude Opus 4.6, and Gemini 3.1 Pro on SWE-Bench Pro. A rigorous AI software engineering benchmark for...
SWE-rebench: A Continuously Evolving and Decontaminated Benchmark for Software Engineering LLMs.
See how leading AI models stack up across text, image, vision, and more. This page provides a high-level snapshot of each Arena. Explore dedicated tabs for deeper insights.
Benchmark
Comprehensive comparison of AI coding agents including Cursor, GitHub Copilot, Cline, Continue, and more. Compare IDE extensions, proprietary IDEs, CLI tools, and cloud platforms to find the best coding assistant for your development workflow.
Benchmark and evaluation details published at prarena.ai.
Looking for a Cursor, Copilot, or Windsurf alternative? See how Kilo Code compares to the top AI coding assistants — open source, 500+ models, zero markup, BYOK everywhere.
Compare LLM attempts at vibe coding the Przeprogramowani.pl website.