Video Coding Benchmarks

12d

How Anthropic’s Fable 5 Beat ChatGPT 5.5 by 20% in Coding Benchmarks

Anthropic has launched Claude Fable 5, a Mythos-class AI model that outperforms GPT 5.5 in coding and vision tasks despite ...

27d

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

DeepSWE puts GPT-5.5 atop the AI coding leaderboard while raising new questions about Claude Opus, SWE-Bench Pro, and benchmark leakage.

i-SCOOP

Kimi K2.7 Code, the open-weight coding model that thinks 30% less

A deep dive into Kimi K2.7 Code from Moonshot AI: architecture, benchmarks, pricing, and how to put its 256K context and ...

Hosted on MSN

What AI coding benchmarks still miss about software quality

Most AI coding benchmarks still ask the question: did the agent produce code that passes the current tests? This is a useful question, but it is too narrow. Software development is iterative.

Geeky Gadgets

Anthropic Claude Opus 4.5 Tops Coding Benchmarks While Slashing Token Use

What if the future of coding wasn’t human, but instead powered by an AI so advanced it could outpace even the most skilled developers? Enter Claude Opus 4.5, a model that doesn’t just assist with ...

Bleeping Computer

Grok 4 benchmark results: Tops math, ranks second in coding

Grok 4 is a huge leap from Grok 3, but how good is it compared to other models in the market, such as Gemini 2.5 Pro? We now have answers, thanks to new independent benchmarks. LMArena.ai, which is an ...

GIGAZINE

DeepSWE is a benchmark that prevents cheating by coding AI and allows more accurate measurement of programming performance.

In recent years, it has become common for developers to use coding AI in software development, and various benchmarks exist to measure the performance of coding AI. Now, a new benchmark called ...
