OpenAI has stopped using SWE-bench Verified to evaluate advanced coding capabilities, concluding that the benchmark is no longer a reliable measurement tool. The decision stems from growing concerns about the benchmark's integrity and its ability to accurately reflect real progress in AI-assisted software engineering.

The core problem: data contamination

Extensive analysis revealed that SWE-bench Verified suffers from contamination issues significant enough to undermine its validity as an assessment metric. The benchmark contains flawed test cases that do not reliably measure what they are intended to measure. Researchers also found evidence of training data leakage, meaning models may have encountered the same or similar problems during training, allowing them to score well without genuinely solving novel challenges.
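
To make the leakage concern concrete, here is a minimal sketch of one way overlap between benchmark text and training text can be probed. It is an illustration only, not OpenAI's methodology: the n-gram size, the 0.5 threshold, and the sample strings are all hypothetical.

```python
# Illustrative n-gram overlap probe for training-data leakage.
# The texts and threshold below are hypothetical; real contamination
# audits are considerably more involved than this sketch.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_text: str, training_text: str, n: int = 8) -> float:
    """Fraction of the benchmark text's n-grams that also occur in the training text."""
    bench = ngrams(benchmark_text, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(training_text, n)) / len(bench)

# Hypothetical usage: flag a benchmark instance whose problem statement
# overlaps heavily with a sample drawn from a training corpus.
problem_statement = "ValueError raised when parsing empty config file in load_settings ..."
training_sample = "... ValueError raised when parsing empty config file in load_settings ..."

if overlap_ratio(problem_statement, training_sample) > 0.5:
    print("possible contamination: high n-gram overlap with training data")
```

High overlap by this kind of measure does not prove memorization, but it is the sort of signal that makes strong benchmark scores hard to interpret as genuine capability.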

What this means for the field

These findings suggest that performance gains observed on SWE-bench Verified may not represent authentic advances in frontier coding capability. Instead, they could reflect a model's familiarity with contaminated test data rather than genuine improvement in software engineering ability.

A recommended alternative

In light of these limitations, OpenAI recommends that researchers and developers transition to SWE-bench Pro as a more trustworthy evaluation framework. This alternative benchmark is designed to address the shortcomings that plague the Verified version, offering a cleaner, more reliable method for assessing the true capabilities of coding-focused AI systems.
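
For teams making the switch, migration is largely a matter of pointing an existing evaluation harness at the new dataset. The sketch below loads instances with the Hugging Face datasets library; the dataset identifier and field names are assumptions based on how SWE-bench-style datasets are typically published, not confirmed details of SWE-bench Pro.

```python
# Hypothetical sketch: loading a SWE-bench-style dataset for evaluation.
# The dataset identifier and field names are assumptions; check the
# official SWE-bench Pro release for the actual ones.
from datasets import load_dataset

# SWE-bench-style datasets typically expose one row per repository issue.
dataset = load_dataset("ScaleAI/SWE-bench_Pro", split="test")  # identifier assumed

for instance in dataset.select(range(3)):
    # Typical fields: a unique instance id and the source repository.
    print(instance["instance_id"], instance["repo"])
```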

Source: OpenAI News