OpenAI has released SWE-bench Verified, a carefully curated version of SWE-bench, a benchmark designed to measure how effectively artificial intelligence systems can tackle authentic software engineering tasks. The human-verified subset aims to provide more dependable assessments of AI model performance in real-world coding scenarios.

The original SWE-bench benchmark evaluates whether AI assistants can resolve legitimate bugs and implement requested features in actual open-source codebases: each task pairs a real GitHub issue with the repository's test suite, and a model's patch counts as a solution only if the relevant tests pass. However, like many large-scale benchmarks, it suffers from inconsistencies in how some problems are specified and validated, with certain issues being underspecified or scored by overly strict tests. The newly released verified subset addresses these limitations by incorporating human expert review.
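As a rough illustration of what a benchmark task looks like, the sketch below loads one instance from the verified subset. It assumes the Hugging Face `datasets` library and the publicly published dataset name `princeton-nlp/SWE-bench_Verified`; the field names reflect the public SWE-bench releases rather than anything stated in this announcement, and may differ between versions.

```python
# Minimal sketch: inspecting a single SWE-bench Verified task instance.
# Assumption: the dataset is available on Hugging Face under this name,
# with the field names used by the public SWE-bench releases.
from datasets import load_dataset

verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

task = verified[0]
print(task["instance_id"])        # identifier tying the task to a repo and issue
print(task["repo"])               # source repository the issue was filed against
print(task["base_commit"])        # commit to check out before applying a patch
print(task["problem_statement"])  # the issue text the model is asked to resolve

# A candidate solution is a patch applied at base_commit; it is scored by
# running the repository's tests after the patch is applied.
```

In this setup, human verification amounts to experts reviewing whether each `problem_statement` is actually solvable as written and whether the grading tests accept reasonable solutions.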

Why this matters: As AI coding tools become increasingly sophisticated, the need for rigorous and trustworthy evaluation methods grows more critical. A benchmark that has undergone human validation gives developers and researchers greater confidence when comparing different AI models' capabilities, and helps ensure that performance metrics reflect practical utility rather than quirks of the evaluation setup.

The human-verified subset represents OpenAI's commitment to maintaining high standards for AI benchmarking. By filtering SWE-bench through expert validation, the company is working to establish more reliable baselines that can meaningfully guide future development in AI-assisted software engineering.

Source: OpenAI News