OpenAI has rolled out a new benchmarking system aimed at measuring how effectively artificial intelligence agents can identify and repair security flaws in crypto smart contracts.
Announced on February 18, the system, called EVMbench, was developed in collaboration with crypto investment firm Paradigm. It focuses on applications built for the Ethereum Virtual Machine (EVM), the execution environment that powers many blockchain-based financial applications.
Introducing EVMbench—a new benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities. https://t.co/op5zufgAGH
— OpenAI (@OpenAI) February 18, 2026
OpenAI said the tool was designed to reflect real-world financial risks, noting that smart contracts currently safeguard more than $100 billion in open-source crypto assets. As AI systems grow more advanced, the company said, it is critical to evaluate how they perform in high-stakes security environments.
Benchmarking AI against real-world vulnerabilities
EVMbench tests AI agents across three core functions: detecting vulnerabilities, patching flawed code and executing simulated exploits. The dataset includes 120 high-severity issues drawn from 40 previous security audits, many sourced from public audit competitions.
Additional case studies were incorporated from reviews of the Tempo blockchain, a payments-focused network built for stablecoin transactions, to better reflect financial use cases.
To simulate attacks safely, OpenAI adapted existing exploit scripts and built new ones where necessary. All tests are conducted in isolated environments, and only previously disclosed vulnerabilities are used.
In detection mode, AI models analyze contract code to flag known weaknesses. In patch mode, they must correct those flaws without disrupting functionality. In exploit mode, agents attempt to drain funds from vulnerable contracts within a controlled sandbox.
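The three modes described above map naturally onto distinct scoring functions. The sketch below is a hypothetical illustration of how such a harness might grade an agent per mode; the function names, data shapes, and scoring rules are assumptions for illustration, not OpenAI's actual EVMbench implementation.

```python
from dataclasses import dataclass

# Hypothetical scoring sketch for the three benchmark modes.
# All names and scoring rules here are illustrative assumptions,
# not the real EVMbench harness.

@dataclass
class Case:
    contract_id: str
    known_flaws: set[str]  # ground-truth vulnerability labels for this contract

def score_detection(flagged: set[str], case: Case) -> float:
    """Detection mode: fraction of known flaws the agent flagged (recall)."""
    if not case.known_flaws:
        return 1.0
    return len(flagged & case.known_flaws) / len(case.known_flaws)

def score_patch(tests_pass: bool, flaw_fixed: bool) -> float:
    """Patch mode: credit only if the flaw is fixed AND functionality survives."""
    return 1.0 if (tests_pass and flaw_fixed) else 0.0

def score_exploit(drained_wei: int, target_wei: int) -> float:
    """Exploit mode: did the sandboxed agent drain the target balance?"""
    return 1.0 if drained_wei >= target_wei else 0.0

if __name__ == "__main__":
    case = Case("vault-01", {"reentrancy"})
    print(score_detection({"reentrancy", "integer-overflow"}, case))  # 1.0
    print(score_patch(tests_pass=True, flaw_fixed=True))              # 1.0
    print(score_exploit(drained_wei=0, target_wei=10**18))            # 0.0
```

Note how patch mode is the strictest of the three: a fix that breaks existing functionality scores zero, which mirrors the article's point that patching without disrupting behavior is harder than a single-objective exploit.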
Early results highlight strengths and limits
Using a custom-built evaluation framework, OpenAI tested several advanced models. In exploit scenarios, GPT-5.3-Codex scored 72.2%, a sharp increase from GPT-5's 31.9% result six months earlier. However, detection and patching tasks proved more challenging, underscoring the complexity of nuanced code review.
Researchers found that AI agents performed best when objectives were explicit, such as extracting funds, but struggled with broader analytical tasks involving large codebases.
The company said the benchmark is intended to strengthen defensive cybersecurity efforts. Alongside the release, OpenAI pledged $10 million in API credits to support open-source security initiatives and made EVMbench’s tools and datasets publicly available for further research.






































































































