While artificial intelligence has shown significant progress on standard software engineering benchmarks over recent years, most existing evaluations focus primarily on short-term tasks such as implementing minor features or resolving simple bugs. To address the need for evaluating AI on complex, end-to-end development, Epoch AI, in collaboration with METR, has officially introduced MirrorCode. This novel benchmark is specially designed for evaluating the limits of AI models on long-horizon and autonomous code generation tasks.
A typical evaluation procedure for MirrorCode requires that the AI model completely reproduce a given program from scratch while having no access to its original source code. To succeed, the AI-generated software must perfectly replicate the original program’s behavior across a rigorous suite of end-to-end tests, including hidden, held-out evaluation datasets. The benchmark features 25 target programs spanning diverse areas of computer science, including Unix utilities, bioinformatics, cryptography, data serialization, interpreters, compression, and static analysis.
Redefining AI Evaluation Through Scale and Rigor
MirrorCode distinguishes itself from traditional code-generation benchmarks through several core methodologies:
- Scale-Informed Inference Budgets: Contrasting typical benchmarks which limit AI inference costs to $1-$10 per task even for those tasks that would have taken weeks of human effort, MirrorCode gives the model enough money and compute budget to perform difficult reasoning tasks. In one of the biggest experiments of the benchmark, the AI model ran autonomously for 19 days with no human interference and incurred a single-run inference cost of $2,600.
- Feasible but Difficult Tasks: Completely rebuilding software programs is such a great barrier for human programmers that it might take months of individual work for the hardest tasks. Nevertheless, MirrorCode ensures fairness by making sure that each task has enough context for being solvable.
- Hard-to-Cheat Sandboxing: For maintaining the highest level of evaluation integrity, AI models are run in sandbox environments which do not have access to the internet or original code bases. Since the models don’t have access to the unseen end-to-end validation data during training, they can’t cheat by hardcoding a lookup table of the program’s outputs.
Also Read: Tigera Unveils Lynx: A Unified Control Plane for Kubernetes-Native AI Agents
Current AI Models Demonstrate Autonomous Capabilities, Room for Growth Remains
Early benchmark results indicate that frontier AI models possess the capability to execute complex, long-horizon software engineering tasks autonomously. During testing, Claude Opus 4.7 successfully recreated gotree a bioinformatics toolkit comprising roughly 16,000 lines of Go code and over 40 distinct commands in just 14 hours at a cost of $251. By contrast, an unaided human engineer will need somewhere between 2 and 17 weeks to accomplish the same task.
With all this said, it turns out that MirrorCode shows how far autonomous software engineering is yet from being cracked. Claude Opus 4.7 earned its best headline accuracy rating at 56%. Other tested frontier models included GPT-5.5, which recorded a 44% solve rate, and Gemini 3.1 Pro Preview at 32%. Across 21 of the 25 target programs, AI models succeeded in passing at least 99% of the validation tests in at least one attempt, though 8 of the strictest 100%-pass targets have yet to be solved in any run.
The data also reveals a rapid trajectory of capability gains; top-tier models from the previous year averaged a solve rate of only 30% and were strictly limited to simpler applications, such as basic calendar utilities. Interestingly, inference cost trends varied by provider; GPT-5.5 cost three times more than GPT-5 to complete identical tasks, whereas Claude Opus 4.7 operated at a threefold cost reduction compared to Claude Opus 4.1.
Addressing Data Contamination and Generalization
A critical consideration in benchmarking open-source software replication is data contamination, as AI models may have encountered the original target codebases during their initial pretraining phases. To counter this, Epoch AI implemented a dedicated memorization screen. Testing showed that AI models successfully reimplemented several target programs that passed the screening process, while failing on programs that exhibited signs of memorization. This indicates that benchmark performance is driven by genuine problem-solving capabilities rather than mere rote memorization, suggesting that these autonomous software engineering skills will generalize effectively to entirely novel, unseen codebases.
Open-Source Availability
To foster collaborative industry advancement, Epoch AI has open-sourced its evaluation scaffold alongside 22 of the 25 MirrorCode target programs. This release encompasses 132 distinct task instances across six supported programming languages. The remaining three target programs are withheld as a private, secure test set for future model evaluations.
This research initiative was co-developed with METR and made possible through a grant provided by METR. The principal authors of the MirrorCode benchmark are Tom Adamczewski, David Owen, and David Rein. Additional target programs were contributed by Florian Brand, Giles Edkins, Allen Hart, and Daniel O’Connell, with core infrastructure optimizations and engineering guidance provided by Rasmus Faber-Espensen.


