Awesome-Code-Benchmark
A comprehensive review of code-domain benchmarks for LLM research.
News
- 🔥🔥 [2025-09-22] Featured Benchmarks:
  - 🔥LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering from Salesforce AI Research
  - 🔥CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects from Ant Group
- 🔥🔥 [2025-08-29] Featured Benchmarks:
  - 🔥A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code from Tencent
- 🔥🔥 [2025-08-22] Featured Benchmarks:
  - 🔥TRACY: Benchmarking Execution Efficiency of LLM-Based Code Translation from Peking University
  - 🔥BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models from University of Science and Technology of China
- 🔥🔥 [2025-08-16] Featured Benchmarks:
  - 🔥AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators from Hunyuan Team, Tencent
  - 🔥Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes from Beihang University
  - 🔥STEPWISE-CODEX-Bench: Evaluating Complex Multi-Function Comprehension and Fine-Grained Execution Reasoning from ByteDance
- 🔥🔥 [2025-07-23] Featured Benchmarks:
  - 🔥SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories? from Xi’an Jiaotong University and TikTok
  - 🔥CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks from ASUS Intelligent Cloud Services
  - 🔥Multilingual Multimodal Software Developer for Code Generation from Beihang University
  - 🔥CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance from Amazon Web Services
  - 🔥SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks from SberAI
  - 🔥IFEvalCode: Controlled Code Generation from Beihang University
  - 🔥Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security from Government Technology Agency
  - 🔥MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts? from University of Illinois Urbana-Champaign
  - 🔥Turning the Tide: Repository-based Code Reflection from Beihang University
- 🔥🔥 [2025-07-13] Featured Benchmarks:
  - 🔥CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks from Purdue University
  - 🔥ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation from Tencent Hunyuan Team
  - 🔥CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark from Shanghai Jiao Tong University
  - 🔥Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs from Provable Responsible AI and Data Analytics (PRADA) Lab
  - 🔥Model Editing for LLMs4Code: How Far are We? from National University of Defense Technology
  - 🔥VeriBench: Benchmarking Large Language Models for Verilog Code Generation and Design Synthesis from Indian Institute of Technology Gandhinagar
  - 🔥ResBench: Benchmarking LLM-Generated FPGA Designs with Resource Awareness from Imperial College London, United Kingdom
  - 🔥Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation from Chinese Academy of Sciences
- [2025-04-18] We added GitHub stars for each benchmark.
- [2025-04-13] We added Code Security & Robustness benchmarks.
- [2025-04-06] We added Code Hallucination benchmarks.
- [2025-03-29] We crawled all articles related to code benchmarks from the past five years.
- [2025-03-17] We added Code Version (version-specific code generation) benchmarks.
- [2025-03-16] A thorough review of code-domain benchmarks for LLM research has been released.

Table of Contents
- Code Completion & Code Generation
- Code Efficiency
- CodeFix & Bug-Fix
- Code Reasoning & Understanding
- Code Hallucination
- Data Science
- Text2SQL
- MultiModal Code Tasks
- Code Security & Robustness
- Code Translation
- Code Version
- Multi & Other Dimensions
- Industry Code Generation
Survey
- Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents from Xi’an Jiaotong University
- Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks from Zhejiang University
- A Survey on Large Language Model Benchmarks from Shenzhen Key Laboratory for High Performance Data Mining
🚀 Top Code Benchmarks
Code Completion & Code Generation
Code Efficiency
CodeFix & Bug-Fix
Code Reasoning & Understanding
Code Hallucination
| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|---|---|---|---|---|
| HALLUCODE | Exploring and Evaluating Hallucinations in LLM-Powered Code Generation | arXiv 2024/04 | | |
| Collu-Bench | Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code | arXiv 2024/10 | | 🤗Dataset |
| CodeHalu | CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification | AAAI 2025 | Github | 🤗Dataset |
| APIHulBench | Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware | FSE 2025 | Github | |
| THINK | THINK: Tackling API Hallucinations in LLMs via Injecting Knowledge | SANER 2025 | Github | 🤗Dataset |
Data Science
| Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
|---|---|---|---|---|
| DS-1000 | DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation | ICML 2023 | Github | 🤗Dataset 🌐HomePage |
| ARCADE | Natural Language to Code Generation in Interactive Data Science Notebooks | ACL 2023 | Github | Dataset |
| DA-Code | DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models | EMNLP 2024 | Github | 🤗Dataset 🌐Website |
| MatPlotBench | MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization | ACL 2024 Findings | Github | 🤗Dataset |
| DataSciBench | DataSciBench: An LLM Agent Benchmark for Data Science | arXiv 2025/02 | Github | |
| DSBench | DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? | ICLR 2025 | Github | 🤗Dataset |
| DSCodeBench | DS-Bench: A Realistic Benchmark for Data Science Code Generation | arXiv 2025/05 | Github | |
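Many of the datasets above are distributed through the Hugging Face Hub (the 🤗Dataset links). As a minimal sketch of getting started, assuming the `datasets` library and the `xlangai/DS-1000` Hub path (verify the exact path and split against the table's link), a benchmark can be pulled for local inspection like so:

```python
# Minimal sketch: load one of the benchmarks above for local inspection.
# Assumptions (verify before use): the Hugging Face `datasets` library is
# installed (`pip install datasets`) and DS-1000 is hosted at "xlangai/DS-1000".
from datasets import load_dataset

ds = load_dataset("xlangai/DS-1000", split="test")

print(f"{len(ds)} problems loaded")  # DS-1000 ships 1000 data-science problems
print(sorted(ds[0].keys()))          # fields of one record (prompt, reference, metadata)
```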