Failed to apply patch to container
Describe the bug
While extending the dataset, I often encountered cases where even the gold patch failed to apply.
SWE-bench attempts the following three apply commands in order:
GIT_APPLY_CMDS = [
    "git apply --verbose",
    "git apply --verbose --reject",
    "patch --batch --fuzz=5 -p1 -i",
]
A "failed apply" means that all three commands failed to apply the patch.
Is there some syntactic or formatting difference here? Could it be that patches from different time periods use different formats and therefore require different apply commands?
Steps/Code to Reproduce
Following the SWE-bench data collection pipeline, I used the gold patch for validation.
Expected Results
This gold patch was obtained using SWE-bench’s data collection pipeline. In theory, these patches come from merged commits and should apply cleanly.
Actual Results
However, the following error occurred:
2025-04-20 12:11:30,583 - INFO - Failed to apply patch to container: git apply --verbose
2025-04-20 12:11:30,613 - INFO - Failed to apply patch to container: git apply --verbose --reject
2025-04-20 12:11:30,646 - INFO - Failed to apply patch to container: patch --batch --fuzz=5 -p1 -i
2025-04-20 12:11:30,646 - INFO - >>>>> Patch Apply Failed:
The next patch would delete the file Makefile,
which does not exist! Assuming -R.
patching file Makefile
patching file README.md
Reversed (or previously applied) patch detected! Assuming -R.
patching file README.rst
Reversed (or previously applied) patch detected! Assuming -R.
patching file build/lib/fbbotw/__init__.py
patching file build/lib/fbbotw/fbbotw.py
Reversed (or previously applied) patch detected! Assuming -R.
patching file dist/fbbotw-1.0-py2.py3-none-any.whl
patching file dist/fbbotw-1.0.tar.gz
patching file docs/build/doctrees/environment.pickle
patching file docs/build/doctrees/index.doctree
patching file docs/build/html/_sources/index.txt
Reversed (or previously applied) patch detected! Assuming -R.
patching file docs/build/html/genindex.html
Reversed (or previously applied) patch detected! Assuming -R.
patching file docs/build/html/index.html
Reversed (or previously applied) patch detected! Assuming -R.
patching file docs/build/html/objects.inv
patching file docs/build/html/searchindex.js
Reversed (or previously applied) patch detected! Assuming -R.
patching file docs/source/functions.rst
Reversed (or previously applied) patch detected! Assuming -R.
patching file docs/source/index.rst
Reversed (or previously applied) patch detected! Assuming -R.
patching file fbbotw.egg-info/PKG-INFO
Hunk #1 FAILED at 1.
Hunk #2 FAILED at 97.
Hunk #3 FAILED at 136.
3 out of 3 hunks FAILED -- saving rejects to file fbbotw.egg-info/PKG-INFO.rej
patching file fbbotw/fbbotw.py
Reversed (or previously applied) patch detected! Assuming -R.
patching file setup.py
Reversed (or previously applied) patch detected! Assuming -R.
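The repeated "Reversed (or previously applied) patch detected" lines suggest the changes may already be present at the checked-out commit. A quick way to confirm that locally is a reverse dry-run (a sketch; patch.diff and the repo path are placeholders):

import subprocess

# Exit code 0 means the patch applies cleanly in reverse, i.e. its
# changes are already present in the working tree.
already_applied = subprocess.run(
    ["git", "apply", "--reverse", "--check", "patch.diff"],
    cwd="/path/to/repo",
).returncode == 0
print("already applied:", already_applied)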
System Information
No response
repair_patch
Does the gold patch obtained through the data collection pipeline require further processing in order to apply successfully?
What does repair_patch do, and can it help resolve this issue?
@lycfight I ran python swebench/harness/run_evaluation.py --predictions_path gold --dataset_name SWE-bench/SWE-bench and didn't see any cases where instances failed because the gold patch did not apply.
What specific instance IDs are you encountering this for? How often are you seeing this? With Modal, 300/300 Lite and 495/500 Verified instances are resolved with gold. Locally, I got 297 and 493. I didn't observe gold patch apply failures for the unresolved ones.
Thanks for the reply. I'd like to add that "Patch Apply Failed" occurs in the following two situations:
- When using some agent frameworks + LLMs to generate patches.
- When extending new repos following collection.md and applying the gold patch for validation.
In case 1, the failure might be due to formatting issues in the generated patch, which is understandable. However, in case 2, the patch is extracted from a commit that has already been merged—why would it result in a "Patch Apply Failed"? Are the apply commands used in SWE-bench fully equivalent to applying a patch from a commit that has already been merged on GitHub?
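One way to sanity-check this outside the container is to check out the instance's base_commit in a local clone and dry-run the gold patch (a sketch, assuming the clone already contains base_commit and that the instance provides base_commit and patch fields):

import subprocess
import tempfile

def check_gold_patch(repo_dir: str, base_commit: str, patch_text: str) -> bool:
    # Reset the working tree to the commit the patch was extracted against.
    subprocess.run(["git", "checkout", "-f", base_commit], cwd=repo_dir, check=True)
    with tempfile.NamedTemporaryFile("w", suffix=".diff", delete=False) as f:
        f.write(patch_text)
        patch_file = f.name
    # --check is a dry run: nothing is written to the working tree.
    result = subprocess.run(
        ["git", "apply", "--check", patch_file],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(result.stderr)
    return result.returncode == 0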
I expanded a batch of data and collected the instances where the gold patch failed to apply during validation into a dataset:
zengliangcs/SWE-Ours-temp
For the constants configuration, I used a default setup:
from collections import defaultdict

TEST_PYTEST_WO_DEPRECATION = (
    "pytest --no-header -rA --tb=no -p no:cacheprovider -W ignore::DeprecationWarning"
)

SPECS_PLACEHOLDER = {
    "-1.0": {
        "python": "3.9",
        "packages": "requirements.txt",
        "pip_packages": [
            "pytest",
            "distro",
            "pytest-cov",
            "pytest-xdist",
            "pytest-mock",
            "pytest-asyncio",
            "pytest-bdd",
            "pytest-benchmark",
            "pytest-randomly",
            "responses",
            "mock",
            "hypothesis",
            "freezegun",
            "trustme",
            "requests-mock",
            "requests",
            "tomlkit",
            "pre-commit",
            "setuptools==65.7.0",
            "pip",
            '"cython<3.0.0"',
        ],
        "install": "pip install --force-reinstall -e . || true; pip install -e .[test] || true; pip install -e .[testing] || true; pip install -e .[tests] || true; pip install -e .[dev] || true",
        "pre_install": ["apt update && apt install -y make gcc g++ pkg-config"],
        "test_cmd": TEST_PYTEST_WO_DEPRECATION,
    }
}

MAP_REPO_TO_REQS_PATHS_PLACEHOLDER = [
    "requirements.txt",
    "requirements-dev.txt",
    "requirements-test.txt",
    "requirements_test.txt",
    "requirements_dev.txt",
]

# MAP_REPO_TO_REQS_PATHS on the right-hand side refers to the mapping already
# defined in swebench's constants; wrapping it in a defaultdict makes unknown
# repos fall back to the placeholder paths above.
MAP_REPO_TO_REQS_PATHS = defaultdict(
    lambda: MAP_REPO_TO_REQS_PATHS_PLACEHOLDER, MAP_REPO_TO_REQS_PATHS
)
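With this setup, any newly extended repo that is missing from the original mapping falls back to the placeholder paths (illustration; the repo name is made up):

# "some-org/new-repo" is a hypothetical key not present in the mapping.
print(MAP_REPO_TO_REQS_PATHS["some-org/new-repo"])
# -> ['requirements.txt', 'requirements-dev.txt', ...]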
For the log_parsers configuration, I used a default setup:
from collections import defaultdict

from swebench.harness.log_parsers.python import MAP_REPO_TO_PARSER_PY, parse_log_pytest
# Assumption: the JS parser map lives alongside the Python one; adjust the
# import if your swebench version differs.
from swebench.harness.log_parsers.javascript import MAP_REPO_TO_PARSER_JS

MAP_REPO_TO_PARSER = defaultdict(
    lambda: parse_log_pytest,
    {**MAP_REPO_TO_PARSER_JS, **MAP_REPO_TO_PARSER_PY},
)
How did you recover the patches for your task instances in the first place? Did you use SWE-bench's collection script? (e.g. here)
If you used that script, then I'm not sure why the patch wouldn't apply.
For SWE-bench tasks, git apply example.diff usually works. The additional approaches + flags are to resolve things such as whitespace errors.
By the way, if you're trying to create a training dataset to train an LM to solve SWE tasks, I'd definitely recommend checking out SWE-smith (code).
SWE-bench's collection strategy is fairly complicated - SWE-smith's is much simpler without sacrificing anything in terms of quality of the task instances.
I followed the first two steps of the collection procedure (#collection-procedure) to obtain the patches.
Next, I needed to perform validation, which involves creating Docker images and retrieving PASS_TO_PASS and FAIL_TO_PASS.
The dataset zengliangcs/SWE-Ours-temp contains the data collected from the first two steps.
During the third step—validation—some instances resulted in Patch Apply Failed.
The commit version you provided lacks details on image creation and retrieving PASS_TO_PASS and FAIL_TO_PASS.
Therefore, I referred to issue #287 and implemented a run_validation based on run_evaluation.
Specifically, I set a unified version, constants, and log_parsers configuration for the extended repos.
I also optimized make_spec using parallel execution to reduce the time spent fetching requirements from GitHub in bulk, which requires setting the GITHUB_TOKENS environment variable:
export GITHUB_TOKENS="Your Tokens"
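The parallelization itself can be as simple as a thread pool over the instances (a rough sketch; the make_spec import path is an assumption and may differ across swebench versions):

from concurrent.futures import ThreadPoolExecutor

# Assumption: make_spec is the spec-building function used by the harness
# and is safe to call concurrently for independent instances.
from swebench.harness.test_spec import make_spec

def make_specs_parallel(instances, max_workers=16):
    # Each make_spec call may fetch requirements from GitHub, so the
    # threads mostly overlap network I/O.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(make_spec, instances))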
Then, you can run the validation step with:
python -m swebench.harness.run_validation --dataset_name zengliangcs/SWE-Ours-temp --split train --max_workers 16 --timeout 360 --cache_level instance --run_id 0507
You can find many Patch Apply Failed logs in the logs/run_validation/gold directory.
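To see how widespread the failures are, you can scan those logs directly (a sketch; the per-instance log filename is assumed to be run_instance.log):

from pathlib import Path

# Count instances whose harness log records a failed patch application.
failed = [
    p for p in Path("logs/run_validation/gold").rglob("run_instance.log")
    if "Patch Apply Failed" in p.read_text(errors="ignore")
]
print(f"{len(failed)} instances hit Patch Apply Failed")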
According to the patch application logic, the following commands were tried in sequence, but all failed:
GIT_APPLY_CMDS = [
    "git apply --verbose",
    "git apply --verbose --reject",
    "patch --batch --fuzz=5 -p1 -i",
]
It's possible that these commands are not fully equivalent to replaying the change that the merged commit actually made.
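One way to test that hypothesis is to regenerate the diff straight from git history and compare it with the collected gold patch (a sketch; base_commit and merge_commit are assumed to be known for the instance):

import subprocess

def diff_from_history(repo_dir: str, base_commit: str, merge_commit: str) -> str:
    # The diff git reports between the two commits; if this differs from the
    # collected patch, the collection step and git history have diverged.
    return subprocess.run(
        ["git", "diff", base_commit, merge_commit],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    ).stdout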
We haven't been experiencing these issues with users' submissions recently. I'm going to close this issue for now, but please open a new issue if you continue experiencing problems.