SWE-bench
Error occurred when executing `create_text_dataset`.
Describe the bug
I want to try running create_text_dataset to create a dataset similar in format to princeton-nlp/SWE-bench_oracle. I followed the steps in the documentation to run the command, but an error occurred.
Steps/Code to Reproduce
The same as the steps in the documentation.
python -m swebench.inference.make_datasets.create_text_dataset --dataset_name_or_path princeton-nlp/SWE-bench --output_dir ./base_datasets --prompt_style style-3 --file_source oracle
Expected Results
I believe it should successfully generate data in a format similar to princeton-nlp/SWE-bench_oracle, which includes the text field.
Actual Results
(SWE-bench) root@cpu01-2050-SWE-bench:~/SWE-bench# python -m swebench.inference.make_datasets.create_text_dataset --dataset_name_or_path princeton-nlp/SWE-bench --output_dir ./base_datasets --prompt_style style-3 --file_source oracle
2025-04-07 17:54:32,438 - datasets - INFO - PyTorch version 2.6.0 available.
2025-04-07 17:54:34,120 - swebench.inference.make_datasets.tokenize_dataset - WARNING - Disabling caching
2025-04-07 17:54:36,563 - __main__ - INFO - Found {'dev', 'test', 'train'} splits
2025-04-07 17:54:36,563 - __main__ - INFO - Processing train split
2025-04-07 17:54:37,937 - swebench.inference.make_datasets.create_instance - INFO - Found 75 already processed instances
2025-04-07 17:54:37,939 - swebench.inference.make_datasets.create_instance - INFO - Processing 18933 instances
Processing instances: 0%| | 0/18933 [00:00<?, ?it/s]Failed on instance Lightning-AI__lightning-1108 Cmd('git') failed due to: exit code(128)
cmdline: git clone -v -- https://*****@github.com/swe-bench-repos/Lightning-AI__lightning.git /tmp/tmpkdmsrcmg/Lightning-AI__lightning
stderr: 'Cloning into '/tmp/tmpkdmsrcmg/Lightning-AI__lightning'...
remote: Repository not found.
fatal: repository 'https://github.com/swe-bench-repos/Lightning-AI__lightning.git/' not found
'
Traceback (most recent call last):
File "/root/SWE-bench/swebench/inference/make_datasets/create_instance.py", line 410, in add_text_inputs
with AutoContextManager(instance, root_dir, verbose=verbose) as cm:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/SWE-bench/swebench/inference/make_datasets/utils.py", line 203, in __init__
Repo.clone_from(repo_url, repo_dir)
File "/root/miniforge3/envs/SWE-bench/lib/python3.12/site-packages/git/repo/base.py", line 1541, in clone_from
return cls._clone(
^^^^^^^^^^^
File "/root/miniforge3/envs/SWE-bench/lib/python3.12/site-packages/git/repo/base.py", line 1412, in _clone
finalize_process(proc, stderr=stderr)
File "/root/miniforge3/envs/SWE-bench/lib/python3.12/site-packages/git/util.py", line 504, in finalize_process
proc.wait(**kwargs)
File "/root/miniforge3/envs/SWE-bench/lib/python3.12/site-packages/git/cmd.py", line 834, in wait
raise GitCommandError(remove_password_if_present(self.args), status, errstr)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
cmdline: git clone -v -- https://*****@github.com/swe-bench-repos/Lightning-AI__lightning.git /tmp/tmpkdmsrcmg/Lightning-AI__lightning
stderr: 'Cloning into '/tmp/tmpkdmsrcmg/Lightning-AI__lightning'...
remote: Repository not found.
fatal: repository 'https://github.com/swe-bench-repos/Lightning-AI__lightning.git/' not found
'
System Information
Linux Python 3.9 swebench
The issue seems to be that many repos cannot be found in swe-bench-repos.
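To see which mirrors are affected before starting a long run, something like the following might help. It is only a sketch: the mirror naming scheme (owner/name becomes swe-bench-repos/owner__name) is inferred from the clone URL in the log above, the helper names are hypothetical, and it assumes the dataset's repo field has the form owner/name. It also assumes `git` is on the PATH. An unauthenticated check cannot tell a missing mirror from a private one, so treat a negative result as "not reachable" rather than "does not exist".

```python
import os
import subprocess

# Assumption: mirrors follow the pattern seen in the log above,
# i.e. "owner/name" is mirrored at "swe-bench-repos/owner__name".
MIRROR_ORG = "swe-bench-repos"


def mirror_url(repo: str) -> str:
    """Map an upstream 'owner/name' repo to its assumed mirror URL."""
    owner, name = repo.split("/", 1)
    return f"https://github.com/{MIRROR_ORG}/{owner}__{name}.git"


def mirror_exists(repo: str, timeout: float = 30.0) -> bool:
    """Return True if `git ls-remote` can reach the assumed mirror."""
    # GIT_TERMINAL_PROMPT=0 makes git fail instead of asking for credentials.
    env = dict(os.environ, GIT_TERMINAL_PROMPT="0")
    try:
        result = subprocess.run(
            ["git", "ls-remote", "--heads", mirror_url(repo)],
            env=env,
            capture_output=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def missing_mirrors(repos):
    """Return the distinct repos whose mirror could not be reached."""
    return [repo for repo in sorted(set(repos)) if not mirror_exists(repo)]
```

Calling missing_mirrors with the distinct repo values of a split would then list the repositories that are likely to fail in the same way as Lightning-AI__lightning did here.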
Okay, there's an issue with generation for the train split at the moment.
Are you trying to generate instances for train or the test split?
I'm not sure when we'll be able to fix the instances in train. In the meantime, you should be able to generate the test split instances by adding the --split test argument, provided you are on the latest version of swebench (3.0.16). I suggest upgrading with pip install -U swebench, or pulling from main again, so that you are on the most recent version.
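A small pre-flight check along these lines could confirm that the installed swebench is at least 3.0.16 and assemble the command with the extra argument. This is a sketch, not part of swebench: the helper names are made up for illustration, the version comparison only handles plain numeric versions, and the flag spelling --split is copied from the reply above (running the module with --help is the safest way to confirm it).

```python
from importlib import metadata

MIN_VERSION = "3.0.16"  # version named in the reply above


def version_tuple(version: str) -> tuple:
    """Turn '3.0.16' into (3, 0, 16); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed: str, required: str = MIN_VERSION) -> bool:
    """Compare two dotted numeric versions component by component."""
    return version_tuple(installed) >= version_tuple(required)


def installed_swebench_version():
    """Return the installed swebench version, or None if it is not installed."""
    try:
        return metadata.version("swebench")
    except metadata.PackageNotFoundError:
        return None


def build_test_split_command(output_dir: str = "./base_datasets") -> list:
    """argv for the command from the report, plus the suggested split argument."""
    return [
        "python", "-m", "swebench.inference.make_datasets.create_text_dataset",
        "--dataset_name_or_path", "princeton-nlp/SWE-bench",
        "--output_dir", output_dir,
        "--prompt_style", "style-3",
        "--file_source", "oracle",
        "--split", "test",  # spelling taken from the reply; an assumption here
    ]
```

If installed_swebench_version() returns None or a version for which is_at_least is false, upgrading first is probably the better move; otherwise the list from build_test_split_command can be passed to subprocess.run.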
We've fixed swe-bench-repos to include all mirrors for the train split.