Error occurred when executing `create_text_dataset`.

Open lycfight opened this issue 8 months ago • 2 comments

Describe the bug

I want to try running create_text_dataset to create a dataset similar in format to princeton-nlp/SWE-bench_oracle. I followed the steps in the documentation to run the command, but an error occurred.

Steps/Code to Reproduce

The same as the steps in the documentation:

python -m swebench.inference.make_datasets.create_text_dataset --dataset_name_or_path princeton-nlp/SWE-bench --output_dir ./base_datasets --prompt_style style-3 --file_source oracle

Expected Results

I believe it should successfully generate data in a format similar to princeton-nlp/SWE-bench_oracle, which includes the text field.

Actual Results

(SWE-bench) root@cpu01-2050-SWE-bench:~/SWE-bench# python -m swebench.inference.make_datasets.create_text_dataset --dataset_name_or_path princeton-nlp/SWE-bench --output_dir ./base_datasets --prompt_style style-3 --file_source oracle
2025-04-07 17:54:32,438 - datasets - INFO - PyTorch version 2.6.0 available.
2025-04-07 17:54:34,120 - swebench.inference.make_datasets.tokenize_dataset - WARNING - Disabling caching
2025-04-07 17:54:36,563 - __main__ - INFO - Found {'dev', 'test', 'train'} splits
2025-04-07 17:54:36,563 - __main__ - INFO - Processing train split
2025-04-07 17:54:37,937 - swebench.inference.make_datasets.create_instance - INFO - Found 75 already processed instances
2025-04-07 17:54:37,939 - swebench.inference.make_datasets.create_instance - INFO - Processing 18933 instances
Processing instances:   0%|                                                                                                                                          | 0/18933 [00:00<?, ?it/s]Failed on instance Lightning-AI__lightning-1108 Cmd('git') failed due to: exit code(128)
  cmdline: git clone -v -- https://*****@github.com/swe-bench-repos/Lightning-AI__lightning.git /tmp/tmpkdmsrcmg/Lightning-AI__lightning
  stderr: 'Cloning into '/tmp/tmpkdmsrcmg/Lightning-AI__lightning'...
remote: Repository not found.
fatal: repository 'https://github.com/swe-bench-repos/Lightning-AI__lightning.git/' not found
'
Traceback (most recent call last):
  File "/root/SWE-bench/swebench/inference/make_datasets/create_instance.py", line 410, in add_text_inputs
    with AutoContextManager(instance, root_dir, verbose=verbose) as cm:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/SWE-bench/swebench/inference/make_datasets/utils.py", line 203, in __init__
    Repo.clone_from(repo_url, repo_dir)
  File "/root/miniforge3/envs/SWE-bench/lib/python3.12/site-packages/git/repo/base.py", line 1541, in clone_from
    return cls._clone(
           ^^^^^^^^^^^
  File "/root/miniforge3/envs/SWE-bench/lib/python3.12/site-packages/git/repo/base.py", line 1412, in _clone
    finalize_process(proc, stderr=stderr)
  File "/root/miniforge3/envs/SWE-bench/lib/python3.12/site-packages/git/util.py", line 504, in finalize_process
    proc.wait(**kwargs)
  File "/root/miniforge3/envs/SWE-bench/lib/python3.12/site-packages/git/cmd.py", line 834, in wait
    raise GitCommandError(remove_password_if_present(self.args), status, errstr)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git clone -v -- https://*****@github.com/swe-bench-repos/Lightning-AI__lightning.git /tmp/tmpkdmsrcmg/Lightning-AI__lightning
  stderr: 'Cloning into '/tmp/tmpkdmsrcmg/Lightning-AI__lightning'...
remote: Repository not found.
fatal: repository 'https://github.com/swe-bench-repos/Lightning-AI__lightning.git/' not found
'

System Information

Linux Python 3.9 swebench

lycfight avatar Apr 07 '25 10:04 lycfight

The issue seems to be that many repos cannot be found in the swe-bench-repos organization.
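To see which mirrors are missing before starting a long run, something like the following might help. It is only a sketch: the `owner__name` URL pattern is inferred from the clone URL in the traceback above, and `mirror_url` and `mirror_exists` are hypothetical helper names, not part of swebench.

```python
import os
import subprocess

# Organization name taken from the clone URL in the log above.
MIRROR_ORG = "swe-bench-repos"


def mirror_url(repo: str) -> str:
    """Map 'owner/name' to the mirror URL pattern seen in the traceback (assumed)."""
    owner, name = repo.split("/", 1)
    return f"https://github.com/{MIRROR_ORG}/{owner}__{name}.git"


def mirror_exists(repo: str, timeout: float = 30.0) -> bool:
    """Return True if `git ls-remote` can reach the mirror, False otherwise."""
    # GIT_TERMINAL_PROMPT=0 stops git from asking for credentials
    # when the repository is missing or private.
    env = dict(os.environ, GIT_TERMINAL_PROMPT="0")
    try:
        result = subprocess.run(
            ["git", "ls-remote", mirror_url(repo)],
            capture_output=True,
            env=env,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


# Show the URL that would be probed for the failing instance's repository.
print(mirror_url("Lightning-AI/lightning"))
```

Calling `mirror_exists("Lightning-AI/lightning")` needs network access and a `git` binary on the PATH; it returns `False` for any mirror that produces the "Repository not found" error shown in the log.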

lycfight avatar Apr 07 '25 10:04 lycfight

Okay, there's an issue with generation for the train split at the moment. Are you trying to generate instances for the train split or the test split? I'm not sure when we'll be able to fix instances in train, but you should be able to generate test split instances by adding the --split test argument, using the latest version of swebench (3.0.16). I suggest upgrading swebench with pip install -U swebench, or pulling from main again, to use the most recent version.
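For reference, the suggested invocation can be assembled roughly as below. This is a sketch rather than an official recipe: `build_command` is a hypothetical helper, and every flag it uses is copied from the command in the report plus the `--split test` argument mentioned above.

```python
import shlex
import sys
from typing import Optional


def build_command(
    split: Optional[str] = None,
    dataset: str = "princeton-nlp/SWE-bench",
    output_dir: str = "./base_datasets",
) -> list[str]:
    """Build the create_text_dataset command line from the report."""
    cmd = [
        sys.executable,
        "-m",
        "swebench.inference.make_datasets.create_text_dataset",
        "--dataset_name_or_path", dataset,
        "--output_dir", output_dir,
        "--prompt_style", "style-3",
        "--file_source", "oracle",
    ]
    if split is not None:
        # Restrict processing to one split, e.g. "test".
        cmd += ["--split", split]
    return cmd


# Print the command for the test split without running it.
print(shlex.join(build_command(split="test")))
```

After upgrading with `pip install -U swebench`, the list can be passed to `subprocess.run(build_command(split="test"), check=True)` to run the generation for the test split only.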

carlosejimenez avatar Apr 09 '25 23:04 carlosejimenez

We've fixed swe-bench-repos to include all mirrors for the train split.

carlosejimenez avatar Jul 16 '25 23:07 carlosejimenez