How stable is run_evaluation.py with gold patch for SWE-bench_Verified?

Open MarcCote opened this issue 1 year ago • 1 comments

Describe the issue

I ran the evaluation script with --predictions_path gold on the 500 tasks in SWE-bench_Verified, and 14 of them were reported as unresolved.

I'm using the current main branch of swebench, at commit c63a11369d6e2d5c2d11d8cdd50d2e39f93d9f3d

This is the exact command:

    python -m swebench.harness.run_evaluation --predictions_path gold --max_workers 25 --run_id validate-gold-verified --dataset_name princeton-nlp/SWE-bench_Verified --cache_level instance

These are the unresolved task IDs from the first run:

    "unresolved_ids": [
        "astropy__astropy-7166",
        "astropy__astropy-7336",
        "astropy__astropy-7606",
        "astropy__astropy-7671",
        "astropy__astropy-8707",
        "astropy__astropy-8872",
        "django__django-10097",
        "matplotlib__matplotlib-20488",
        "psf__requests-2317",
        "pylint-dev__pylint-6528",
        "pylint-dev__pylint-7080",
        "pylint-dev__pylint-7277",
        "sphinx-doc__sphinx-10323",
        "sphinx-doc__sphinx-10435"
    ],

I then ran the same command a second time and got 15 unresolved tasks. The 14 from the first run failed again, plus psf__requests-1766:

    "unresolved_ids": [
        "astropy__astropy-7166",
        "astropy__astropy-7336",
        "astropy__astropy-7606",
        "astropy__astropy-7671",
        "astropy__astropy-8707",
        "astropy__astropy-8872",
        "django__django-10097",
        "matplotlib__matplotlib-20488",
        "psf__requests-1766",
        "psf__requests-2317",
        "pylint-dev__pylint-6528",
        "pylint-dev__pylint-7080",
        "pylint-dev__pylint-7277",
        "sphinx-doc__sphinx-10323",
        "sphinx-doc__sphinx-10435"
    ],
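For anyone comparing runs, here is a minimal sketch of how the two lists above can be diffed to separate tasks that fail every time from tasks that fail only sometimes. The "unresolved_ids" key is the one shown in the output above. The helper names and the report path passed to load_unresolved are hypothetical, since the report's file name is not given here.

```python
import json
from pathlib import Path


def load_unresolved(report_path):
    """Read one evaluation report (path is an assumption) and return its unresolved IDs."""
    report = json.loads(Path(report_path).read_text())
    return set(report["unresolved_ids"])


def split_by_stability(*runs):
    """Return (always_unresolved, sometimes_unresolved) across the given runs."""
    run_sets = [set(run) for run in runs]
    always = set.intersection(*run_sets)
    sometimes = set.union(*run_sets) - always
    return sorted(always), sorted(sometimes)


# The two lists reported in this issue, inlined so the sketch runs without any files.
run_1 = [
    "astropy__astropy-7166", "astropy__astropy-7336", "astropy__astropy-7606",
    "astropy__astropy-7671", "astropy__astropy-8707", "astropy__astropy-8872",
    "django__django-10097", "matplotlib__matplotlib-20488", "psf__requests-2317",
    "pylint-dev__pylint-6528", "pylint-dev__pylint-7080", "pylint-dev__pylint-7277",
    "sphinx-doc__sphinx-10323", "sphinx-doc__sphinx-10435",
]
run_2 = run_1 + ["psf__requests-1766"]

always, sometimes = split_by_stability(run_1, run_2)
print(len(always), "unresolved in both runs")
print("unresolved in only one run:", sometimes)
```

With more than two runs, the same function can be called with every list at once. Tasks in the second group are the candidates for nondeterministic behavior, while the first group points to failures that reproduce consistently.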

Jan 20 '25 21:01 MarcCote

This might be related to #225, #167, #246, #267, and #274.

Jan 20 '25 21:01 MarcCote