SWE-bench
How stable is run_evaluation.py with gold patch for SWE-bench_Verified?
Describe the issue
I ran the evaluation script with --predictions_path gold on the 500 tasks in SWE-bench_Verified, and 14 of them came back unresolved.
I'm using the current main branch of swebench, at commit c63a11369d6e2d5c2d11d8cdd50d2e39f93d9f3d.
This is the exact command:
python -m swebench.harness.run_evaluation --predictions_path gold --max_workers 25 --run_id validate-gold-verified --dataset_name princeton-nlp/SWE-bench_Verified --cache_level instance
These are the unresolved task IDs from the first run:
"unresolved_ids": [
"astropy__astropy-7166",
"astropy__astropy-7336",
"astropy__astropy-7606",
"astropy__astropy-7671",
"astropy__astropy-8707",
"astropy__astropy-8872",
"django__django-10097",
"matplotlib__matplotlib-20488",
"psf__requests-2317",
"pylint-dev__pylint-6528",
"pylint-dev__pylint-7080",
"pylint-dev__pylint-7277",
"sphinx-doc__sphinx-10323",
"sphinx-doc__sphinx-10435"
],
Then I ran the same command a second time and got 15 unresolved tasks. The 14 from the first run failed again, and psf__requests-1766 failed in addition:
"unresolved_ids": [
"astropy__astropy-7166",
"astropy__astropy-7336",
"astropy__astropy-7606",
"astropy__astropy-7671",
"astropy__astropy-8707",
"astropy__astropy-8872",
"django__django-10097",
"matplotlib__matplotlib-20488",
"psf__requests-1766",
"psf__requests-2317",
"pylint-dev__pylint-6528",
"pylint-dev__pylint-7080",
"pylint-dev__pylint-7277",
"sphinx-doc__sphinx-10323",
"sphinx-doc__sphinx-10435"
],
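To make the difference between the two runs easier to see, here is a minimal sketch that compares the two lists with set operations. The IDs are copied from the two lists above; the names run_1, run_2, and compare_runs are only illustrative and are not part of swebench.

```python
# Unresolved task IDs copied from the first run reported above.
run_1 = {
    "astropy__astropy-7166", "astropy__astropy-7336", "astropy__astropy-7606",
    "astropy__astropy-7671", "astropy__astropy-8707", "astropy__astropy-8872",
    "django__django-10097", "matplotlib__matplotlib-20488",
    "psf__requests-2317",
    "pylint-dev__pylint-6528", "pylint-dev__pylint-7080", "pylint-dev__pylint-7277",
    "sphinx-doc__sphinx-10323", "sphinx-doc__sphinx-10435",
}

# Unresolved task IDs copied from the second run reported above.
run_2 = {
    "astropy__astropy-7166", "astropy__astropy-7336", "astropy__astropy-7606",
    "astropy__astropy-7671", "astropy__astropy-8707", "astropy__astropy-8872",
    "django__django-10097", "matplotlib__matplotlib-20488",
    "psf__requests-1766", "psf__requests-2317",
    "pylint-dev__pylint-6528", "pylint-dev__pylint-7080", "pylint-dev__pylint-7277",
    "sphinx-doc__sphinx-10323", "sphinx-doc__sphinx-10435",
}


def compare_runs(first, second):
    """Return (failing in both, only in first, only in second) as sorted lists."""
    return sorted(first & second), sorted(first - second), sorted(second - first)


stable, only_first, only_second = compare_runs(run_1, run_2)
print(f"unresolved in both runs: {len(stable)}")
print(f"unresolved only in run 1: {only_first}")
print(f"unresolved only in run 2: {only_second}")
```

Tasks that appear in both sets fail consistently with the gold patch, while a task that appears in only one set (here psf__requests-1766) looks flaky rather than consistently broken.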
Suggest an improvement to documentation
No response
This might be related to #225, #167, #246, #267, and #274.