SWE-bench
Failing benchmark instances
Great job with the new containerized evaluation tool! I've run it a couple of times on the golden patches on SWE-bench Lite and overall it gives a more stable result than my swe-bench-docker setup. There are a few instances that fail intermittently, though. Some I recognize from tests in swe-bench-docker, and some are new. None of them are failing in 100% of the runs.
Django instances
In all the failing Django instances I've checked, the tests themselves seem to pass, but they are marked as failed because other log output is printed in the middle of the test result lines.
Here's an example of a test that is marked as failed:
test_annotation_with_nested_outerref (expressions.tests.BasicExpressionsTests) ... System check identified no issues (0 silenced).
ok
The same test in the output log of a successful run:
test_annotation_with_nested_outerref (expressions.tests.BasicExpressionsTests) ... ok
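To illustrate why the interleaved output could matter, here is a minimal sketch of two log parsers. It is not the actual SWE-bench parsing code; the names `parse_strict`, `parse_tolerant`, `TEST_LINE` and `STATUSES` are made up for this example, and the status list is an assumption. A parser that expects the test name and the status on one line would miss the result above, while one that also accepts a status on the following line would still pick it up.

```python
import re

# Assumed set of status tokens; Django's verbose runner can print others.
STATUSES = ("ok", "FAIL", "ERROR", "skipped")

# Matches lines like "test_name (module.Class) ... <rest>".
TEST_LINE = re.compile(r"^(?P<name>\S+ \(\S+\)) \.\.\. ?(?P<rest>.*)$")


def parse_strict(log: str) -> dict[str, str]:
    """Accept a result only if name and status share one line."""
    results = {}
    for line in log.splitlines():
        m = TEST_LINE.match(line)
        if m and m.group("rest").strip() in STATUSES:
            results[m.group("name")] = m.group("rest").strip()
    return results


def parse_tolerant(log: str) -> dict[str, str]:
    """Also accept a status that shows up alone on a later line."""
    results = {}
    pending = None  # test name still waiting for its status
    for line in log.splitlines():
        m = TEST_LINE.match(line)
        if m:
            name, rest = m.group("name"), m.group("rest").strip()
            if rest in STATUSES:
                results[name] = rest
                pending = None
            else:
                # Other output was printed where the status should be.
                pending = name
        elif pending and line.strip() in STATUSES:
            results[pending] = line.strip()
            pending = None
    return results


interleaved = (
    "test_annotation_with_nested_outerref "
    "(expressions.tests.BasicExpressionsTests) ... "
    "System check identified no issues (0 silenced).\n"
    "ok\n"
)
print(parse_strict(interleaved))    # the test is missing from the result
print(parse_tolerant(interleaved))  # the test is reported as "ok"
```

The tolerant variant is only one possible approach; it assumes the delayed status line belongs to the most recently started test, which may not hold if tests run in parallel.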
Other instances
In the following instances, different tests fail intermittently and I haven't found the root cause. I saw the same issues in swe-bench-docker with the matplotlib and sympy instances, but not with the psf__requests ones.
- matplotlib__matplotlib-23987
- psf__requests-1963
- psf__requests-2317
- psf__requests-2674
- sympy__sympy-13177
- sympy__sympy-13146
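For comparing repeated runs, a small tally like the following could help separate intermittent failures from consistent ones. This is only a sketch: `failure_rates` and `flaky` are hypothetical helpers, and it assumes you have already extracted the set of resolved instance IDs for each run (how to get those from the evaluation reports is not shown here). The IDs in the usage example are placeholders, not real results.

```python
from collections import Counter


def failure_rates(runs: list[set[str]], all_ids: set[str]) -> dict[str, float]:
    """Return, per instance, the fraction of runs in which it was not resolved.

    `runs` holds one set of resolved instance IDs per evaluation run.
    Instances that never failed are left out of the result.
    """
    if not runs:
        return {}
    failures = Counter()
    for resolved in runs:
        failures.update(all_ids - resolved)
    return {
        iid: failures[iid] / len(runs)
        for iid in sorted(all_ids)
        if failures[iid]
    }


def flaky(rates: dict[str, float]) -> list[str]:
    """Instances that failed in some runs but not in all of them."""
    return sorted(iid for iid, rate in rates.items() if 0 < rate < 1)


# Placeholder data: three runs over three made-up instance IDs.
runs = [{"inst-a", "inst-b"}, {"inst-a"}, {"inst-a", "inst-b"}]
rates = failure_rates(runs, {"inst-a", "inst-b", "inst-c"})
print(flaky(rates))  # ['inst-b']
```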
Have you experienced the same issues? Would it also be possible for you to share your run_instance_logs somewhere, so I can compare them with my own runs? It would be nice to nail this once and for all :)
I've run the benchmarks on Ubuntu 22 VMs with 16 cores on Azure (max_workers = 14).