SWE-bench sometimes gold_patch cannot pass the test


This is a very challenging benchmark, and I have learned a lot from it. Thank you for the effort you have put into this.

I tested using the swe-llama13b you provided and found that the number of tasks successfully solved was 0. I then changed KEY_PREDICTION from model_patch to patch, so that the gold patch (the target value of the prediction) is evaluated instead, and found that a large number of tasks still do not pass the tests. I am on macOS, and my only modification was changing sed -i 's/pytest/pytest -rA/' tox.ini to sed -i '' 's/pytest/pytest -rA/' tox.ini.
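For readers hitting the same sed -i difference between macOS and Linux, here is a minimal sketch of a platform-independent alternative in Python. It is not part of the SWE-bench harness; the function name add_ra_flag is hypothetical, and it only mimics what the sed expression above does (replace the first "pytest" on each line with "pytest -rA").

```python
import re
from pathlib import Path


def add_ra_flag(ini_path):
    """Replace the first 'pytest' on each line with 'pytest -rA', in place.

    Hypothetical helper: mirrors `sed -i 's/pytest/pytest -rA/' tox.ini`
    without relying on the platform-specific `-i` syntax of sed.
    """
    path = Path(ini_path)
    lines = path.read_text().splitlines(keepends=True)
    # count=1 matches sed's default of substituting only the first match per line
    patched = [re.sub(r"pytest", "pytest -rA", line, count=1) for line in lines]
    path.write_text("".join(patched))
```

Calling add_ra_flag("tox.ini") from the repository root would have the same effect on either platform.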

For example, below are the results of pytest-dev__pytest-5103 and pylint-dev__pylint-8281 respectively.

Dec 19 '23 12:12 LuoKaiGSW

This seems to happen for me as well for several test cases such as 'psf__requests-1724'

Jan 13 '24 18:01 anmolagarwal999

Hi @LuoKaiGSW @anmolagarwal999, thanks for the great question. To clarify, this should actually be expected behavior.

During the validation phase of task instances, to determine whether a task instance is usable, we check that at least 1 test's status changes from FAIL to PASS (F2P) when comparing the pre-/post- gold patch test results.

However, it is also possible that other tests keep the same status before and after the gold patch:

FAIL to FAIL (F2F)
PASS to PASS (P2P)

We do not evaluate models on F2F tests. We do evaluate models on P2P tests, which must keep passing (maintenance of original behavior).
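As a rough illustration of the rule above, the sketch below checks a task instance using only its F2P and P2P tests and ignores everything else in the log. It is not the harness's actual code: the function name is_resolved, the status string "PASSED", and the test IDs in the example are assumptions made for illustration.

```python
def is_resolved(post_patch_status, fail_to_pass, pass_to_pass):
    """Return True when every F2P and P2P test passed after the patch.

    post_patch_status: dict mapping test ID -> status string (assumed "PASSED"/"FAILED").
    fail_to_pass / pass_to_pass: lists of test IDs, as in the FAIL_TO_PASS
    and PASS_TO_PASS fields of a task instance.
    Tests outside these two lists (for example F2F tests) are ignored.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(post_patch_status.get(test) == "PASSED" for test in required)


# Hypothetical example: one F2F test fails, but it is not evaluated.
status = {
    "tests/test_a.py::test_fixed": "PASSED",       # F2P
    "tests/test_a.py::test_unchanged": "PASSED",   # P2P
    "tests/test_a.py::test_always_fails": "FAILED" # F2F, ignored
}
resolved = is_resolved(
    status,
    fail_to_pass=["tests/test_a.py::test_fixed"],
    pass_to_pass=["tests/test_a.py::test_unchanged"],
)
```

Under these assumptions, a test missing from the log counts as not passed, so an instance with an unparsed F2P or P2P test would not be marked resolved.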

If you look at the FAIL_TO_PASS fields for the tasks with these instance IDs, you'll notice that the failing tests referenced above are F2F, as they are not included. This is doubly confirmed by the local pre-/post- gold patch logs I just double checked on my side.

These are the F2P tests for each mentioned task instance

For pytest-5103: ["testing/test_assertrewrite.py::TestAssertionRewrite::test_unroll_expression"]
For pylint-8281: ["tests/lint/unittest_lint.py::test_source_roots_globbing"]
For requests-1724: [
"test_requests.py::RequestsTestCase::test_DIGEST_AUTH_RETURNS_COOKIE",
"test_requests.py::RequestsTestCase::test_DIGEST_HTTP_200_OK_GET",
"test_requests.py::RequestsTestCase::test_different_encodings_dont_break_post",
"test_requests.py::RequestsTestCase::test_generic_cookiejar_works",
"test_requests.py::RequestsTestCase::test_uppercase_scheme_redirect",
"test_requests.py::RequestsTestCase::test_user_agent_transfers"
]

tl;dr - don't worry about failing tests in the raw logs for gold patches, as we do not evaluate task completion on these tests.

Jan 15 '24 06:01 john-b-yang

@john-b-yang Thanks for your response. I was actually talking about FAIL_TO_PASS and PASS_TO_PASS test cases only. Turns out that there are some logs which get rendered as: which in .txt format look like:

The parser used to check which test cases passed did not seem to account for this. Changing the parser to ignore the color-related ASCII text fixed it for me.
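A minimal sketch of that kind of fix, assuming the log contains pytest's "-rA" short summary lines (for example "PASSED test_requests.py::RequestsTestCase::test_links") possibly wrapped in ANSI color escape sequences. This is not the harness's actual parser; the names ANSI_ESCAPE, STATUS_LINE, and parse_pytest_summary are hypothetical.

```python
import re

# Matches ANSI SGR (color/style) escape sequences such as "\x1b[32m" or "\x1b[0m".
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
# Matches summary lines like "FAILED path::Class::test_name - reason...".
STATUS_LINE = re.compile(r"^(PASSED|FAILED|ERROR)\s+(\S+)")


def parse_pytest_summary(log_text):
    """Map test ID -> status from pytest short summary lines, ignoring colors."""
    statuses = {}
    for raw_line in log_text.splitlines():
        # Strip color codes first so colored and plain logs parse identically.
        line = ANSI_ESCAPE.sub("", raw_line).strip()
        match = STATUS_LINE.match(line)
        if match:
            status, test_id = match.groups()
            statuses[test_id] = status
    return statuses
```

Stripping the escape sequences before matching means the same pattern works whether or not pytest emitted colored output.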

Jan 15 '24 06:01 anmolagarwal999

test_different_encodings_dont_break_post

Thank you very much for your kind response, @john-b-yang!

When I run the evaluation scripts on the gold patch, many instances still are not counted as resolved.

For example, psf__requests-1724 cannot pass one of the cases you mentioned above.

FAILED test_requests.py::RequestsTestCase::test_different_encodings_dont_break_post

The gold patch from GitHub is {'instance_id': 'psf__requests-1724', 'model_name_or_path': 'gold', 'model_patch': 'diff --git a/requests/sessions.py b/requests/sessions.py\nindex cc72f65d9d..175712f976 100644\n--- a/requests/sessions.py\n+++ b/requests/sessions.py\n@@ -12,7 +12,7 @@\n from collections import Mapping\n from datetime import datetime\n \n-from .compat import cookielib, OrderedDict, urljoin, urlparse, urlunparse\n+from .compat import cookielib, OrderedDict, urljoin, urlparse, urlunparse, builtin_str\n from .cookies import cookiejar_from_dict, extract_cookies_to_jar, RequestsCookieJar\n from .models import Request, PreparedRequest\n from .hooks import default_hooks, dispatch_hook\n@@ -309,6 +309,9 @@ def request(self, method, url,\n :param cert: (optional) if String, path to ssl client cert file (.pem).\n If Tuple, (\'cert\', \'key\') pair.\n """\n+\n+ method = builtin_str(method)\n+\n # Create the Request.\n req = Request(\n method = method.upper(),\n'}

The full log file is shown below:

Task Metadata:
	- Instance ID: psf__requests-1724
	- Testbed: /home/taow/swe-tb/gold/psf__requests/2.0/tmpn3a0wuje/psf__requests__2.0
	- Virtual Env.: /home/taow/.conda/envs/psf__requests__2.0
	- Evaluation Model: gold
>>>>> Applied Patch (pred_try)
>>>>> Applied Patch (pred_try)
Installation Command: /home/taow/.conda/envs/psf__requests__2.0/bin/python -m pip install .
Std. Output: Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Processing /home/taow/swe-tb/gold/psf__requests/2.0/tmpn3a0wuje/psf__requests__2.0
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: requests
  Building wheel for requests (setup.py): started
  Building wheel for requests (setup.py): finished with status 'done'
  Created wheel for requests: filename=requests-2.0.1-py3-none-any.whl size=433869 sha256=0be7c5e264be565f33f0cd0d86df26dc7e29c91073ad1efddc5e16419f6afe90
  Stored in directory: /tmp/pip-ephem-wheel-cache-94vs064d/wheels/fb/0f/a6/e9537780344ac221ec19dcc30388e8881e479cffe7c0006415
Successfully built requests
Installing collected packages: requests
  Attempting uninstall: requests
    Found existing installation: requests 2.0.1
    Uninstalling requests-2.0.1:
      Successfully uninstalled requests-2.0.1
Successfully installed requests-2.0.1

Std. Error: 

>>>>> Init Succeeded
>>>>> Applied Patch (test)
>>>>> Applied Patch (pred)
Env Script: source /home/pai/bin/activate /home/taow/.conda/envs/psf__requests__2.0;
Test Script: /home/taow/.conda/envs/psf__requests__2.0/bin/pytest --no-header -rA --tb=no -p no:cacheprovider test_requests.py;
Test Dir: /home/taow/swe-tb/gold/psf__requests/2.0/tmpn3a0wuje/psf__requests__2.0;
Output:
============================= test session starts ==============================
collected 88 items

test_requests.py .F..FF...FF.FF..F..F...FFF.F..............F...F.F.F.FF. [ 62%]
F.FF..F..........................                                        [100%]

=============================== warnings summary ===============================
requests/packages/urllib3/_collections.py:7
  /home/taow/swe-tb/gold/psf__requests/2.0/tmpn3a0wuje/psf__requests__2.0/requests/packages/urllib3/_collections.py:7: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working
    from collections import MutableMapping

requests/sessions.py:369
  /home/taow/swe-tb/gold/psf__requests/2.0/tmpn3a0wuje/psf__requests__2.0/requests/sessions.py:369: DeprecationWarning: invalid escape sequence \*
    """Sends a GET request. Returns :class:`Response` object.

requests/sessions.py:379
  /home/taow/swe-tb/gold/psf__requests/2.0/tmpn3a0wuje/psf__requests__2.0/requests/sessions.py:379: DeprecationWarning: invalid escape sequence \*
    """Sends a OPTIONS request. Returns :class:`Response` object.

requests/sessions.py:389
  /home/taow/swe-tb/gold/psf__requests/2.0/tmpn3a0wuje/psf__requests__2.0/requests/sessions.py:389: DeprecationWarning: invalid escape sequence \*
    """Sends a HEAD request. Returns :class:`Response` object.

requests/sessions.py:399
  /home/taow/swe-tb/gold/psf__requests/2.0/tmpn3a0wuje/psf__requests__2.0/requests/sessions.py:399: DeprecationWarning: invalid escape sequence \*
    """Sends a POST request. Returns :class:`Response` object.

requests/sessions.py:409
  /home/taow/swe-tb/gold/psf__requests/2.0/tmpn3a0wuje/psf__requests__2.0/requests/sessions.py:409: DeprecationWarning: invalid escape sequence \*
    """Sends a PUT request. Returns :class:`Response` object.

requests/sessions.py:419
  /home/taow/swe-tb/gold/psf__requests/2.0/tmpn3a0wuje/psf__requests__2.0/requests/sessions.py:419: DeprecationWarning: invalid escape sequence \*
    """Sends a PATCH request. Returns :class:`Response` object.

requests/sessions.py:429
  /home/taow/swe-tb/gold/psf__requests/2.0/tmpn3a0wuje/psf__requests__2.0/requests/sessions.py:429: DeprecationWarning: invalid escape sequence \*
    """Sends a DELETE request. Returns :class:`Response` object.

requests/sessions.py:12
  /home/taow/swe-tb/gold/psf__requests/2.0/tmpn3a0wuje/psf__requests__2.0/requests/sessions.py:12: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working
    from collections import Mapping

test_requests.py::RequestsTestCase::test_BASICAUTH_TUPLE_HTTP_200_OK_GET
  /home/taow/swe-tb/gold/psf__requests/2.0/tmpn3a0wuje/psf__requests__2.0/requests/models.py:156: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working
    if isinstance(hook, collections.Callable):

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
PASSED test_requests.py::RequestsTestCase::test_BASICAUTH_TUPLE_HTTP_200_OK_GET
PASSED test_requests.py::RequestsTestCase::test_DIGEST_AUTH_RETURNS_COOKIE
PASSED test_requests.py::RequestsTestCase::test_DIGEST_AUTH_SETS_SESSION_COOKIES
PASSED test_requests.py::RequestsTestCase::test_HTTP_200_OK_GET_ALTERNATIVE
PASSED test_requests.py::RequestsTestCase::test_HTTP_200_OK_GET_WITH_MIXED_PARAMS
PASSED test_requests.py::RequestsTestCase::test_HTTP_200_OK_GET_WITH_PARAMS
PASSED test_requests.py::RequestsTestCase::test_HTTP_302_ALLOW_REDIRECT_GET
PASSED test_requests.py::RequestsTestCase::test_autoset_header_values_are_native
PASSED test_requests.py::RequestsTestCase::test_basic_building
PASSED test_requests.py::RequestsTestCase::test_can_send_nonstring_objects_with_files
PASSED test_requests.py::RequestsTestCase::test_cannot_send_unprepared_requests
PASSED test_requests.py::RequestsTestCase::test_cookie_parameters
PASSED test_requests.py::RequestsTestCase::test_cookie_persists_via_api
PASSED test_requests.py::RequestsTestCase::test_cookie_quote_wrapped
PASSED test_requests.py::RequestsTestCase::test_decompress_gzip
PASSED test_requests.py::RequestsTestCase::test_entry_points
PASSED test_requests.py::RequestsTestCase::test_fixes_1329
PASSED test_requests.py::RequestsTestCase::test_generic_cookiejar_works
PASSED test_requests.py::RequestsTestCase::test_get_auth_from_url
PASSED test_requests.py::RequestsTestCase::test_header_keys_are_native
PASSED test_requests.py::RequestsTestCase::test_header_remove_is_case_insensitive
PASSED test_requests.py::RequestsTestCase::test_hook_receives_request_arguments
PASSED test_requests.py::RequestsTestCase::test_http_error
PASSED test_requests.py::RequestsTestCase::test_invalid_url
PASSED test_requests.py::RequestsTestCase::test_links
PASSED test_requests.py::RequestsTestCase::test_long_authinfo_in_url
PASSED test_requests.py::RequestsTestCase::test_mixed_case_scheme_acceptable
PASSED test_requests.py::RequestsTestCase::test_no_content_length
PASSED test_requests.py::RequestsTestCase::test_params_are_added_before_fragment
PASSED test_requests.py::RequestsTestCase::test_path_is_not_double_encoded
PASSED test_requests.py::RequestsTestCase::test_prepared_from_session
PASSED test_requests.py::RequestsTestCase::test_prepared_request_hook
PASSED test_requests.py::RequestsTestCase::test_request_ok_set
PASSED test_requests.py::RequestsTestCase::test_response_is_iterable
PASSED test_requests.py::RequestsTestCase::test_set_cookie_on_301
PASSED test_requests.py::RequestsTestCase::test_transport_adapter_ordering
PASSED test_requests.py::RequestsTestCase::test_unicode_header_name
PASSED test_requests.py::RequestsTestCase::test_unicode_multipart_post_fieldnames
PASSED test_requests.py::RequestsTestCase::test_uppercase_scheme_redirect
PASSED test_requests.py::RequestsTestCase::test_user_agent_transfers
PASSED test_requests.py::TestContentEncodingDetection::test_html4_pragma
PASSED test_requests.py::TestContentEncodingDetection::test_html_charset
PASSED test_requests.py::TestContentEncodingDetection::test_none
PASSED test_requests.py::TestContentEncodingDetection::test_precedence
PASSED test_requests.py::TestContentEncodingDetection::test_xhtml_pragma
PASSED test_requests.py::TestContentEncodingDetection::test_xml
PASSED test_requests.py::TestCaseInsensitiveDict::test_contains
PASSED test_requests.py::TestCaseInsensitiveDict::test_delitem
PASSED test_requests.py::TestCaseInsensitiveDict::test_docstring_example
PASSED test_requests.py::TestCaseInsensitiveDict::test_equality
PASSED test_requests.py::TestCaseInsensitiveDict::test_fixes_649
PASSED test_requests.py::TestCaseInsensitiveDict::test_get
PASSED test_requests.py::TestCaseInsensitiveDict::test_getitem
PASSED test_requests.py::TestCaseInsensitiveDict::test_iter
PASSED test_requests.py::TestCaseInsensitiveDict::test_iterable_init
PASSED test_requests.py::TestCaseInsensitiveDict::test_kwargs_init
PASSED test_requests.py::TestCaseInsensitiveDict::test_len
PASSED test_requests.py::TestCaseInsensitiveDict::test_lower_items
PASSED test_requests.py::TestCaseInsensitiveDict::test_mapping_init
PASSED test_requests.py::TestCaseInsensitiveDict::test_preserve_key_case
PASSED test_requests.py::TestCaseInsensitiveDict::test_preserve_last_key_case
PASSED test_requests.py::TestCaseInsensitiveDict::test_setdefault
PASSED test_requests.py::TestCaseInsensitiveDict::test_update
PASSED test_requests.py::TestCaseInsensitiveDict::test_update_retains_unchanged
PASSED test_requests.py::UtilsTestCase::test_super_len_io_streams
FAILED test_requests.py::RequestsTestCase::test_DIGESTAUTH_WRONG_HTTP_401_GET
FAILED test_requests.py::RequestsTestCase::test_DIGEST_HTTP_200_OK_GET - requ...
FAILED test_requests.py::RequestsTestCase::test_DIGEST_STREAM - requests.exce...
FAILED test_requests.py::RequestsTestCase::test_HTTP_200_OK_HEAD - requests.e...
FAILED test_requests.py::RequestsTestCase::test_HTTP_200_OK_PUT - requests.ex...
FAILED test_requests.py::RequestsTestCase::test_POSTBIN_GET_POST_FILES - requ...
FAILED test_requests.py::RequestsTestCase::test_POSTBIN_GET_POST_FILES_WITH_DATA
FAILED test_requests.py::RequestsTestCase::test_basicauth_with_netrc - reques...
FAILED test_requests.py::RequestsTestCase::test_conflicting_post_params - Typ...
FAILED test_requests.py::RequestsTestCase::test_cookie_removed_on_expire - re...
FAILED test_requests.py::RequestsTestCase::test_cookie_sent_on_redirect - req...
FAILED test_requests.py::RequestsTestCase::test_custom_content_type - request...
FAILED test_requests.py::RequestsTestCase::test_different_encodings_dont_break_post
FAILED test_requests.py::RequestsTestCase::test_params_are_merged_case_sensitive
FAILED test_requests.py::RequestsTestCase::test_request_cookie_overrides_session_cookie
FAILED test_requests.py::RequestsTestCase::test_requests_in_history_are_not_overridden
FAILED test_requests.py::RequestsTestCase::test_session_pickling - AttributeE...
FAILED test_requests.py::RequestsTestCase::test_status_raising - requests.exc...
FAILED test_requests.py::RequestsTestCase::test_time_elapsed_blank - requests...
FAILED test_requests.py::RequestsTestCase::test_unicode_get - requests.except...
FAILED test_requests.py::RequestsTestCase::test_unicode_method_name - request...
FAILED test_requests.py::RequestsTestCase::test_unicode_multipart_post - requ...
FAILED test_requests.py::RequestsTestCase::test_urlencoded_get_query_multivalued_param
============ 23 failed, 65 passed, 10 warnings in 698.70s (0:11:38) ============

>>>>> Some Tests Failed

Jan 15 '24 06:01 itaowei

Hi all, thanks for your patience, we will respond more promptly going forward.

We realized that many have been running into common evaluation harness errors. We have spent the last 2 weeks making a lot of improvements to this repo that we think should make evaluation more robust and reliable. You can read the report here.

My suspicion is that the errors you were observing were probably caused by multiple failure cases as detailed in the report. At this point, I'd encourage running evaluation again to see if you are still getting these issues. We are also about to release execution logs + predictions for all models run on SWE-bench, which you can use to corroborate what you're seeing.

Leaving this open, please feel free to follow up if there's any additional questions or if it looks like things are still not working.

Apr 16 '24 03:04 john-b-yang

Hi, I am still facing the same error (even with no conda errors). The gold patches often do not pass all the tests on both the dev and test splits of the swe-bench-lite dataset. Is there something I might be missing?

May 01 '24 00:05 1jsingh

Closing this issue due to inactivity.

Thanks for all the helpful feedback in this thread @itaowei @LuoKaiGSW @anmolagarwal999. Although I think SWE-bench evaluation has gotten easier over these last couple months, I do realize that several issues remain due to inconsistencies arising from running SWE-bench on different platforms.

We are about to release a revamped SWE-bench evaluation harness that uses Docker containers as execution sandboxes, which eliminates many problems such as the ones discussed here.

If you're still interested in working on SWE-bench, please look out for that release! It should be within the next 2 weeks.

@1jsingh if you're interested in getting evaluation to work soon, I'd recommend checking out the latest version of SWE-bench. Otherwise, I think the upcoming release should help a lot.

Jun 17 '24 17:06 john-b-yang
