SWE-bench
Incorrect unit tests in `FAIL_TO_PASS` and `PASS_TO_PASS`
Describe the bug
Description:
The FAIL_TO_PASS and PASS_TO_PASS fields of some instances contain unrelated strings instead of references to unit tests.
An example is provided below. django__django-16950 has ["If form data is provided, a parent's auto-generated alternate key is"] as its FAIL_TO_PASS and some comments as its PASS_TO_PASS
I haven't checked thoroughly, but I can see django__django-15525 and django__django-14792 also have the same problem. Not sure if this will affect the actual evaluation of swebench.
Steps/Code to Reproduce
Buggy Example: django__django-16950
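One rough way to spot such entries is to flag any string that does not look like a Django test identifier. The sketch below is only a heuristic: the regular expression and the helper name `suspicious_entries` are assumptions made for illustration, not part of SWE-bench, and the list is copied from the example above rather than loaded from the dataset.

```python
import re

# Assumed shape of a Django test identifier, e.g.
# "test_name (package.module.Class.test_name)". This is a heuristic only.
DJANGO_TEST_ID = re.compile(r"^\w+ \([\w.]+\)$")


def suspicious_entries(test_names):
    """Return the entries that do not look like Django test identifiers."""
    return [name for name in test_names if not DJANGO_TEST_ID.match(name)]


# FAIL_TO_PASS value reported for django__django-16950 in this issue.
fail_to_pass = ["If form data is provided, a parent's auto-generated alternate key is"]
print(suspicious_entries(fail_to_pass))
```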
Expected Results
The actual unit tests
Actual Results
As described above
System Information
No response
I think this is the legitimate name of a test, for instance here's the PASS_TO_PASS test referenced in the image.
I'm not sure this needs fixing. Django has its own custom testing software iirc (it doesn't use pytest). From when I last ran it, I think the test name + result is printed out (e.g. <test name> ... ok or <test name> ... fail).
Leaving this open for discussion. My current stance is that this doesn't need fixing + is the expected behavior for Django testing.
Hi John @john-b-yang, I think this may be a defect in the current parser. It fails to parse several test cases when the test case output spans multiple lines in Django.
In the case @WuYff shows, the test case is at https://github.com/django/django/blob/main/tests/model_formsets/test_uuid.py, where "If form data is provided, a parent's auto-generated alternate key is set." is the docstring for the test case test_inlineformset_factory_nulls_default_pks_alternate_key_relation_data (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks_alternate_key_relation_data)
I will open a pull request to solve this. We can continue discussing it.
A snippet of the test log for django__django-16950 follows:
test_inlineformset_factory_ignores_default_pks_on_submit (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_ignores_default_pks_on_submit)
#24377 - Inlines with a model field default should ignore that default ... ok
test_inlineformset_factory_nulls_default_pks (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks)
#24377 - If we're adding a new object, a parent's auto-generated pk ... ok
test_inlineformset_factory_nulls_default_pks_alternate_key_relation (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks_alternate_key_relation)
#24958 - Variant of test_inlineformset_factory_nulls_default_pks for ... ok
test_inlineformset_factory_nulls_default_pks_alternate_key_relation_data (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks_alternate_key_relation_data)
If form data is provided, a parent's auto-generated alternate key is ... ok
test_inlineformset_factory_nulls_default_pks_auto_parent_uuid_child (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks_auto_parent_uuid_child)
#24958 - Variant of test_inlineformset_factory_nulls_default_pks for ... ok
test_inlineformset_factory_nulls_default_pks_child_editable_pk (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks_child_editable_pk)
#24958 - Variant of test_inlineformset_factory_nulls_default_pks for ... ok
test_inlineformset_factory_nulls_default_pks_uuid_parent_auto_child (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks_uuid_parent_auto_child)
#24958 - Variant of test_inlineformset_factory_nulls_default_pks for ... ok
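To make the problem concrete, here is a minimal sketch of two ways to read a log like the one above. Neither function is the actual SWE-bench parser: `parse_naive` and `parse_joined` are hypothetical names, and the regular expressions are assumptions about the log format. The first keys each result by whatever text precedes ` ... ok` on the same line, so tests with a docstring are recorded under the docstring and tests sharing a docstring collapse into one entry. The second remembers the preceding identifier line and uses it as the key.

```python
import re

# Assumed line shapes; both patterns are illustrative, not the real parser.
HEADER = re.compile(r"^(\w+) \(([\w.]+)\)$")
RESULT = re.compile(r"^(.*) \.\.\. (ok|FAIL|ERROR)$")

LOG = """
test_inlineformset_factory_nulls_default_pks_alternate_key_relation (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks_alternate_key_relation)
#24958 - Variant of test_inlineformset_factory_nulls_default_pks for ... ok
test_inlineformset_factory_nulls_default_pks_alternate_key_relation_data (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks_alternate_key_relation_data)
If form data is provided, a parent's auto-generated alternate key is ... ok
test_inlineformset_factory_nulls_default_pks_auto_parent_uuid_child (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks_auto_parent_uuid_child)
#24958 - Variant of test_inlineformset_factory_nulls_default_pks for ... ok
""".strip()


def parse_naive(log):
    """Key each result by the text before ' ... <status>' on the same line."""
    results = {}
    for line in log.splitlines():
        match = RESULT.match(line.strip())
        if match:
            results[match.group(1)] = match.group(2)
    return results


def parse_joined(log):
    """Key each result by the preceding identifier line when there is one."""
    results = {}
    pending = None  # identifier line still waiting for its status
    for line in log.splitlines():
        line = line.strip()
        match = RESULT.match(line)
        if match:
            description, status = match.groups()
            results[pending if pending is not None else description] = status
            pending = None
        elif HEADER.match(line):
            pending = line
    return results


print(len(parse_naive(LOG)), len(parse_joined(LOG)))
```

On this three-test snippet the naive reading yields two keys (the two "#24958" lines merge), while the joined reading keeps one key per test.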
Oh hmm, like you're suggesting the correct name should be:
test_inlineformset_factory_ignores_default_pks_on_submit (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_ignores_default_pks_on_submit) #24377 - Inlines with a model field default should ignore that default
Instead of just
#24377 - Inlines with a model field default should ignore that default
Is this the right understanding?
When you re-run evaluation w/ gold patches on all django instances, do you get all instances passing again? (Just realized that this fix would also require an update to the SWE-bench dataset).
I'm still a bit against integrating this change because
- This would also require a SWE-bench dataset update, which would require re-validation, and that would take a lot of time.
- Removing spaces from the tests makes the test name less readable.
I understand the existing log parsing may not be a perfect capture of the test naming, but I believe it's quite reliable for the large majority of instances.
An easier fix may be changing the Django parsing logic such that if any test such as
#24958 - Variant of test_inlineformset_factory_nulls_default_pks for ... ok
shows up multiple times, if the test case corresponds to a fail 1+ times, we make sure the test case maps to a fail and does not get overridden by an ok later on. Again, I realize it may not be perfect, but the proposed solution is quite costly in terms of time, and it will invalidate existing submissions to SWE-bench - if you would like to carry out this effort, I will support you, but I don't believe it's worthwhile.
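A minimal sketch of that "a failure is never overridden" rule, assuming results are still keyed by the text before ` ... <status>`. The function name `parse_sticky_failures`, the regular expression, and the `FAIL` status in the sample log are invented for illustration; this is not the existing SWE-bench parsing code.

```python
import re

# Assumed result-line shape; illustrative only.
RESULT = re.compile(r"^(.*) \.\.\. (ok|FAIL|ERROR)$")
FAILING = {"FAIL", "ERROR"}


def parse_sticky_failures(log):
    """Map each name to a status; once a name has failed, keep the failure."""
    results = {}
    for line in log.splitlines():
        match = RESULT.match(line.strip())
        if not match:
            continue  # identifier lines and other output carry no status
        name, status = match.groups()
        if results.get(name) in FAILING:
            continue  # a later 'ok' must not hide an earlier failure
        results[name] = status
    return results


# Made-up log: two tests share a docstring line, the first one fails.
LOG = """
test_a (pkg.tests.Cls.test_a)
#24958 - Variant of test_inlineformset_factory_nulls_default_pks for ... FAIL
test_b (pkg.tests.Cls.test_b)
#24958 - Variant of test_inlineformset_factory_nulls_default_pks for ... ok
""".strip()

print(parse_sticky_failures(LOG))
```

The trade-off is that a shared name is reported as failing even though only one of the underlying tests failed, so the rule is conservative but cannot say which test broke.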
Yeah, your understanding is right. Indeed I have found that more than half of the test names in SWE-Bench -Lite and -Verified are not correctly recorded.
Meanwhile, I recall that the parsing tools in this repo have been updated multiple times, which might also bring
I have uploaded the updated annotations of SWE-Bench_Lite and SWE-Bench_Verified below for inspection:
updated_parser_test_instance_dict_verified.json updated_parser_test_instance_dict_lite.json
I have checked some of the submitted patches and found that more than 100 code patches have been erroneously labeled as passed for this reason, so I think it deserves more attention. I am happy to carry out the effort to fix this issue (at least correcting the problems we already know about, to make the SWE-Bench evaluation more rigorous).
I see, ok I'll take a look at these at some point within the next 2 weeks. Thanks for the data!
Hi John @john-b-yang, is there an update? Thanks!