
Incorrect unit tests in `FAIL_TO_PASS` and `PASS_TO_PASS`

Open WuYff opened this issue 11 months ago • 6 comments

Describe the bug

Description: The FAIL_TO_PASS and PASS_TO_PASS fields of some instances contain unrelated strings instead of references to unit tests.

An example is provided below. django__django-16950 has ["If form data is provided, a parent's auto-generated alternate key is"] as its FAIL_TO_PASS and some comments as its PASS_TO_PASS.

I haven't checked thoroughly, but I can see that django__django-15525 and django__django-14792 have the same problem. I'm not sure whether this affects the actual SWE-bench evaluation.

Steps/Code to Reproduce

Buggy Example: django__django-16950

image

Expected Results

The actual unit tests

Actual Results

As described above

System Information

No response

WuYff avatar Dec 17 '24 22:12 WuYff

I think this is the legitimate name of a test. For instance, here's the PASS_TO_PASS test referenced in the image.

I'm not sure this needs fixing. Django has its own custom testing software iirc (it doesn't use pytest). From when I last ran it, I think the test name + result is printed out (e.g. <test name> ... ok or <test name> ... fail).
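To make the behavior under discussion concrete, here is a minimal sketch of a line-by-line parser for this kind of output. It is not the actual SWE-bench parser: the function name `parse_django_log_naive` is hypothetical, and it assumes the statuses are printed as `ok`, `FAIL`, or `ERROR`. It records whatever text precedes ` ... <status>` on a line as the test name.

```python
def parse_django_log_naive(log: str) -> dict[str, str]:
    """Map the text before ' ... <status>' on each line to that status.

    Hypothetical sketch; assumes statuses are printed as ok, FAIL, or ERROR.
    """
    results: dict[str, str] = {}
    for raw in log.splitlines():
        line = raw.strip()
        for status in ("ok", "FAIL", "ERROR"):
            suffix = f" ... {status}"
            if line.endswith(suffix):
                # Everything before the suffix is treated as the test name.
                results[line[: -len(suffix)]] = status
                break
    return results


sample_log = """\
test_a (pkg.mod.SomeTests.test_a) ... ok
test_b (pkg.mod.SomeTests.test_b)
Docstring of test_b ... ok
"""
parsed = parse_django_log_naive(sample_log)
```

With this approach, a test that prints its docstring on a second line ends up keyed by the docstring (`Docstring of test_b` above) rather than by the test identifier.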

Leaving this open for discussion. My current stance is that this doesn't need fixing + is the expected behavior for Django testing.

john-b-yang avatar Jan 13 '25 21:01 john-b-yang

Hi John @john-b-yang, I think this may be a defect in the current parser. It fails to parse test cases whose log entry spans multiple lines in Django. In the case @WuYff shows, the test case is at https://github.com/django/django/blob/main/tests/model_formsets/test_uuid.py, where "If form data is provided, a parent's auto-generated alternate key is set." is the docstring of the test case test_inlineformset_factory_nulls_default_pks_alternate_key_relation_data (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks_alternate_key_relation_data).

I will open a pull request to fix this. We can continue discussing it there.

Here is a snippet of the test log for django__django-16950:

test_inlineformset_factory_ignores_default_pks_on_submit (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_ignores_default_pks_on_submit)
#24377 - Inlines with a model field default should ignore that default ... ok
test_inlineformset_factory_nulls_default_pks (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks)
#24377 - If we're adding a new object, a parent's auto-generated pk ... ok
test_inlineformset_factory_nulls_default_pks_alternate_key_relation (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks_alternate_key_relation)
#24958 - Variant of test_inlineformset_factory_nulls_default_pks for ... ok
test_inlineformset_factory_nulls_default_pks_alternate_key_relation_data (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks_alternate_key_relation_data)
If form data is provided, a parent's auto-generated alternate key is ... ok
test_inlineformset_factory_nulls_default_pks_auto_parent_uuid_child (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks_auto_parent_uuid_child)
#24958 - Variant of test_inlineformset_factory_nulls_default_pks for ... ok
test_inlineformset_factory_nulls_default_pks_child_editable_pk (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks_child_editable_pk)
#24958 - Variant of test_inlineformset_factory_nulls_default_pks for ... ok
test_inlineformset_factory_nulls_default_pks_uuid_parent_auto_child (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_nulls_default_pks_uuid_parent_auto_child)
#24958 - Variant of test_inlineformset_factory_nulls_default_pks for ... ok
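One way to handle entries like these is to remember the identifier line and attach the following result line to it. The sketch below is only an illustration under stated assumptions, not the parser in this repo or in the proposed pull request: the names `HEADER_RE`, `RESULT_RE`, and `parse_django_log` are hypothetical, the statuses are assumed to be `ok`, `FAIL`, or `ERROR`, and using the identifier line alone as the key is one possible choice.

```python
import re

# Identifier line on its own: test_name (package.module.Class.test_name)
HEADER_RE = re.compile(r"^\w+ \([\w.]+\)$")
# Result line: any text followed by " ... <status>"
RESULT_RE = re.compile(r"^(?P<text>.*?) \.\.\. (?P<status>ok|FAIL|ERROR)$")


def parse_django_log(log: str) -> dict[str, str]:
    """Map test identifiers to statuses, handling two-line entries.

    Hypothetical sketch: if an identifier line is followed by a
    '<docstring> ... <status>' line, the status is attributed to the
    identifier instead of the docstring.
    """
    results: dict[str, str] = {}
    pending = None  # identifier line still waiting for its result line
    for raw in log.splitlines():
        line = raw.strip()
        if HEADER_RE.match(line):
            pending = line
            continue
        match = RESULT_RE.match(line)
        if match is None:
            continue  # unrelated output; keep waiting for a result line
        key = pending if pending is not None else match.group("text")
        results[key] = match.group("status")
        pending = None
    return results


sample_log = """\
test_a (pkg.mod.SomeTests.test_a) ... ok
test_b (pkg.mod.SomeTests.test_b)
#24958 - Variant of test_a for ... ok
test_c (pkg.mod.SomeTests.test_c)
#24958 - Variant of test_a for ... FAIL
"""
parsed = parse_django_log(sample_log)
```

Keyed this way, `test_b` and `test_c` stay distinct even though their docstring lines are identical.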

BoxiYu avatar Feb 10 '25 03:02 BoxiYu

Oh hmm, like you're suggesting the correct name should be:

test_inlineformset_factory_ignores_default_pks_on_submit (model_formsets.test_uuid.InlineFormsetTests.test_inlineformset_factory_ignores_default_pks_on_submit) #24377 - Inlines with a model field default should ignore that default

Instead of just

#24377 - Inlines with a model field default should ignore that default

Is this the right understanding?

When you re-run evaluation w/ gold patches on all django instances, do you get all instances passing again? (Just realized that this fix would also require an update to the SWE-bench dataset).

I'm still a bit against integrating this change because

  • This would also require a SWE-bench dataset update, which would require re-validation, and that would take a lot of time.
  • Removing spaces from the tests makes the test name less readable.

I understand the existing log parsing may not be a perfect capture of the test naming, but I believe it's quite reliable for the large majority of instances.

An easier fix may be changing the Django parsing logic such that if any test such as

#24958 - Variant of test_inlineformset_factory_nulls_default_pks for ... ok

shows up multiple times and corresponds to a fail at least once, we make sure the test case maps to a fail and does not get overridden by an ok later on. Again, I realize it may not be perfect, but the proposed solution is quite costly in terms of time, and it would invalidate existing submissions to SWE-bench. If you would like to carry out this effort, I will support you, but I don't believe it's worthwhile.
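A rough sketch of that easier fix, keeping the existing test names but never letting a later ok overwrite a recorded failure. This is an illustration only: the function name `parse_with_sticky_failures` is hypothetical, and the statuses are assumed to be printed as `ok`, `FAIL`, or `ERROR`.

```python
FAILURE_STATUSES = {"FAIL", "ERROR"}
KNOWN_STATUSES = {"ok"} | FAILURE_STATUSES


def parse_with_sticky_failures(log: str) -> dict[str, str]:
    """Parse '<text> ... <status>' lines; once a name fails, it stays failed.

    Hypothetical sketch of the 'fail wins' rule for repeated names.
    """
    results: dict[str, str] = {}
    for raw in log.splitlines():
        text, sep, status = raw.strip().rpartition(" ... ")
        if not sep or status not in KNOWN_STATUSES:
            continue  # not a result line
        if results.get(text) in FAILURE_STATUSES:
            continue  # an earlier failure is kept; later results are ignored
        results[text] = status
    return results


sample_log = """\
#24958 - Variant of test_inlineformset_factory_nulls_default_pks for ... FAIL
#24958 - Variant of test_inlineformset_factory_nulls_default_pks for ... ok
#24377 - Inlines with a model field default should ignore that default ... ok
"""
parsed = parse_with_sticky_failures(sample_log)
```

This avoids a dataset update, at the cost that tests sharing the same printed text are still merged into a single entry.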

john-b-yang avatar Mar 03 '25 06:03 john-b-yang

Yeah, your understanding is right. Indeed, I have found that more than half of the test names in SWE-Bench Lite and SWE-Bench Verified are not recorded correctly.

Meanwhile, I recall that the parsing tools in this repo have been updated multiple times, which might also bring

I have uploaded the updated annotations of SWE-Bench_Lite and SWE-Bench_Verified below for inspection:

updated_parser_test_instance_dict_verified.json updated_parser_test_instance_dict_lite.json

I have checked some of the submitted patches and found that more than 100 code patches have been erroneously labeled as passed for this reason, so I think this deserves more attention. I am happy to carry out the effort to fix this issue (at least correcting the problems we already know about, to make the SWE-Bench evaluation more rigorous).

BoxiYu avatar Mar 03 '25 06:03 BoxiYu

I see, ok I'll take a look at these at some point within the next 2 weeks. Thanks for the data!

john-b-yang avatar Mar 03 '25 07:03 john-b-yang

Hi John @john-b-yang , is there an update? Thanks!

BoxiYu avatar Jul 02 '25 01:07 BoxiYu