
Add auto-update system for integration test hardcoded expectations

yonigozlan opened this pull request 3 weeks ago

This PR introduces a decorator-based system to automatically update hardcoded values in integration tests, making them easier to maintain when switching to different hardware or when other changes impact a large number of integration tests.

For context, another PR I'm working on, which loads fast image processors by default (https://github.com/huggingface/transformers/pull/41388), is breaking a lot of integration tests. This is expected, but really annoying to fix manually. I could instead force all the tests to use slow image processors, but that would just postpone the issue, as we might fully deprecate slow image processors in the future.

The new `@record_expectations` decorator automatically captures actual values from tests and updates the hardcoded expected values directly in the source file when running with `UPDATE_EXPECTATIONS=1`.

How to use in the test files

Basic usage:

```python
@record_expectations(pairs=[("actual_logits", "expected_logits")])
def test_inference(self):
    actual_logits = model(**inputs).logits
    expected_logits = torch.tensor([[24.5701, 19.3049]])  # Auto-updated
    torch.testing.assert_close(actual_logits, expected_logits)
```
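To make the mechanics concrete, here is a minimal, hypothetical sketch of how such a decorator could capture the test's local variables (the real PR may well implement this differently; `wrapper.recorded` and the tracing approach are my own assumptions for illustration):

```python
import functools
import os
import sys


def record_expectations(pairs):
    """Hypothetical sketch: when UPDATE_EXPECTATIONS=1, trace the test
    function, grab its locals when it returns, and record which expected
    values differ from the actual ones (instead of rewriting the file)."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            if os.environ.get("UPDATE_EXPECTATIONS") != "1":
                return test_fn(*args, **kwargs)
            captured = {}

            def global_tracer(frame, event, arg):
                # Only trace the decorated test's own frame
                if event == "call" and frame.f_code is test_fn.__code__:
                    def local_tracer(frame, event, arg):
                        if event == "return":
                            captured.update(frame.f_locals)
                        return local_tracer
                    return local_tracer
                return None

            old_trace = sys.gettrace()
            sys.settrace(global_tracer)
            try:
                result = test_fn(*args, **kwargs)
            finally:
                sys.settrace(old_trace)
            # Map each stale expected value to the freshly captured actual one
            wrapper.recorded = {
                expected_name: captured[actual_name]
                for actual_name, expected_name in pairs
                if captured.get(actual_name) != captured.get(expected_name)
            }
            return result

        wrapper.recorded = {}
        return wrapper
    return decorator
```

A real implementation would then feed `wrapper.recorded` into a source-rewriting step; this sketch only demonstrates the capture side.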

This also works with the recently introduced Expectations object, where values will be updated only for the current hardware configuration. Cc @ydshieh

```python
@record_expectations(pairs=[("decoded_text", "expected_decoded_text")])
def test_generation(self):
    decoded_text = processor.decode(model.generate(**inputs)[0])
    expected_decoded_text = Expectations({
        ("cuda", (7, None)): "Output on T4...",
        ("cuda", (8, None)): "Output on A10...",
    }).get_expectation()  # Auto-updated per hardware
    self.assertEqual(decoded_text, expected_decoded_text)
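For readers unfamiliar with the `Expectations` object, here is a simplified stand-in showing the key-matching idea: keys are `(device_type, (major, minor))` tuples, a `None` minor version acts as a wildcard, and a `None` capability as a device-wide default. Note that the real transformers helper detects the hardware itself; passing `device`/`capability` explicitly here is an assumption made for illustration:

```python
class Expectations:
    """Simplified stand-in for the Expectations helper discussed above."""

    def __init__(self, data):
        self.data = data

    def get_expectation(self, device="cuda", capability=None):
        # 1) exact (device, capability) match
        if (device, capability) in self.data:
            return self.data[(device, capability)]
        # 2) same major version with a wildcard minor, e.g. ("cuda", (8, None))
        if capability is not None:
            key = (device, (capability[0], None))
            if key in self.data:
                return self.data[key]
        # 3) device-wide default, e.g. ("cuda", None)
        if (device, None) in self.data:
            return self.data[(device, None)]
        raise KeyError(f"no expectation for {device} {capability}")
```

With this matching scheme, a value stored under `("cuda", (8, None))` is returned for any compute capability 8.x device, which is why the auto-updater only needs to touch the entry for the hardware the test actually ran on.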

We could also add flags to decide what we want to update: `("cuda", None)` by default? Or a specific hardware key like `("cuda", (8, 6))` by default? I'm not fully sure what would be best here, so I'd love to hear your thoughts.

How to update hardcoded values

To update expectations, we can just set the `UPDATE_EXPECTATIONS=1` environment variable on the usual pytest command:

```shell
UPDATE_EXPECTATIONS=1 RUN_SLOW=1 python -m pytest tests/models/clip/test_modeling_clip.py -k CLIPModelIntegrationTest
```

I only added the decorators and made the other necessary modifications for a few modeling tests for now, but I'm happy to extend this to others once I get some feedback.

I think this could make our lives easier, but I'd love to have your thoughts on this! Cc @ydshieh @ArthurZucker @Cyrilvallez @molbap @zucchini-nlp

yonigozlan · Dec 02 '25 22:12


This is great work!

But as I commented internally on Slack, I think it's best we combine the current work in this PR with my previous work on

- `PATCH_TESTING_METHODS_TO_COLLECT_OUTPUTS`
- `patch_testing_methods_to_collect_info`

plus a few formatting improvements.

So please don't merge this PR yet 🙏

ydshieh · Dec 03 '25 15:12

[For maintainers] Suggested jobs to run (before merge)

run-slow: clip, kosmos2_5, llava_next_video

github-actions[bot] · Dec 03 '25 21:12

Looks very nice in general, but IMO we should add huge warnings explaining that this SHOULD NOT be used in general, i.e. the env variable should always be deactivated EXCEPT in a few PRs that we run manually when we expect some slight new numerical differences. Otherwise it's very misleading and makes it look like we force our tests to pass with whatever outputs we get. So maybe the decorator should be named `manually_record_expectations` (not the best IMO lol, but you see the gist) or something, so that it's crystal clear it is NOT SUPPOSED to change values all the time.

I talked to Yoni internally; I will take this PR and finalize it with something (slightly or a bit more) different.

ydshieh · Dec 08 '25 16:12

After more offline discussion with @molbap, we both agreed that this PR is way too dangerous in general. We cannot accept it in its current state. What we believe keeps the same advantage without all the dangers is a separate helper in utils that does the same thing, instead of the decorator.

The idea is that this absolutely needs to stay INDEPENDENT, ISOLATED, and MANUALLY RUN. So absolutely no changes to test files should be added for this.

Cyrilvallez · Dec 15 '25 15:12