Reproducible tests
Description
I have seen some Python tests failing in a non-deterministic manner.
- https://github.com/ietf-tools/datatracker/actions/runs/10786037653/job/29912248540#step:6:1902
- https://github.com/ietf-tools/datatracker/actions/runs/10802493115/job/29964666951?pr=7921#step:6:1897
- https://github.com/ietf-tools/datatracker/actions/runs/10802493115/job/29971686883?pr=7921#step:6:1896
They are not always reproducible, and they usually pass when retried.
It would be helpful to identify the sources of this randomness in the tests and ensure more consistent behaviour.
Code of Conduct
- [X] I agree to follow the IETF's Code of Conduct
In the dev team chat, it was suggested that we look into using a consistent random seed for the factories / fakers as we do for Playwright tests.
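For what it's worth, here is a minimal sketch of what that could look like, assuming the tests' randomness comes from the stdlib `random` module, factory_boy, and Faker; the `DATATRACKER_TEST_SEED` variable name and the default value are made up for illustration:

```python
import os
import random

import factory.random
from faker import Faker


def seed_test_randomness(default_seed: int = 20240917) -> int:
    """Seed every random source the test factories rely on; return the seed used."""
    seed = int(os.environ.get("DATATRACKER_TEST_SEED", default_seed))
    random.seed(seed)                   # plain `random` module users
    factory.random.reseed_random(seed)  # factory_boy's shared generator
    Faker.seed(seed)                    # Faker's global seed (used by factory.Faker)
    print(f"Using test random seed: {seed}")  # echo it so a failing run can be replayed
    return seed
```

If something along these lines were hooked into the test runner's setup and the seed echoed in the CI log, a failing run could at least be replayed locally with the same generated data.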
I think we're currently relying on that randomness / low-grade fuzzing to compensate for tests that are not granular enough to cover their inputs completely. It'd be great to break things down into small enough units that we could do this, but it'll be a huge project. Some of the older tests are enormous!
As much as possible we should treat the intermittent failures as bugs and avoid the habit of re-running them until they pass, especially when it's not just a "can't run at midnight" timing issue.
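For that timing flavour of flakiness specifically, pinning the clock in the test usually fixes it for good. A minimal sketch (the `_now` / `is_expired` helpers are made-up stand-ins, not datatracker code):

```python
import datetime
from unittest import TestCase, mock


def _now() -> datetime.datetime:
    """Single seam for the current time so tests can replace it."""
    return datetime.datetime.now(datetime.timezone.utc)


def is_expired(deadline: datetime.datetime) -> bool:
    """Toy stand-in for code that compares a deadline against the current time."""
    return _now() > deadline


class ExpiryTests(TestCase):
    def test_not_expired_just_before_midnight(self):
        deadline = datetime.datetime(2024, 9, 18, tzinfo=datetime.timezone.utc)
        frozen = deadline - datetime.timedelta(seconds=1)
        # Pin the clock instead of depending on when the CI job happens to run.
        with mock.patch(f"{__name__}._now", return_value=frozen):
            self.assertFalse(is_expired(deadline))
```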
Some related tickets:
- https://github.com/ietf-tools/datatracker/issues/5834 (where we've been tracking such transient failures)
- https://github.com/ietf-tools/datatracker/issues/4758
- https://github.com/ietf-tools/datatracker/issues/5173
- https://github.com/ietf-tools/datatracker/issues/6003 (different topic, but involves adjusting fakers)
Maybe we should unify the list - we've also been flagging these sorts of failures with the "transient test failures" label, at least sometimes.
Closing this as the action run logs are long gone - nothing to mine now.