Improve visual parity tests by adding a threshold
Is your feature request related to a problem? Please describe. When there are subtle differences between platforms or graphics cards, the generated images may differ slightly. I had a test fail because there was a 1 pixel difference between the images (reference vs screenshot).
Describe the solution you'd like
Add a threshold parameter in ImageTester.CompareImage. The simplest thing would be to calculate the number of different pixels divided by the total number of pixels in the image. This way we can allow 1-2% of changes across different hardware before a manual inspection for differences is needed.
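For illustration, here is a minimal sketch of what such a percentage-based comparison could look like; the method shape, the threshold parameter, and the raw RGBA byte layout are assumptions for this example, not the actual ImageTester.CompareImage signature:

```csharp
// Minimal sketch of a percentage-based comparison. All names and the pixel
// layout are assumptions for illustration, not the real ImageTester API.
public static class ImageDiffSketch
{
    /// <summary>
    /// Returns true when the fraction of differing pixels is within the threshold
    /// (e.g. 0.02 allows up to 2% of pixels to differ).
    /// Both images are assumed to be the same size, 4 bytes per pixel (RGBA).
    /// </summary>
    public static bool CompareWithThreshold(byte[] reference, byte[] screenshot, double threshold = 0.02)
    {
        if (reference.Length != screenshot.Length)
            return false;

        int totalPixels = reference.Length / 4;
        int differingPixels = 0;

        for (int i = 0; i < reference.Length; i += 4)
        {
            // A pixel counts as different if any of its channels differ.
            if (reference[i] != screenshot[i] ||
                reference[i + 1] != screenshot[i + 1] ||
                reference[i + 2] != screenshot[i + 2] ||
                reference[i + 3] != screenshot[i + 3])
            {
                differingPixels++;
            }
        }

        return (double)differingPixels / totalPixels <= threshold;
    }
}
```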
I don't think there's much point in implementing a more complex comparison algorithm, but in case people are interested in the topic, here's an interesting SO post.
The threshold should be configurable per end user test class.
I believe this specific test is meant to ensure rendering stays consistent between runs. Implementing a threshold would just sidestep the issue while reducing the test's usefulness. Rendering can be inconsistent across vendors, APIs, drivers, or even driver settings, but it definitely should be consistent if none of those change. Given that, here's what I propose (a rough code sketch follows the list):
- From a known good state, we compile the test scene as an executable.
- Include this compiled executable with that test
- Run that executable first when the test boots up
- Retrieve the image it generated; this will be our reference image, and it shouldn't change, since the executable and assets are all precompiled.
- The test then continues: it compiles the graphics code and renders the scene based on the changes the user made in their branch.
- We now compare the results; if the comparison fails, the changes the user introduced in that branch messed with the rendering.
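To make the proposed flow concrete, here's a rough sketch of how such a test could be structured; the executable name, RenderCurrentScene, and ImagesMatch are purely illustrative placeholders, not existing Stride APIs:

```csharp
// Rough sketch of the proposed flow. Every name here (ReferenceRenderer.exe,
// RenderCurrentScene, ImagesMatch) is an illustrative placeholder, not an
// existing Stride API.
using System;
using System.Diagnostics;

public class VisualParityTestSketch
{
    public bool RunParityTest()
    {
        // 1. Run the precompiled, known-good executable bundled with the test.
        //    Its output should never change, because the executable and its assets are frozen.
        using (var referenceRun = Process.Start("ReferenceRenderer.exe", "--output reference.png"))
        {
            referenceRun.WaitForExit();
        }

        // 2. Compile and render the scene with the code from the user's current branch.
        string currentImagePath = RenderCurrentScene();

        // 3. Compare the two images; a mismatch means the branch changed rendering behaviour.
        return ImagesMatch("reference.png", currentImagePath);
    }

    // Placeholders for the parts the test harness would actually provide.
    private string RenderCurrentScene() => throw new NotImplementedException();
    private bool ImagesMatch(string referencePath, string currentPath) => throw new NotImplementedException();
}
```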
This is a lot more complex than what you propose, but I do think it is closer to what that test is trying to do. If we can't do what I'm proposing, I'd say it is better to leave that specific test to run only in CI; that environment is completely static, so it should always pass if the reference is set up right.
I think @manio143 already builds the images from master, effectively zeroing out differences between platform and graphics card. The issue I think he's looking to address is that on my system all tests succeed; on his system they used to, but now one test trips over just one pixel?
If that is the case, then the test is not consistent from run to run, which is what should be addressed, I think.
Where I'm coming from with this is thinking about end-user scenarios - we should use high-confidence thresholds like 0% for Stride's own tests, but that may not be necessary in all cases. I imagine a situation where a user wants to check that a scene in their game works well by comparing screenshots after every meaningful action, but if they change a small detail in the scene it may not make sense for those tests to start failing straight away. I guess it depends on how often the screenshots would break and how much value they would provide to the user.
We may want to see what the spread in success rate of those tests is on various hardware and try to pinpoint why these changes are happening.