MIOpen error tolerance
Scope
While testing variant 0 for bnorm backward spatial single, we observed that the error tolerance in MIOpenDriver is too permissive. For instance, the following results were observed:
Backwards prop batch norm verification passed on dx (0.151171)
Backwards prop batch norm verification passed on dscale (0.277024)
Backwards prop batch norm verification passed on dbias (0.217736)
Backwards Prop Batch Norm Verifies on CPU and GPU.
Note that the run passes according to the driver checks. However, the reported errors are high, so verification should actually fail.
Notes for the reviewer
- Fixed the `maxrms` definition because it was too large compared to the rms values being computed (due to the rms normalization). The `VerifyForward` method seems to already use the same value for `maxrms` as the new implementation in `VerifyBackward`.
- Fixed the normalization of the rms. Previously, the normalization factor was the maximum absolute value found across both the reference values and the results. This may not be correct because it reduces the significance of the rms when the results differ a lot from the reference values. For example, if the reference values are of order 10 and the results are of order 1000, the rms obtained (of order 100) is divided by a number of order 1000, so the normalized rms is 100/1000 = 0.1, which is far lower than it should be.
- The new implementation only takes the reference values into account when computing the normalization factor.
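The effect of the two normalizations can be illustrated with a minimal sketch. It is written in Python rather than the driver's C++ and is not the MIOpen implementation; the function names `rms_error`, `normalized_rms_old`, and `normalized_rms_new` are hypothetical, and the input values are made up for illustration.

```python
import math


def rms_error(ref, out):
    """Root-mean-square of the element-wise difference between ref and out."""
    return math.sqrt(sum((r - o) ** 2 for r, o in zip(ref, out)) / len(ref))


def normalized_rms_old(ref, out):
    """Assumed previous behaviour: normalize by the largest magnitude
    found in either the reference values or the results."""
    scale = max(max(abs(r) for r in ref), max(abs(o) for o in out))
    return rms_error(ref, out) / scale if scale > 0 else 0.0


def normalized_rms_new(ref, out):
    """Assumed new behaviour: normalize by the largest magnitude
    found in the reference values only."""
    scale = max(abs(r) for r in ref)
    return rms_error(ref, out) / scale if scale > 0 else 0.0


# Illustrative data: reference of order 10, results of order 1000.
ref = [10.0, -10.0, 10.0, -10.0]
out = [1000.0, -1000.0, 1000.0, -1000.0]

old_value = normalized_rms_old(ref, out)
new_value = normalized_rms_new(ref, out)
print(f"old normalization: {old_value:.4f}")
print(f"new normalization: {new_value:.4f}")
```

With these inputs the old normalization yields about 0.99 while the reference-only normalization yields 99. Because wildly wrong results inflate the old divisor, the old metric stays small no matter how large the mismatch is, which is why a loose threshold could let such a result pass.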
Starting CI!
Restarting CI again.
Failing on formatting. Please run the formatting script.
@BrianHarrisonAMD oops, I forgot to run this earlier. Should be ok now!
Btw I see that some smoke tests are failing for gfx90a (MI200), but I have double-checked on our side and they pass on our MI200s. How should we proceed with those?
Looks like it was stopped / aborted before fully finishing. I'll run the internal CI to see if it passes.
@BrianHarrisonAMD any updates on this?
I can start the build again, and let @BradPepersAMD know about it.
CI running now.
Did this pass CI and are we good to merge this now?
We need to update this with the latest changes and re-run CI on it. Since it impacts all tests and driver commands, we need to do a large sweep on these changes before they can be safely merged.
Imported to ROCm/rocm-libraries