MIOpen icon indicating copy to clipboard operation
MIOpen copied to clipboard

MIOpen error tolerance

Open Beanavil opened this issue 9 months ago • 6 comments

Scope

While testing the variant 0 for bnorm backward spatial single, we observed that the error tolerance in MIOpenDriver is sub-optimal. For instance, the following results were observed

Backwards prop batch norm verification passed on dx (0.151171)
Backwards prop batch norm verification passed on dscale (0.277024)
Backwards prop batch norm verification passed on dbias (0.217736)
Backwards Prop Batch Norm Verifies on CPU and GPU.

Notice that the execution is successful, according to the driver checks. However, the error reported is high so this should actually be failing.

Notes for the reviewer

  • Fixed the maxrms definition because it was too big compared to the rms values being computed (due to the rms normalization done). The VerifyForward method seems to also be using the same value for maxrms as the new implementation in VerifyBackward.
  • Fixed the normalization of the rms. Previously we were finding the maximum absolute value from both the reference values and the results, but this may not be correct because it reduces the significance of the rms: if the results differ a lot from the reference values (e.g. ref values are order of 10 and results are order 1000) then the normalization will divide the rms obtained (which will be order of 100) by a number order of 1000 and then the normalized rms will be 100/1000 = 0.1 which is way lower than what it should be.
    • The new implementation only takes into account the reference results for computing the normalization factor

Beanavil avatar Mar 17 '25 20:03 Beanavil

Starting CI!

BrianHarrisonAMD avatar Mar 19 '25 14:03 BrianHarrisonAMD

Restarting CI again.

BrianHarrisonAMD avatar Mar 28 '25 19:03 BrianHarrisonAMD

Failing on formatting. Please run the formatting script.

BrianHarrisonAMD avatar Mar 30 '25 17:03 BrianHarrisonAMD

@BrianHarrisonAMD opsie, missed to run this before. Should be ok now!

Beanavil avatar Mar 31 '25 08:03 Beanavil

Btw I see that some smoke tests are failing for gfx90a (MI200), but I have double-checked on our side and they pass on our MI200s, how should we proceed for those?

Beanavil avatar Mar 31 '25 12:03 Beanavil

Btw I see that some smoke tests are failing for gfx90a (MI200), but I have double-checked on our side and they pass on our MI200s, how should we proceed for those?

Looks like it was stopped / aborted before fully finishing. Ill run the internal CI to see if it passes.

BrianHarrisonAMD avatar Mar 31 '25 13:03 BrianHarrisonAMD

@BrianHarrisonAMD any updates on this?

Beanavil avatar May 14 '25 09:05 Beanavil

I can start the build again, and let @BradPepersAMD know about it.

BrianHarrisonAMD avatar May 14 '25 14:05 BrianHarrisonAMD

CI running now.

BrianHarrisonAMD avatar May 14 '25 16:05 BrianHarrisonAMD

Did this pass CI and are we good to merge this now?

BradPepersAMD avatar Jul 14 '25 06:07 BradPepersAMD

We need to update this with latest, and re-run CI on it. Since it impacts all tests, and driver commands we need to do a large sweep on these changes before they can be safely merged.

BrianHarrisonAMD avatar Jul 24 '25 16:07 BrianHarrisonAMD

Imported to ROCm/rocm-libraries

amd-hsivasun avatar Jul 28 '25 20:07 amd-hsivasun