
Implement MSELoss Function

o2buzzle opened this issue 1 year ago • 12 comments

This PR ports the MSELoss family of loss functions to MIOpen:

  • MSELoss
  • MSELossUnreduced

Performance measurements suggest that, in general, we outperform ROCm on the forward, reduced operation (mostly thanks to parallel reduction).
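The reduced forward path leans on a standard tree-style wavefront reduction; a rough HIP sketch of the technique (illustrative only, not this PR's exact kernel):

```cpp
// Illustrative wavefront-level tree reduction in HIP (not the exact kernel
// from this PR). Each lane holds one squared difference; shuffles halve the
// number of active lanes at each step.
#include <hip/hip_runtime.h>

__device__ float warp_reduce_sum(float val)
{
    for(int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset); // pull the value from lane (id + offset)
    return val; // lane 0 ends up holding the wavefront's total
}

__global__ void mse_partial_sums(const float* x, const float* y, float* partials, size_t n)
{
    size_t gid = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    float diff = (gid < n) ? (x[gid] - y[gid]) : 0.0f;
    float sum  = warp_reduce_sum(diff * diff);
    // One partial per wavefront; 'partials' is sized to the number of
    // wavefronts launched and is reduced again by a follow-up launch.
    if(threadIdx.x % warpSize == 0)
        partials[gid / warpSize] = sum;
}
```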

Sample performance measurements

**float32**

| op_name | dtype | size-reduction | contiguous | model | direction | ROCm kernel time | MIOpen kernel time | MIOpenDriver cmdline | ROCm/MIOpen |
|---|---|---|---|---|---|---|---|---|---|
| MSELoss | float32 | [32 80]-sum | noncontiguous | tacotron2 | bwd | 20784 | 11804 | `mseloss -I 32x80 -r sum -Z 0_bwd` | 1.76075906472382 |
| MSELoss | float32 | [25 100 100]-mean | noncontiguous | random-md | fwd | 27568 | 16551 | `mseloss -I 25x100x100 -r mean -Z 0` | 1.66563953839647 |
| MSELoss | float32 | [25 300]-sum | noncontiguous | random-md | bwd | 15888 | 10204 | `mseloss -I 25x300 -r sum -Z 0_bwd` | 1.55703645629165 |
| MSELoss | float32 | [10 10 50 100 100]-none | noncontiguous | random-lg | bwd | 14407533 | 9301430 | `mseloss -I 10x10x50x100x100 -r none -Z 0_bwd` | 1.54895892352036 |
| MSELoss | float32 | [32 80]-mean | contiguous | tacotron2 | fwd | 16544 | 10791 | `mseloss -I 32x80 -r mean -Z 1_contig` | 1.53312945973496 |
| MSELoss | float32 | [25 300]-mean | noncontiguous | random-md | bwd | 15632 | 10400 | `mseloss -I 25x300 -r mean -Z 0_bwd` | 1.50307692307692 |
| MSELoss | float32 | [32 80]-none | noncontiguous | tacotron2 | bwd | 17760 | 12089 | `mseloss -I 32x80 -r none -Z 0_bwd` | 1.46910414426338 |
| MSELoss | float32 | [32 80]-sum | contiguous | tacotron2 | fwd | 15952 | 10862 | `mseloss -I 32x80 -r sum -Z 1_contig` | 1.46860614988032 |
| MSELoss | float32 | [10 100 100 100]-none | noncontiguous | random-lg | bwd | 2423129 | 1661950 | `mseloss -I 10x100x100x100 -r none -Z 0_bwd` | 1.45800355004663 |
| MSELoss | float32 | [2000 3000]-none | noncontiguous | random-lg | bwd | 547674 | 376328 | `mseloss -I 2000x3000 -r none -Z 0_bwd` | 1.45531026126145 |
| MSELoss | float32 | [25 100 100]-sum | contiguous | random-md | fwd | 23745 | 16676 | `mseloss -I 25x100x100 -r sum -Z 1_contig` | 1.42390261453586 |
| MSELoss | float32 | [25 100 100]-mean | contiguous | random-md | fwd | 23760 | 16711 | `mseloss -I 25x100x100 -r mean -Z 1_contig` | 1.42181796421519 |
| MSELoss | float32 | [2000 3000]-sum | contiguous | random-lg | fwd | 96514 | 68268 | `mseloss -I 2000x3000 -r sum -Z 1_contig` | 1.41375168453741 |
| MSELoss | float32 | [1000000]-mean | noncontiguous | random-lg | fwd | 33825 | 24231 | `mseloss -I 1000000 -r mean -Z 0` | 1.39593908629442 |
| MSELoss | float32 | [32 80 870]-sum | contiguous | t5 | fwd | 48673 | 34880 | `mseloss -I 32x80x870 -r sum -Z 1_contig` | 1.39544151376147 |
| MSELoss | float32 | [100 20 20 20]-none | noncontiguous | random-md | bwd | 83490 | 59877 | `mseloss -I 100x20x20x20 -r none -Z 0_bwd` | 1.39435843479132 |
| MSELoss | float32 | [32 80 870]-mean | contiguous | t5 | fwd | 48273 | 34934 | `mseloss -I 32x80x870 -r mean -Z 1_contig` | 1.38183431613901 |
| MSELoss | float32 | [100 10 10 10 10]-none | noncontiguous | random-md | bwd | 98626 | 71770 | `mseloss -I 100x10x10x10x10 -r none -Z 0_bwd` | 1.37419534624495 |
| MSELoss | float32 | [128 80 870]-sum | contiguous | tacotron2 | fwd | 129890 | 94687 | `mseloss -I 128x80x870 -r sum -Z 1_contig` | 1.37178282129543 |
| MSELoss | float32 | [2000 3000]-mean | contiguous | random-lg | fwd | 94226 | 68694 | `mseloss -I 2000x3000 -r mean -Z 1_contig` | 1.37167729350453 |
**float16**

| op_name | dtype | size-reduction | contiguous | model | direction | ROCm kernel time | MIOpen kernel time | MIOpenDriver cmdline | ROCm/MIOpen |
|---|---|---|---|---|---|---|---|---|---|
| MSELoss | float16 | [25 100 100]-mean | noncontiguous | random-md | fwd | 30241 | 16124 | `mselossfp16 -I 25x100x100 -r mean -Z 0` | 1.87552716447532 |
| MSELoss | float16 | [25 300]-none | noncontiguous | random-md | bwd | 18624 | 10080 | `mselossfp16 -I 25x300 -r none -Z 0_bwd` | 1.84761904761905 |
| MSELoss | float16 | [25 100 100]-sum | noncontiguous | random-md | fwd | 28785 | 16160 | `mselossfp16 -I 25x100x100 -r sum -Z 0` | 1.78125 |
| MSELoss | float16 | [32 80]-mean | contiguous | tacotron2 | fwd | 16880 | 10506 | `mselossfp16 -I 32x80 -r mean -Z 1_contig` | 1.6067009328003 |
| MSELoss | float16 | [32 80]-sum | contiguous | tacotron2 | fwd | 17008 | 10649 | `mselossfp16 -I 32x80 -r sum -Z 1_contig` | 1.59714527185651 |
| MSELoss | float16 | [25 300]-sum | noncontiguous | random-md | bwd | 16368 | 10275 | `mselossfp16 -I 25x300 -r sum -Z 0_bwd` | 1.59299270072993 |
| MSELoss | float16 | [32 80]-sum | noncontiguous | tacotron2 | bwd | 19920 | 12515 | `mselossfp16 -I 32x80 -r sum -Z 0_bwd` | 1.59168997203356 |
| MSELoss | float16 | [25 300]-mean | noncontiguous | random-md | bwd | 16448 | 10453 | `mselossfp16 -I 25x300 -r mean -Z 0_bwd` | 1.5735195637616 |
| MSELoss | float16 | [32 80]-none | noncontiguous | tacotron2 | bwd | 18128 | 12035 | `mselossfp16 -I 32x80 -r none -Z 0_bwd` | 1.50627336933943 |
| MSELoss | float16 | [25 100 100]-sum | contiguous | random-md | fwd | 23392 | 15911 | `mselossfp16 -I 25x100x100 -r sum -Z 1_contig` | 1.47017786437056 |
| MSELoss | float16 | [32 80]-mean | noncontiguous | tacotron2 | bwd | 17344 | 12125 | `mselossfp16 -I 32x80 -r mean -Z 0_bwd` | 1.43043298969072 |
| MSELoss | float16 | [10 10 50 100 100]-none | noncontiguous | random-lg | bwd | 11586974 | 8183960 | `mselossfp16 -I 10x10x50x100x100 -r none -Z 0_bwd` | 1.41581508218515 |
| MSELoss | float16 | [25 100 100]-mean | contiguous | random-md | fwd | 22784 | 16107 | `mselossfp16 -I 25x100x100 -r mean -Z 1_contig` | 1.41454026199789 |
| MSELoss | float16 | [25 100 100]-none | noncontiguous | random-md | bwd | 30769 | 22134 | `mselossfp16 -I 25x100x100 -r none -Z 0_bwd` | 1.39012379145206 |
| MSELoss | float16 | [100 20 20 20]-none | noncontiguous | random-md | bwd | 75393 | 57352 | `mselossfp16 -I 100x20x20x20 -r none -Z 0_bwd` | 1.31456618775282 |
| MSELoss | float16 | [100 10 10 10 10]-none | noncontiguous | random-md | bwd | 89410 | 68943 | `mselossfp16 -I 100x10x10x10x10 -r none -Z 0_bwd` | 1.29686842754159 |
| MSELoss | float16 | [2000 3000]-none | noncontiguous | random-lg | bwd | 463896 | 360150 | `mselossfp16 -I 2000x3000 -r none -Z 0_bwd` | 1.28806330695544 |
| MSELoss | float16 | [10 100 100 100]-none | noncontiguous | random-lg | bwd | 1869280 | 1504310 | `mselossfp16 -I 10x100x100x100 -r none -Z 0_bwd` | 1.24261621607249 |
| MSELoss | float16 | [25 100 100]-mean | noncontiguous | random-md | bwd | 26752 | 21618 | `mselossfp16 -I 25x100x100 -r mean -Z 0_bwd` | 1.23748727911925 |
| MSELoss | float16 | [1000000]-mean | contiguous | random-lg | fwd | 29456 | 24249 | `mselossfp16 -I 1000000 -r mean -Z 1_contig` | 1.21473050435069 |
**bfloat16**

| op_name | dtype | size-reduction | contiguous | model | direction | ROCm kernel time | MIOpen kernel time | MIOpenDriver cmdline | ROCm/MIOpen |
|---|---|---|---|---|---|---|---|---|---|
| MSELoss | bfloat16 | [100 20 20 20]-sum | noncontiguous | random-md | fwd | 43889 | 22329 | `mselossbfp16 -I 100x20x20x20 -r sum -Z 0` | 1.96556048188454 |
| MSELoss | bfloat16 | [25 300]-sum | contiguous | random-md | fwd | 19904 | 10169 | `mselossbfp16 -I 25x300 -r sum -Z 1_contig` | 1.95732127052808 |
| MSELoss | bfloat16 | [100 10 10 10 10]-mean | noncontiguous | random-md | fwd | 48049 | 24693 | `mselossbfp16 -I 100x10x10x10x10 -r mean -Z 0` | 1.94585510063581 |
| MSELoss | bfloat16 | [10000]-mean | noncontiguous | random-md | fwd | 19664 | 10151 | `mselossbfp16 -I 10000 -r mean -Z 0` | 1.93714904935474 |
| MSELoss | bfloat16 | [100 20 20 20]-mean | noncontiguous | random-md | fwd | 42273 | 22347 | `mselossbfp16 -I 100x20x20x20 -r mean -Z 0` | 1.89166331051148 |
| MSELoss | bfloat16 | [25 300]-none | noncontiguous | random-md | bwd | 19216 | 10293 | `mselossbfp16 -I 25x300 -r none -Z 0_bwd` | 1.86689983483921 |
| MSELoss | bfloat16 | [25 100 100]-sum | noncontiguous | random-md | fwd | 29441 | 16302 | `mselossbfp16 -I 25x100x100 -r sum -Z 0` | 1.80597472702736 |
| MSELoss | bfloat16 | [25 100 100]-mean | noncontiguous | random-md | fwd | 28848 | 16107 | `mselossbfp16 -I 25x100x100 -r mean -Z 0` | 1.79102253678525 |
| MSELoss | bfloat16 | [32 80]-mean | contiguous | tacotron2 | fwd | 17456 | 10240 | `mselossbfp16 -I 32x80 -r mean -Z 1_contig` | 1.7046875 |
| MSELoss | bfloat16 | [32 80]-sum | contiguous | tacotron2 | fwd | 17616 | 10507 | `mselossbfp16 -I 32x80 -r sum -Z 1_contig` | 1.67659655467783 |
| MSELoss | bfloat16 | [32 80]-sum | noncontiguous | tacotron2 | bwd | 21296 | 13013 | `mselossbfp16 -I 32x80 -r sum -Z 0_bwd` | 1.63651732882502 |
| MSELoss | bfloat16 | [25 100 100]-sum | contiguous | random-md | fwd | 24832 | 15982 | `mselossbfp16 -I 25x100x100 -r sum -Z 1_contig` | 1.55374796646227 |
| MSELoss | bfloat16 | [25 100 100]-mean | contiguous | random-md | fwd | 24592 | 15911 | `mselossbfp16 -I 25x100x100 -r mean -Z 1_contig` | 1.5455973854566 |
| MSELoss | bfloat16 | [10 100 100 100]-none | noncontiguous | random-lg | bwd | 1945089 | 1296260 | `mselossbfp16 -I 10x100x100x100 -r none -Z 0_bwd` | 1.50053924367025 |
| MSELoss | bfloat16 | [10 10 50 100 100]-none | noncontiguous | random-lg | bwd | 11904466 | 8223780 | `mselossbfp16 -I 10x10x50x100x100 -r none -Z 0_bwd` | 1.44756620434885 |
| MSELoss | bfloat16 | [25 300]-sum | noncontiguous | random-md | bwd | 15104 | 10595 | `mselossbfp16 -I 25x300 -r sum -Z 0_bwd` | 1.42557810287872 |
| MSELoss | bfloat16 | [25 300]-mean | noncontiguous | random-md | bwd | 14992 | 10649 | `mselossbfp16 -I 25x300 -r mean -Z 0_bwd` | 1.40783172128838 |
| MSELoss | bfloat16 | [25 100 100]-none | noncontiguous | random-md | bwd | 31489 | 22685 | `mselossbfp16 -I 25x100x100 -r none -Z 0_bwd` | 1.38809786202336 |
| MSELoss | bfloat16 | [32 80]-none | noncontiguous | tacotron2 | bwd | 17792 | 13155 | `mselossbfp16 -I 32x80 -r none -Z 0_bwd` | 1.35248954770049 |
| MSELoss | bfloat16 | [32 80]-mean | noncontiguous | tacotron2 | bwd | 16928 | 12835 | `mselossbfp16 -I 32x80 -r mean -Z 0_bwd` | 1.3188936501753 |

Average performance (ROCm/MIOpen ratio)

  • MSELoss

| dtype | forward | backward |
|---|---|---|
| float32 | 2.52 | 0.74 |
| float16 | 2.08 | 0.66 |
| bfloat16 | 2.12 | 0.66 |

  • MSELossUnreduced

| dtype | forward | backward |
|---|---|---|
| float32 | 0.53 | 0.92 |
| float16 | 0.45 | 0.82 |
| bfloat16 | 0.49 | 0.86 |

Codepaths that do not yield sufficient performance gains have been cordoned off and made unavailable.
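In MIOpen this kind of gating usually lives in a solver's applicability check; a minimal self-contained sketch of the idea (hypothetical names, not this PR's actual solver code):

```cpp
// Sketch of the gating pattern (hypothetical types, not MIOpen's real solver
// API): the solver reports itself "not applicable" for the cases where no
// consistent win over the existing ROCm path was measured, so dispatch falls
// back to the default implementation there.
enum class Reduction { None, Sum, Mean };

struct Problem
{
    Reduction reduction;
    bool is_forward;
};

bool IsApplicable(const Problem& p)
{
    // Per the averages above, only the reduced forward path shows a
    // consistent speedup; everything else is cordoned off.
    return p.is_forward && p.reduction != Reduction::None;
}
```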

— o2buzzle, Jul 26 '24

Please note: this PR shares some code (particularly warp-level reduction, tensor views, etc.) with other PRs in this group of Moreh's upstream requests. We'll consolidate them as they get closer to being merged, to avoid merge conflicts.

— o2buzzle, Jul 26 '24

Also, is there any method we can use to view the runner's output for easier debugging? We could just ask you for the failing test, but it would be easier, and less work for you, if we could see what is wrong on our own.

— o2buzzle, Jul 29 '24

~~Hmmm... is the Static build the same as setting -DMIOPEN_EMBED_BUILD=On? Because if I set that on my local build, it seems that even develop fails to build.~~ nvm, it's something else

— o2buzzle, Aug 01 '24

The Windows build is not passing, but that is to be expected (please check #2970; previous conversations suggest it is the cause).

— o2buzzle, Aug 02 '24

This algorithm is very similar to https://github.com/ROCm/MIOpen/pull/3143. Could you explain why you use a different indexing scheme? @BuiChiTrung could you help too?

Also, could you remove the GPU-specific parts from the CPU implementation (more details in this comment: https://github.com/ROCm/MIOpen/pull/3143#discussion_r1711335230)? And may I ask you to align the test with the latest test design document (https://github.com/ROCm/MIOpen/wiki/GTest-development)?

— CAHEK7, Aug 09 '24

> Also, could you remove the GPU-specific parts from the CPU implementation (more details in this comment: #3143 (comment))? And may I ask you to align the test with the latest test design document (https://github.com/ROCm/MIOpen/wiki/GTest-development)?

The greatest enemy here is reduction, particularly because floating-point computations are handled differently depending on the processor and the ordering of operations. In our downstream conversations here you can see why, for verification to work, I had to write a whole section that literally just mimics how parallel warp-level reduction behaves on GPUs.
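For reference, the host-side "mimic" is essentially a pairwise tree sum performed in the same order the kernel uses; a simplified sketch (assumed block size, not the verifier's exact code):

```cpp
// Rough sketch: host-side sum that reproduces a GPU-style tree reduction
// order, block by block. LOCAL_SIZE is an assumed workgroup size.
#include <cstddef>
#include <vector>

constexpr std::size_t LOCAL_SIZE = 256; // assumed block size

float tree_reduce_like_gpu(std::vector<float> vals)
{
    // Pad to a multiple of the block size so every block is full.
    vals.resize(((vals.size() + LOCAL_SIZE - 1) / LOCAL_SIZE) * LOCAL_SIZE, 0.0f);

    std::vector<float> partials;
    for(std::size_t base = 0; base < vals.size(); base += LOCAL_SIZE)
    {
        // In-place pairwise reduction, halving the stride each step,
        // exactly the ordering a blocked GPU reduction would produce.
        for(std::size_t stride = LOCAL_SIZE / 2; stride > 0; stride /= 2)
            for(std::size_t i = 0; i < stride; ++i)
                vals[base + i] += vals[base + i + stride];
        partials.push_back(vals[base]);
    }

    // Block partials reduced serially here for brevity.
    float acc = 0.0f;
    for(float p : partials)
        acc += p;
    return acc;
}
```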

— o2buzzle, Aug 09 '24

>> Also, could you remove the GPU-specific parts from the CPU implementation (more details in this comment: #3143 (comment))? And may I ask you to align the test with the latest test design document (https://github.com/ROCm/MIOpen/wiki/GTest-development)?
>
> The greatest enemy here is reduction, particularly because floating-point computations are handled differently depending on the processor and the ordering of operations. In our downstream conversations here you can see why, for verification to work, I had to write a whole section that literally just mimics how parallel warp-level reduction behaves on GPUs.

Such a huge (really huge?) error means the kernel does not perform the reduction in an acceptable way. The algorithm is supposed to find an average across some dimension; mathematically it should just accumulate all the numbers and divide by the number of elements. If during the accumulation you get an error so large that you have to mimic GPU behavior instead of straightforward accumulation, the algorithm is not performing a mean calculation: it calculates something else, and if you alter, for example, the block size, it will calculate yet another thing.

There are only two ways to get this fixed, and both require getting rid of the GPU-specific code:

  1. Increase the error tolerance, if that is acceptable. We are comparing against the mathematical meaning of the algorithm.
  2. If increasing the tolerance is not possible, the algorithm must be improved, because it does not produce a mathematically correct result within acceptable error.
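
For illustration, the reference side can be as simple as plain serial accumulation in a wider type; a minimal sketch of the naive approach (not existing MIOpen test code):

```cpp
// Naive, algorithm-agnostic MSELoss reference: plain serial accumulation
// in a higher-precision accumulator (double), matching the mathematical
// definition rather than any particular kernel's reduction order.
#include <cstddef>

template <typename T>
double mseloss_reference_mean(const T* x, const T* y, std::size_t n)
{
    double acc = 0.0; // higher-precision accumulator
    for(std::size_t i = 0; i < n; ++i)
    {
        const double d = static_cast<double>(x[i]) - static_cast<double>(y[i]);
        acc += d * d;
    }
    return acc / static_cast<double>(n);
}
```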

— CAHEK7, Aug 09 '24

> Such a huge (really huge?) error means the kernel does not perform the reduction in an acceptable way.

It’s not, technically speaking, unacceptable; it’s a side effect of doing parallel reduction in this manner (it causes the ordering of the floating-point additions to differ from straight serial addition). If you want it to “match” what a naive addition would do, the only “acceptable” way would literally be just that (pull all the values back to the host and add them there). Then it has a chance of matching up.

> it calculates something else

I’m not sure this really is calculating something else. It is, for what it’s worth, adding all the values together, then dividing them by another value (or the reverse in this case: dividing within each thread, then adding the results up). Again, the issue here is what kind of floating-point-induced error we are willing to accept.
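(To make the ordering point concrete, here is a three-number float32 case where grouping alone changes the rounded sum:)

```cpp
// Tiny demonstration that float addition is not associative: the grouping
// (serial vs. tree order) changes the rounded result.
#include <cstdio>

int main()
{
    float a = 1e8f, b = -1e8f, c = 1.0f;
    float serial = (a + b) + c; // 1: the large terms cancel first, then c survives
    float tree   = a + (b + c); // 0: c is absorbed into b before the cancellation
    std::printf("serial=%g tree=%g\n", serial, tree);
}
```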

> Increase the error tolerance, if that is acceptable. We are comparing against the mathematical meaning of the algorithm.

Probably this one, but I think it would be slightly difficult to come up with a number that would both cover all cases and not be comically huge.

— o2buzzle, Aug 09 '24

>> Such a huge (really huge?) error means the kernel does not perform the reduction in an acceptable way.
>
> It’s not, technically speaking, unacceptable; it’s a side effect of doing parallel reduction in this manner (it causes the ordering of the floating-point additions to differ from straight serial addition). If you want it to “match” what a naive addition would do, the only “acceptable” way would literally be just that (pull all the values back to the host and add them there). Then it has a chance of matching up.

I'm perfectly aware of the pitfalls of parallel reduction and fp operations. Luckily we are not using atomics here. But again: the verification algorithm must be algorithm-agnostic and as generic as possible.

> Probably this one, but I think it would be slightly difficult to come up with a number that would both cover all cases and not be comically huge.

If there is no acceptable tolerance for some input data, it just means the algorithm is not applicable to that case, and this problem must not be hidden by a perfectly tuned verification algorithm.

We can implement another version of MSELoss, and we will use the same verification algorithm for it, because it is still MSELoss. That is also how to compare precision and stability between algorithms: against the same naive implementation, close to the mathematical meaning, probably with even higher-precision accumulators.

— CAHEK7, Aug 09 '24

I'm also working on integrating @long10024070's MIOpenReduceSum into the reduction part and removing that duplicated code. Due to some reorganization, though, please expect some delays on that.

— o2buzzle, Aug 22 '24

@CAHEK7 please take a look at the latest changes, and comment if you have any concerns.

— iq136boy, Sep 24 '24

The git tree became unreadable on the last merge attempt, so I think I will just squash + rebase everything. That makes final reviews easier.

— o2buzzle, Sep 30 '24

MIOpen is moving to the new monorepo setup and all older unmerged PRs are being closed. Please re-open this as part of the new repo if these changes are still needed.

— BradPepersAMD, Jul 14 '25