Implement MSELoss Function
This PR ports the MSELoss family of loss functions to MIOpen:
- MSELoss
- MSELossUnreduced
Performance measurements suggest that, in general, we perform better than ROCm on forward reduced operations (mostly thanks to parallel reduction).
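For reference, here is a minimal, hypothetical sketch of the warp-level tree-reduction pattern alluded to above. The kernel name, signature, and launch assumptions are illustrative only and not the PR's actual code.

```cpp
// Illustrative sketch (assumed names, not the PR's kernel): per-block partial
// sums for a reduced MSE, using a warp-level tree reduction in HIP.
#include <hip/hip_runtime.h>

__device__ float warp_reduce_sum(float val)
{
    // Halve the number of active lanes each step; lane 0 ends up with the warp sum.
    for(int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
    return val;
}

// Assumes blockDim.x is a multiple of warpSize.
__global__ void mse_partial_sums(const float* in, const float* tgt, float* block_out, size_t n)
{
    __shared__ float lds[32]; // one slot per warp (enough for 1024-thread blocks)

    size_t gid    = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;

    float local = 0.0f;
    for(size_t i = gid; i < n; i += stride)
    {
        float d = in[i] - tgt[i];
        local += d * d;
    }

    int lane    = threadIdx.x % warpSize;
    int warp_id = threadIdx.x / warpSize;
    int n_warps = blockDim.x / warpSize;

    float wsum = warp_reduce_sum(local); // intra-warp reduction
    if(lane == 0)
        lds[warp_id] = wsum;
    __syncthreads();

    if(warp_id == 0) // first warp reduces the per-warp partial sums
    {
        float v = (lane < n_warps) ? lds[lane] : 0.0f;
        v       = warp_reduce_sum(v);
        if(lane == 0)
            block_out[blockIdx.x] = v;
    }
}
```

The per-block partial sums still need a second pass (or a host-side sum), plus a division by the element count for the mean reduction.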
Sample performance measurements
float32
| op_name | dtype | size-reduction | contiguity | model | direction | rocm kernel time | miopen kernel time | miopendriver cmdline | rocm/miopen time ratio |
|---|---|---|---|---|---|---|---|---|---|
| MSELoss | float32 | [32 80]-sum | noncontiguous | tacotron2 | bwd | 20784 | 11804 | mseloss -I 32x80 -r sum -Z 0_bwd | 1.76075906472382 |
| MSELoss | float32 | [25 100 100]-mean | noncontiguous | random-md | fwd | 27568 | 16551 | mseloss -I 25x100x100 -r mean -Z 0 | 1.66563953839647 |
| MSELoss | float32 | [25 300]-sum | noncontiguous | random-md | bwd | 15888 | 10204 | mseloss -I 25x300 -r sum -Z 0_bwd | 1.55703645629165 |
| MSELoss | float32 | [10 10 50 100 100]-none | noncontiguous | random-lg | bwd | 14407533 | 9301430 | mseloss -I 10x10x50x100x100 -r none -Z 0_bwd | 1.54895892352036 |
| MSELoss | float32 | [32 80]-mean | contiguous | tacotron2 | fwd | 16544 | 10791 | mseloss -I 32x80 -r mean -Z 1_contig | 1.53312945973496 |
| MSELoss | float32 | [25 300]-mean | noncontiguous | random-md | bwd | 15632 | 10400 | mseloss -I 25x300 -r mean -Z 0_bwd | 1.50307692307692 |
| MSELoss | float32 | [32 80]-none | noncontiguous | tacotron2 | bwd | 17760 | 12089 | mseloss -I 32x80 -r none -Z 0_bwd | 1.46910414426338 |
| MSELoss | float32 | [32 80]-sum | contiguous | tacotron2 | fwd | 15952 | 10862 | mseloss -I 32x80 -r sum -Z 1_contig | 1.46860614988032 |
| MSELoss | float32 | [10 100 100 100]-none | noncontiguous | random-lg | bwd | 2423129 | 1661950 | mseloss -I 10x100x100x100 -r none -Z 0_bwd | 1.45800355004663 |
| MSELoss | float32 | [2000 3000]-none | noncontiguous | random-lg | bwd | 547674 | 376328 | mseloss -I 2000x3000 -r none -Z 0_bwd | 1.45531026126145 |
| MSELoss | float32 | [25 100 100]-sum | contiguous | random-md | fwd | 23745 | 16676 | mseloss -I 25x100x100 -r sum -Z 1_contig | 1.42390261453586 |
| MSELoss | float32 | [25 100 100]-mean | contiguous | random-md | fwd | 23760 | 16711 | mseloss -I 25x100x100 -r mean -Z 1_contig | 1.42181796421519 |
| MSELoss | float32 | [2000 3000]-sum | contiguous | random-lg | fwd | 96514 | 68268 | mseloss -I 2000x3000 -r sum -Z 1_contig | 1.41375168453741 |
| MSELoss | float32 | [1000000]-mean | noncontiguous | random-lg | fwd | 33825 | 24231 | mseloss -I 1000000 -r mean -Z 0 | 1.39593908629442 |
| MSELoss | float32 | [32 80 870]-sum | contiguous | t5 | fwd | 48673 | 34880 | mseloss -I 32x80x870 -r sum -Z 1_contig | 1.39544151376147 |
| MSELoss | float32 | [100 20 20 20]-none | noncontiguous | random-md | bwd | 83490 | 59877 | mseloss -I 100x20x20x20 -r none -Z 0_bwd | 1.39435843479132 |
| MSELoss | float32 | [32 80 870]-mean | contiguous | t5 | fwd | 48273 | 34934 | mseloss -I 32x80x870 -r mean -Z 1_contig | 1.38183431613901 |
| MSELoss | float32 | [100 10 10 10 10]-none | noncontiguous | random-md | bwd | 98626 | 71770 | mseloss -I 100x10x10x10x10 -r none -Z 0_bwd | 1.37419534624495 |
| MSELoss | float32 | [128 80 870]-sum | contiguous | tacotron2 | fwd | 129890 | 94687 | mseloss -I 128x80x870 -r sum -Z 1_contig | 1.37178282129543 |
| MSELoss | float32 | [2000 3000]-mean | contiguous | random-lg | fwd | 94226 | 68694 | mseloss -I 2000x3000 -r mean -Z 1_contig | 1.37167729350453 |
float16
| op_name | dtype | size-reduction | contiguity | model | direction | rocm kernel time | miopen kernel time | miopendriver cmdline | rocm/miopen time ratio |
|---|---|---|---|---|---|---|---|---|---|
| MSELoss | float16 | [25 100 100]-mean | noncontiguous | random-md | fwd | 30241 | 16124 | mselossfp16 -I 25x100x100 -r mean -Z 0 | 1.87552716447532 |
| MSELoss | float16 | [25 300]-none | noncontiguous | random-md | bwd | 18624 | 10080 | mselossfp16 -I 25x300 -r none -Z 0_bwd | 1.84761904761905 |
| MSELoss | float16 | [25 100 100]-sum | noncontiguous | random-md | fwd | 28785 | 16160 | mselossfp16 -I 25x100x100 -r sum -Z 0 | 1.78125 |
| MSELoss | float16 | [32 80]-mean | contiguous | tacotron2 | fwd | 16880 | 10506 | mselossfp16 -I 32x80 -r mean -Z 1_contig | 1.6067009328003 |
| MSELoss | float16 | [32 80]-sum | contiguous | tacotron2 | fwd | 17008 | 10649 | mselossfp16 -I 32x80 -r sum -Z 1_contig | 1.59714527185651 |
| MSELoss | float16 | [25 300]-sum | noncontiguous | random-md | bwd | 16368 | 10275 | mselossfp16 -I 25x300 -r sum -Z 0_bwd | 1.59299270072993 |
| MSELoss | float16 | [32 80]-sum | noncontiguous | tacotron2 | bwd | 19920 | 12515 | mselossfp16 -I 32x80 -r sum -Z 0_bwd | 1.59168997203356 |
| MSELoss | float16 | [25 300]-mean | noncontiguous | random-md | bwd | 16448 | 10453 | mselossfp16 -I 25x300 -r mean -Z 0_bwd | 1.5735195637616 |
| MSELoss | float16 | [32 80]-none | noncontiguous | tacotron2 | bwd | 18128 | 12035 | mselossfp16 -I 32x80 -r none -Z 0_bwd | 1.50627336933943 |
| MSELoss | float16 | [25 100 100]-sum | contiguous | random-md | fwd | 23392 | 15911 | mselossfp16 -I 25x100x100 -r sum -Z 1_contig | 1.47017786437056 |
| MSELoss | float16 | [32 80]-mean | noncontiguous | tacotron2 | bwd | 17344 | 12125 | mselossfp16 -I 32x80 -r mean -Z 0_bwd | 1.43043298969072 |
| MSELoss | float16 | [10 10 50 100 100]-none | noncontiguous | random-lg | bwd | 11586974 | 8183960 | mselossfp16 -I 10x10x50x100x100 -r none -Z 0_bwd | 1.41581508218515 |
| MSELoss | float16 | [25 100 100]-mean | contiguous | random-md | fwd | 22784 | 16107 | mselossfp16 -I 25x100x100 -r mean -Z 1_contig | 1.41454026199789 |
| MSELoss | float16 | [25 100 100]-none | noncontiguous | random-md | bwd | 30769 | 22134 | mselossfp16 -I 25x100x100 -r none -Z 0_bwd | 1.39012379145206 |
| MSELoss | float16 | [100 20 20 20]-none | noncontiguous | random-md | bwd | 75393 | 57352 | mselossfp16 -I 100x20x20x20 -r none -Z 0_bwd | 1.31456618775282 |
| MSELoss | float16 | [100 10 10 10 10]-none | noncontiguous | random-md | bwd | 89410 | 68943 | mselossfp16 -I 100x10x10x10x10 -r none -Z 0_bwd | 1.29686842754159 |
| MSELoss | float16 | [2000 3000]-none | noncontiguous | random-lg | bwd | 463896 | 360150 | mselossfp16 -I 2000x3000 -r none -Z 0_bwd | 1.28806330695544 |
| MSELoss | float16 | [10 100 100 100]-none | noncontiguous | random-lg | bwd | 1869280 | 1504310 | mselossfp16 -I 10x100x100x100 -r none -Z 0_bwd | 1.24261621607249 |
| MSELoss | float16 | [25 100 100]-mean | noncontiguous | random-md | bwd | 26752 | 21618 | mselossfp16 -I 25x100x100 -r mean -Z 0_bwd | 1.23748727911925 |
| MSELoss | float16 | [1000000]-mean | contiguous | random-lg | fwd | 29456 | 24249 | mselossfp16 -I 1000000 -r mean -Z 1_contig | 1.21473050435069 |
bfloat16
| op_name | dtype | size-reduction | contiguity | model | direction | rocm kernel time | miopen kernel time | miopendriver cmdline | rocm/miopen time ratio |
|---|---|---|---|---|---|---|---|---|---|
| MSELoss | bfloat16 | [100 20 20 20]-sum | noncontiguous | random-md | fwd | 43889 | 22329 | mselossbfp16 -I 100x20x20x20 -r sum -Z 0 | 1.96556048188454 |
| MSELoss | bfloat16 | [25 300]-sum | contiguous | random-md | fwd | 19904 | 10169 | mselossbfp16 -I 25x300 -r sum -Z 1_contig | 1.95732127052808 |
| MSELoss | bfloat16 | [100 10 10 10 10]-mean | noncontiguous | random-md | fwd | 48049 | 24693 | mselossbfp16 -I 100x10x10x10x10 -r mean -Z 0 | 1.94585510063581 |
| MSELoss | bfloat16 | [10000]-mean | noncontiguous | random-md | fwd | 19664 | 10151 | mselossbfp16 -I 10000 -r mean -Z 0 | 1.93714904935474 |
| MSELoss | bfloat16 | [100 20 20 20]-mean | noncontiguous | random-md | fwd | 42273 | 22347 | mselossbfp16 -I 100x20x20x20 -r mean -Z 0 | 1.89166331051148 |
| MSELoss | bfloat16 | [25 300]-none | noncontiguous | random-md | bwd | 19216 | 10293 | mselossbfp16 -I 25x300 -r none -Z 0_bwd | 1.86689983483921 |
| MSELoss | bfloat16 | [25 100 100]-sum | noncontiguous | random-md | fwd | 29441 | 16302 | mselossbfp16 -I 25x100x100 -r sum -Z 0 | 1.80597472702736 |
| MSELoss | bfloat16 | [25 100 100]-mean | noncontiguous | random-md | fwd | 28848 | 16107 | mselossbfp16 -I 25x100x100 -r mean -Z 0 | 1.79102253678525 |
| MSELoss | bfloat16 | [32 80]-mean | contiguous | tacotron2 | fwd | 17456 | 10240 | mselossbfp16 -I 32x80 -r mean -Z 1_contig | 1.7046875 |
| MSELoss | bfloat16 | [32 80]-sum | contiguous | tacotron2 | fwd | 17616 | 10507 | mselossbfp16 -I 32x80 -r sum -Z 1_contig | 1.67659655467783 |
| MSELoss | bfloat16 | [32 80]-sum | noncontiguous | tacotron2 | bwd | 21296 | 13013 | mselossbfp16 -I 32x80 -r sum -Z 0_bwd | 1.63651732882502 |
| MSELoss | bfloat16 | [25 100 100]-sum | contiguous | random-md | fwd | 24832 | 15982 | mselossbfp16 -I 25x100x100 -r sum -Z 1_contig | 1.55374796646227 |
| MSELoss | bfloat16 | [25 100 100]-mean | contiguous | random-md | fwd | 24592 | 15911 | mselossbfp16 -I 25x100x100 -r mean -Z 1_contig | 1.5455973854566 |
| MSELoss | bfloat16 | [10 100 100 100]-none | noncontiguous | random-lg | bwd | 1945089 | 1296260 | mselossbfp16 -I 10x100x100x100 -r none -Z 0_bwd | 1.50053924367025 |
| MSELoss | bfloat16 | [10 10 50 100 100]-none | noncontiguous | random-lg | bwd | 11904466 | 8223780 | mselossbfp16 -I 10x10x50x100x100 -r none -Z 0_bwd | 1.44756620434885 |
| MSELoss | bfloat16 | [25 300]-sum | noncontiguous | random-md | bwd | 15104 | 10595 | mselossbfp16 -I 25x300 -r sum -Z 0_bwd | 1.42557810287872 |
| MSELoss | bfloat16 | [25 300]-mean | noncontiguous | random-md | bwd | 14992 | 10649 | mselossbfp16 -I 25x300 -r mean -Z 0_bwd | 1.40783172128838 |
| MSELoss | bfloat16 | [25 100 100]-none | noncontiguous | random-md | bwd | 31489 | 22685 | mselossbfp16 -I 25x100x100 -r none -Z 0_bwd | 1.38809786202336 |
| MSELoss | bfloat16 | [32 80]-none | noncontiguous | tacotron2 | bwd | 17792 | 13155 | mselossbfp16 -I 32x80 -r none -Z 0_bwd | 1.35248954770049 |
| MSELoss | bfloat16 | [32 80]-mean | noncontiguous | tacotron2 | bwd | 16928 | 12835 | mselossbfp16 -I 32x80 -r mean -Z 0_bwd | 1.3188936501753 |
Average performance (average rocm/miopen kernel-time ratio)
MSELoss
| dtype | forward | backward |
|---|---|---|
| float32 | 2.52 | 0.74 |
| float16 | 2.08 | 0.66 |
| bfloat16 | 2.12 | 0.66 |
MSELossUnreduced
| dtype | forward | backward |
|---|---|---|
| float32 | 0.53 | 0.92 |
| float16 | 0.45 | 0.82 |
| bfloat16 | 0.49 | 0.86 |
Codepaths that do not yield sufficient performance gains have been cordoned off and made unavailable.
Please note: this PR shares some code (particularly warp-level reduction, tensor view, etc.) with other code in this group of Moreh’s upstream requests. We’ll consolidate them as they get closer to being merged, to avoid merge conflicts.
Also, is there any method we can use to view the runner's output for easier debugging? I guess we can just ask you for the failing test, but it might be easier, and less work for you, if we can see what is wrong on our own.
~~Hmmm... is the Static build the same as setting -DMIOPEN_EMBED_BUILD=On? Because if I set that on my local build, it seems that even develop fails to build.~~ Never mind, it's something else.
The Windows build is not passing, but that is to be expected (please check #2970; previous conversations seem to suggest it is the cause).
This algorithm is very similar to https://github.com/ROCm/MIOpen/pull/3143. Could you explain why you use a different indexing scheme? @BuiChiTrung, could you help too?
Also, could you remove the GPU-specific parts from the CPU implementation (more details are in this comment: https://github.com/ROCm/MIOpen/pull/3143#discussion_r1711335230)? And may I ask you to align the test with the latest test design document? (https://github.com/ROCm/MIOpen/wiki/GTest-development)
> Also, could you remove the GPU-specific parts from the CPU implementation (more details are in this comment: #3143 (comment))? And may I ask you to align the test with the latest test design document? (https://github.com/ROCm/MIOpen/wiki/GTest-development)
The greatest enemy here is reduction, particularly because floating-point computations are handled differently across processors and depend on ordering. You can see in our downstream conversations here why I had to add a whole section that literally just mimics how a parallel warp-level reduction on the GPU would behave, in order for verification to pass.
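To illustrate the ordering issue with a standalone example (not MIOpen code): summing the same float values serially and summing them pairwise, roughly the order a parallel tree reduction uses, generally gives slightly different results, because floating-point addition is not associative.

```cpp
// Standalone illustration: serial vs. pairwise (tree-order) summation of the
// same float values produces different rounding, hence different results.
#include <cstdio>
#include <vector>

float serial_sum(const std::vector<float>& v)
{
    float s = 0.0f;
    for(float x : v)
        s += x; // strict left-to-right order
    return s;
}

// Pairwise sum, roughly the order a parallel tree reduction produces.
float tree_sum(const std::vector<float>& v, size_t lo, size_t hi)
{
    if(hi - lo == 1)
        return v[lo];
    size_t mid = lo + (hi - lo) / 2;
    return tree_sum(v, lo, mid) + tree_sum(v, mid, hi);
}

int main()
{
    std::vector<float> v(1 << 20);
    for(size_t i = 0; i < v.size(); ++i)
        v[i] = 1.0f + 1e-7f * float(i % 1000); // values that accumulate rounding error

    float a = serial_sum(v);
    float b = tree_sum(v, 0, v.size());
    std::printf("serial=%.6f tree=%.6f diff=%.6f\n", a, b, a - b);
}
```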
Such a huge (really huge?) error means that the kernel doesn't perform the reduction in an acceptable way. The algorithm implies finding an average across some dimension, and mathematically it should just accumulate all the numbers and divide by the number of elements. If during accumulation you get such a huge error that you have to mimic GPU behavior instead of straightforward accumulation, it means the algorithm does not perform a mean calculation; it calculates something else, and if you alter, for example, the block size, it will calculate yet another thing.
There are only two ways to get this fixed, and both require getting rid of the GPU-specific code:
- Increase the error tolerance, if that is acceptable. We are comparing against the mathematical meaning of the algorithm.
- If increasing the tolerance is not possible, the algorithm must be improved, because it does not produce a mathematically correct result within acceptable error.
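For concreteness, here is a minimal sketch of what the first option could look like: a naive reference accumulated in double, compared with a relative tolerance. The helper names and the tolerance value are placeholders, not MIOpen's actual verification code.

```cpp
// Sketch of an algorithm-agnostic check: a naive double-precision reference
// for MSE with mean reduction, plus a relative-tolerance comparison.
#include <algorithm>
#include <cmath>
#include <cstddef>

double mse_reference(const float* in, const float* tgt, std::size_t n)
{
    double acc = 0.0; // higher-precision accumulator
    for(std::size_t i = 0; i < n; ++i)
    {
        double d = double(in[i]) - double(tgt[i]);
        acc += d * d;
    }
    return acc / double(n); // reduction = mean
}

bool verify(float gpu_result, double ref, double tol = 1e-5 /* placeholder */)
{
    double err = std::fabs(double(gpu_result) - ref) / std::max(std::fabs(ref), 1.0);
    return err <= tol;
}
```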
> Such a huge (really huge?) error means that the kernel doesn't perform the reduction in an acceptable way.
Technically speaking, it's not unacceptable; it's just a side effect of doing parallel reduction this way (the ordering of the added floating-point values differs from straight serial addition). If you want it to “match” what a naive addition would produce, the only “acceptable” way is literally just that (pull all the values back to the host and add them there). Then it has a chance of matching up.
> it calculates something else
I'm not sure this really is calculating something else. It is, for what it's worth, adding all the values together and then dividing by another value (or in reverse order in this case: dividing within each thread, then adding the results up). Again, the issue here is what kind of floating-point-induced error we are willing to accept.
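A tiny standalone illustration of that point (synthetic values, not the kernel's data): both orderings compute the mean, but they round differently in float.

```cpp
// Dividing per element and then summing, versus summing and then dividing once,
// are both "the mean" but round differently in single precision.
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    const std::size_t n = 1u << 20;
    std::vector<float> v(n);
    for(std::size_t i = 0; i < n; ++i)
        v[i] = 0.1f * float(i % 7 + 1);

    float sum_then_divide = 0.0f;
    float divide_then_sum = 0.0f;
    for(float x : v)
    {
        sum_then_divide += x;
        divide_then_sum += x / float(n);
    }
    sum_then_divide /= float(n);

    std::printf("sum-then-divide=%.8f divide-then-sum=%.8f\n", sum_then_divide, divide_then_sum);
}
```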
> Increase the error tolerance, if that is acceptable. We are comparing against the mathematical meaning of the algorithm.
Probably this one, but I think it would be slightly difficult to come up with a number that both covers all cases and is not comically huge.
> > Such a huge (really huge?) error means that the kernel doesn't perform the reduction in an acceptable way.
>
> Technically speaking, it's not unacceptable; it's just a side effect of doing parallel reduction this way (the ordering of the added floating-point values differs from straight serial addition). If you want it to “match” what a naive addition would produce, the only “acceptable” way is literally just that (pull all the values back to the host and add them there). Then it has a chance of matching up.
I'm perfectly aware of parallel-reduction pitfalls and fp operations. Luckily we are not using atomics here. But again: the verification algorithm must be algorithm-agnostic and as generic as possible.
> Probably this one, but I think it would be slightly difficult to come up with a number that both covers all cases and is not comically huge.
If there is no acceptable tolerance for some input data, it just means that the algorithm is not applicable to that case, and this problem must not be hidden by a perfectly tuned verification algorithm.
We could implement another version of MSELoss and we would use the same verification algorithm for it, because it is still MSELoss. That is also how we can compare precision and stability between algorithms: against the same naive implementation, as close as possible to the mathematical meaning, probably with even higher-precision accumulators.
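As one possible way to build such a higher-precision naive reference, here is a sketch using compensated (Kahan) summation. This is offered only as an illustration, not as what MIOpen's verification actually does.

```cpp
// Hypothetical helper: a compensated (Kahan) accumulator for a naive MSE reference.
#include <cstddef>

double mse_reference_kahan(const float* in, const float* tgt, std::size_t n)
{
    double sum = 0.0, c = 0.0; // c carries the running rounding error
    for(std::size_t i = 0; i < n; ++i)
    {
        double d = double(in[i]) - double(tgt[i]);
        double y = d * d - c;
        double t = sum + y;
        c        = (t - sum) - y; // what was lost in the addition
        sum      = t;
    }
    return sum / double(n);
}
```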
I'm also working on integrating MIOpenReduceSum from @long10024070 into the reduction part and removing that duplicated code. Due to some reorganization, please expect some delays on that.
@CAHEK7 please take a look at the latest changes and comment if you have any concerns.
The git tree became unreadable after the last merge attempt; I think I will just squash + rebase everything. That makes the final reviews easier.
MIOpen is moving to the new monorepo setup and all older unmerged PRs are being closed. Please re-open this as part of the new repo if these changes are still needed.