
Implement MSELoss Function

o2buzzle opened this issue 1 year ago • 12 comments

This PR ports the MSELoss family of loss functions to MIOpen:

  • MSELoss
  • MSELossUnreduced

Performance measurements suggest that, in general, we outperform ROCm on the forward, reduced operation (mostly thanks to parallel reduction).
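The reduced forward path leans on a standard tree-style wavefront reduction; a rough HIP sketch of the technique (illustrative only, not this PR's exact kernel):

```cpp
// Illustrative wavefront-level tree reduction in HIP (not the exact kernel
// from this PR). Each lane holds one squared difference; shuffles halve the
// number of active lanes at each step.
#include <hip/hip_runtime.h>

__device__ float warp_reduce_sum(float val)
{
    for(int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset); // pull the value from lane (id + offset)
    return val; // lane 0 ends up holding the wavefront's total
}

__global__ void mse_partial_sums(const float* x, const float* y, float* partials, size_t n)
{
    size_t gid = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    float diff = (gid < n) ? (x[gid] - y[gid]) : 0.0f;
    float sum  = warp_reduce_sum(diff * diff);
    // One partial per wavefront; 'partials' is sized to the number of
    // wavefronts launched and is reduced again by a follow-up launch.
    if(threadIdx.x % warpSize == 0)
        partials[gid / warpSize] = sum;
}
```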

Sample performance measurements

**float32**

| op_name | dtype | size-reduction | contiguous | model | direction | ROCm kernel time | MIOpen kernel time | MIOpenDriver cmdline | ROCm/MIOpen |
|---|---|---|---|---|---|---|---|---|---|
| MSELoss | float32 | [32 80]-sum | noncontiguous | tacotron2 | bwd | 20784 | 11804 | `mseloss -I 32x80 -r sum -Z 0_bwd` | 1.76075906472382 |
| MSELoss | float32 | [25 100 100]-mean | noncontiguous | random-md | fwd | 27568 | 16551 | `mseloss -I 25x100x100 -r mean -Z 0` | 1.66563953839647 |
| MSELoss | float32 | [25 300]-sum | noncontiguous | random-md | bwd | 15888 | 10204 | `mseloss -I 25x300 -r sum -Z 0_bwd` | 1.55703645629165 |
| MSELoss | float32 | [10 10 50 100 100]-none | noncontiguous | random-lg | bwd | 14407533 | 9301430 | `mseloss -I 10x10x50x100x100 -r none -Z 0_bwd` | 1.54895892352036 |
| MSELoss | float32 | [32 80]-mean | contiguous | tacotron2 | fwd | 16544 | 10791 | `mseloss -I 32x80 -r mean -Z 1_contig` | 1.53312945973496 |
| MSELoss | float32 | [25 300]-mean | noncontiguous | random-md | bwd | 15632 | 10400 | `mseloss -I 25x300 -r mean -Z 0_bwd` | 1.50307692307692 |
| MSELoss | float32 | [32 80]-none | noncontiguous | tacotron2 | bwd | 17760 | 12089 | `mseloss -I 32x80 -r none -Z 0_bwd` | 1.46910414426338 |
| MSELoss | float32 | [32 80]-sum | contiguous | tacotron2 | fwd | 15952 | 10862 | `mseloss -I 32x80 -r sum -Z 1_contig` | 1.46860614988032 |
| MSELoss | float32 | [10 100 100 100]-none | noncontiguous | random-lg | bwd | 2423129 | 1661950 | `mseloss -I 10x100x100x100 -r none -Z 0_bwd` | 1.45800355004663 |
| MSELoss | float32 | [2000 3000]-none | noncontiguous | random-lg | bwd | 547674 | 376328 | `mseloss -I 2000x3000 -r none -Z 0_bwd` | 1.45531026126145 |
| MSELoss | float32 | [25 100 100]-sum | contiguous | random-md | fwd | 23745 | 16676 | `mseloss -I 25x100x100 -r sum -Z 1_contig` | 1.42390261453586 |
| MSELoss | float32 | [25 100 100]-mean | contiguous | random-md | fwd | 23760 | 16711 | `mseloss -I 25x100x100 -r mean -Z 1_contig` | 1.42181796421519 |
| MSELoss | float32 | [2000 3000]-sum | contiguous | random-lg | fwd | 96514 | 68268 | `mseloss -I 2000x3000 -r sum -Z 1_contig` | 1.41375168453741 |
| MSELoss | float32 | [1000000]-mean | noncontiguous | random-lg | fwd | 33825 | 24231 | `mseloss -I 1000000 -r mean -Z 0` | 1.39593908629442 |
| MSELoss | float32 | [32 80 870]-sum | contiguous | t5 | fwd | 48673 | 34880 | `mseloss -I 32x80x870 -r sum -Z 1_contig` | 1.39544151376147 |
| MSELoss | float32 | [100 20 20 20]-none | noncontiguous | random-md | bwd | 83490 | 59877 | `mseloss -I 100x20x20x20 -r none -Z 0_bwd` | 1.39435843479132 |
| MSELoss | float32 | [32 80 870]-mean | contiguous | t5 | fwd | 48273 | 34934 | `mseloss -I 32x80x870 -r mean -Z 1_contig` | 1.38183431613901 |
| MSELoss | float32 | [100 10 10 10 10]-none | noncontiguous | random-md | bwd | 98626 | 71770 | `mseloss -I 100x10x10x10x10 -r none -Z 0_bwd` | 1.37419534624495 |
| MSELoss | float32 | [128 80 870]-sum | contiguous | tacotron2 | fwd | 129890 | 94687 | `mseloss -I 128x80x870 -r sum -Z 1_contig` | 1.37178282129543 |
| MSELoss | float32 | [2000 3000]-mean | contiguous | random-lg | fwd | 94226 | 68694 | `mseloss -I 2000x3000 -r mean -Z 1_contig` | 1.37167729350453 |
**float16**

| op_name | dtype | size-reduction | contiguous | model | direction | ROCm kernel time | MIOpen kernel time | MIOpenDriver cmdline | ROCm/MIOpen |
|---|---|---|---|---|---|---|---|---|---|
| MSELoss | float16 | [25 100 100]-mean | noncontiguous | random-md | fwd | 30241 | 16124 | `mselossfp16 -I 25x100x100 -r mean -Z 0` | 1.87552716447532 |
| MSELoss | float16 | [25 300]-none | noncontiguous | random-md | bwd | 18624 | 10080 | `mselossfp16 -I 25x300 -r none -Z 0_bwd` | 1.84761904761905 |
| MSELoss | float16 | [25 100 100]-sum | noncontiguous | random-md | fwd | 28785 | 16160 | `mselossfp16 -I 25x100x100 -r sum -Z 0` | 1.78125 |
| MSELoss | float16 | [32 80]-mean | contiguous | tacotron2 | fwd | 16880 | 10506 | `mselossfp16 -I 32x80 -r mean -Z 1_contig` | 1.6067009328003 |
| MSELoss | float16 | [32 80]-sum | contiguous | tacotron2 | fwd | 17008 | 10649 | `mselossfp16 -I 32x80 -r sum -Z 1_contig` | 1.59714527185651 |
| MSELoss | float16 | [25 300]-sum | noncontiguous | random-md | bwd | 16368 | 10275 | `mselossfp16 -I 25x300 -r sum -Z 0_bwd` | 1.59299270072993 |
| MSELoss | float16 | [32 80]-sum | noncontiguous | tacotron2 | bwd | 19920 | 12515 | `mselossfp16 -I 32x80 -r sum -Z 0_bwd` | 1.59168997203356 |
| MSELoss | float16 | [25 300]-mean | noncontiguous | random-md | bwd | 16448 | 10453 | `mselossfp16 -I 25x300 -r mean -Z 0_bwd` | 1.5735195637616 |
| MSELoss | float16 | [32 80]-none | noncontiguous | tacotron2 | bwd | 18128 | 12035 | `mselossfp16 -I 32x80 -r none -Z 0_bwd` | 1.50627336933943 |
| MSELoss | float16 | [25 100 100]-sum | contiguous | random-md | fwd | 23392 | 15911 | `mselossfp16 -I 25x100x100 -r sum -Z 1_contig` | 1.47017786437056 |
| MSELoss | float16 | [32 80]-mean | noncontiguous | tacotron2 | bwd | 17344 | 12125 | `mselossfp16 -I 32x80 -r mean -Z 0_bwd` | 1.43043298969072 |
| MSELoss | float16 | [10 10 50 100 100]-none | noncontiguous | random-lg | bwd | 11586974 | 8183960 | `mselossfp16 -I 10x10x50x100x100 -r none -Z 0_bwd` | 1.41581508218515 |
| MSELoss | float16 | [25 100 100]-mean | contiguous | random-md | fwd | 22784 | 16107 | `mselossfp16 -I 25x100x100 -r mean -Z 1_contig` | 1.41454026199789 |
| MSELoss | float16 | [25 100 100]-none | noncontiguous | random-md | bwd | 30769 | 22134 | `mselossfp16 -I 25x100x100 -r none -Z 0_bwd` | 1.39012379145206 |
| MSELoss | float16 | [100 20 20 20]-none | noncontiguous | random-md | bwd | 75393 | 57352 | `mselossfp16 -I 100x20x20x20 -r none -Z 0_bwd` | 1.31456618775282 |
| MSELoss | float16 | [100 10 10 10 10]-none | noncontiguous | random-md | bwd | 89410 | 68943 | `mselossfp16 -I 100x10x10x10x10 -r none -Z 0_bwd` | 1.29686842754159 |
| MSELoss | float16 | [2000 3000]-none | noncontiguous | random-lg | bwd | 463896 | 360150 | `mselossfp16 -I 2000x3000 -r none -Z 0_bwd` | 1.28806330695544 |
| MSELoss | float16 | [10 100 100 100]-none | noncontiguous | random-lg | bwd | 1869280 | 1504310 | `mselossfp16 -I 10x100x100x100 -r none -Z 0_bwd` | 1.24261621607249 |
| MSELoss | float16 | [25 100 100]-mean | noncontiguous | random-md | bwd | 26752 | 21618 | `mselossfp16 -I 25x100x100 -r mean -Z 0_bwd` | 1.23748727911925 |
| MSELoss | float16 | [1000000]-mean | contiguous | random-lg | fwd | 29456 | 24249 | `mselossfp16 -I 1000000 -r mean -Z 1_contig` | 1.21473050435069 |
**bfloat16**

| op_name | dtype | size-reduction | contiguous | model | direction | ROCm kernel time | MIOpen kernel time | MIOpenDriver cmdline | ROCm/MIOpen |
|---|---|---|---|---|---|---|---|---|---|
| MSELoss | bfloat16 | [100 20 20 20]-sum | noncontiguous | random-md | fwd | 43889 | 22329 | `mselossbfp16 -I 100x20x20x20 -r sum -Z 0` | 1.96556048188454 |
| MSELoss | bfloat16 | [25 300]-sum | contiguous | random-md | fwd | 19904 | 10169 | `mselossbfp16 -I 25x300 -r sum -Z 1_contig` | 1.95732127052808 |
| MSELoss | bfloat16 | [100 10 10 10 10]-mean | noncontiguous | random-md | fwd | 48049 | 24693 | `mselossbfp16 -I 100x10x10x10x10 -r mean -Z 0` | 1.94585510063581 |
| MSELoss | bfloat16 | [10000]-mean | noncontiguous | random-md | fwd | 19664 | 10151 | `mselossbfp16 -I 10000 -r mean -Z 0` | 1.93714904935474 |
| MSELoss | bfloat16 | [100 20 20 20]-mean | noncontiguous | random-md | fwd | 42273 | 22347 | `mselossbfp16 -I 100x20x20x20 -r mean -Z 0` | 1.89166331051148 |
| MSELoss | bfloat16 | [25 300]-none | noncontiguous | random-md | bwd | 19216 | 10293 | `mselossbfp16 -I 25x300 -r none -Z 0_bwd` | 1.86689983483921 |
| MSELoss | bfloat16 | [25 100 100]-sum | noncontiguous | random-md | fwd | 29441 | 16302 | `mselossbfp16 -I 25x100x100 -r sum -Z 0` | 1.80597472702736 |
| MSELoss | bfloat16 | [25 100 100]-mean | noncontiguous | random-md | fwd | 28848 | 16107 | `mselossbfp16 -I 25x100x100 -r mean -Z 0` | 1.79102253678525 |
| MSELoss | bfloat16 | [32 80]-mean | contiguous | tacotron2 | fwd | 17456 | 10240 | `mselossbfp16 -I 32x80 -r mean -Z 1_contig` | 1.7046875 |
| MSELoss | bfloat16 | [32 80]-sum | contiguous | tacotron2 | fwd | 17616 | 10507 | `mselossbfp16 -I 32x80 -r sum -Z 1_contig` | 1.67659655467783 |
| MSELoss | bfloat16 | [32 80]-sum | noncontiguous | tacotron2 | bwd | 21296 | 13013 | `mselossbfp16 -I 32x80 -r sum -Z 0_bwd` | 1.63651732882502 |
| MSELoss | bfloat16 | [25 100 100]-sum | contiguous | random-md | fwd | 24832 | 15982 | `mselossbfp16 -I 25x100x100 -r sum -Z 1_contig` | 1.55374796646227 |
| MSELoss | bfloat16 | [25 100 100]-mean | contiguous | random-md | fwd | 24592 | 15911 | `mselossbfp16 -I 25x100x100 -r mean -Z 1_contig` | 1.5455973854566 |
| MSELoss | bfloat16 | [10 100 100 100]-none | noncontiguous | random-lg | bwd | 1945089 | 1296260 | `mselossbfp16 -I 10x100x100x100 -r none -Z 0_bwd` | 1.50053924367025 |
| MSELoss | bfloat16 | [10 10 50 100 100]-none | noncontiguous | random-lg | bwd | 11904466 | 8223780 | `mselossbfp16 -I 10x10x50x100x100 -r none -Z 0_bwd` | 1.44756620434885 |
| MSELoss | bfloat16 | [25 300]-sum | noncontiguous | random-md | bwd | 15104 | 10595 | `mselossbfp16 -I 25x300 -r sum -Z 0_bwd` | 1.42557810287872 |
| MSELoss | bfloat16 | [25 300]-mean | noncontiguous | random-md | bwd | 14992 | 10649 | `mselossbfp16 -I 25x300 -r mean -Z 0_bwd` | 1.40783172128838 |
| MSELoss | bfloat16 | [25 100 100]-none | noncontiguous | random-md | bwd | 31489 | 22685 | `mselossbfp16 -I 25x100x100 -r none -Z 0_bwd` | 1.38809786202336 |
| MSELoss | bfloat16 | [32 80]-none | noncontiguous | tacotron2 | bwd | 17792 | 13155 | `mselossbfp16 -I 32x80 -r none -Z 0_bwd` | 1.35248954770049 |
| MSELoss | bfloat16 | [32 80]-mean | noncontiguous | tacotron2 | bwd | 16928 | 12835 | `mselossbfp16 -I 32x80 -r mean -Z 0_bwd` | 1.3188936501753 |

Average performance (ROCm/MIOpen ratio)

  • MSELoss

| dtype | forward | backward |
|---|---|---|
| float32 | 2.52 | 0.74 |
| float16 | 2.08 | 0.66 |
| bfloat16 | 2.12 | 0.66 |

  • MSELossUnreduced

| dtype | forward | backward |
|---|---|---|
| float32 | 0.53 | 0.92 |
| float16 | 0.45 | 0.82 |
| bfloat16 | 0.49 | 0.86 |

Codepaths that do not yield sufficient performance gains have been cordoned off and made unavailable.
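In MIOpen this kind of gating usually lives in a solver's applicability check; a minimal self-contained sketch of the idea (hypothetical names, not this PR's actual solver code):

```cpp
// Sketch of the gating pattern (hypothetical types, not MIOpen's real solver
// API): the solver reports itself "not applicable" for the cases where no
// consistent win over the existing ROCm path was measured, so dispatch falls
// back to the default implementation there.
enum class Reduction { None, Sum, Mean };

struct Problem
{
    Reduction reduction;
    bool is_forward;
};

bool IsApplicable(const Problem& p)
{
    // Per the averages above, only the reduced forward path shows a
    // consistent speedup; everything else is cordoned off.
    return p.is_forward && p.reduction != Reduction::None;
}
```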

— o2buzzle, Jul 26 '24

Please note: this PR shares some code (particularly warp-level reduction, tensor views, etc.) with other PRs in this group of Moreh's upstream requests. We'll consolidate them as they get closer to being merged, to avoid merge conflicts.

— o2buzzle, Jul 26 '24

Also, is there any method we can use to view the runner's output for easier debugging? We could just ask you for the failing test, but it would be easier, and less work for you, if we could see what is wrong on our own.

— o2buzzle, Jul 29 '24

~~Hmmm... is the Static build the same as setting -DMIOPEN_EMBED_BUILD=On? Because if I set that on my local build, it seems that even develop fails to build.~~ nvm, it's something else

— o2buzzle, Aug 01 '24

The Windows build is not passing, but that is to be expected (please check #2970; previous conversations suggest it is the cause).

— o2buzzle, Aug 02 '24

This algorithm is very similar to https://github.com/ROCm/MIOpen/pull/3143. Could you explain why you use a different indexing scheme? @BuiChiTrung could you help too?

Also, could you remove the GPU-specific parts from the CPU implementation (more details in this comment: https://github.com/ROCm/MIOpen/pull/3143#discussion_r1711335230)? And may I ask you to align the test with the latest test design document (https://github.com/ROCm/MIOpen/wiki/GTest-development)?

— CAHEK7, Aug 09 '24

> Also, could you remove the GPU-specific parts from the CPU implementation (more details in this comment: #3143 (comment))? And may I ask you to align the test with the latest test design document (https://github.com/ROCm/MIOpen/wiki/GTest-development)?

The greatest enemy here is reduction, particularly because floating-point computations are handled differently depending on the processor and the ordering of operations. In our downstream conversations here you can see why, for verification to work, I had to write a whole section that literally just mimics how parallel warp-level reduction behaves on GPUs.
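For reference, the host-side "mimic" is essentially a pairwise tree sum performed in the same order the kernel uses; a simplified sketch (assumed block size, not the verifier's exact code):

```cpp
// Rough sketch: host-side sum that reproduces a GPU-style tree reduction
// order, block by block. LOCAL_SIZE is an assumed workgroup size.
#include <cstddef>
#include <vector>

constexpr std::size_t LOCAL_SIZE = 256; // assumed block size

float tree_reduce_like_gpu(std::vector<float> vals)
{
    // Pad to a multiple of the block size so every block is full.
    vals.resize(((vals.size() + LOCAL_SIZE - 1) / LOCAL_SIZE) * LOCAL_SIZE, 0.0f);

    std::vector<float> partials;
    for(std::size_t base = 0; base < vals.size(); base += LOCAL_SIZE)
    {
        // In-place pairwise reduction, halving the stride each step,
        // exactly the ordering a blocked GPU reduction would produce.
        for(std::size_t stride = LOCAL_SIZE / 2; stride > 0; stride /= 2)
            for(std::size_t i = 0; i < stride; ++i)
                vals[base + i] += vals[base + i + stride];
        partials.push_back(vals[base]);
    }

    // Block partials reduced serially here for brevity.
    float acc = 0.0f;
    for(float p : partials)
        acc += p;
    return acc;
}
```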

— o2buzzle, Aug 09 '24

>> Also, could you remove the GPU-specific parts from the CPU implementation (more details in this comment: #3143 (comment))? And may I ask you to align the test with the latest test design document (https://github.com/ROCm/MIOpen/wiki/GTest-development)?
>
> The greatest enemy here is reduction, particularly because floating-point computations are handled differently depending on the processor and the ordering of operations. In our downstream conversations here you can see why, for verification to work, I had to write a whole section that literally just mimics how parallel warp-level reduction behaves on GPUs.

Such a huge (really huge?) error means the kernel does not perform the reduction in an acceptable way. The algorithm is supposed to find an average across some dimension; mathematically it should just accumulate all the numbers and divide by the number of elements. If during the accumulation you get an error so large that you have to mimic GPU behavior instead of straightforward accumulation, the algorithm is not performing a mean calculation: it calculates something else, and if you alter, for example, the block size, it will calculate yet another thing.

There are only two ways to get this fixed, and both require getting rid of the GPU-specific code:

  1. Increase the error tolerance, if that is acceptable. We are comparing against the mathematical meaning of the algorithm.
  2. If increasing the tolerance is not possible, the algorithm must be improved, because it does not produce a mathematically correct result within acceptable error.
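
For illustration, the reference side can be as simple as plain serial accumulation in a wider type; a minimal sketch of the naive approach (not existing MIOpen test code):

```cpp
// Naive, algorithm-agnostic MSELoss reference: plain serial accumulation
// in a higher-precision accumulator (double), matching the mathematical
// definition rather than any particular kernel's reduction order.
#include <cstddef>

template <typename T>
double mseloss_reference_mean(const T* x, const T* y, std::size_t n)
{
    double acc = 0.0; // higher-precision accumulator
    for(std::size_t i = 0; i < n; ++i)
    {
        const double d = static_cast<double>(x[i]) - static_cast<double>(y[i]);
        acc += d * d;
    }
    return acc / static_cast<double>(n);
}
```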

— CAHEK7, Aug 09 '24

> Such a huge (really huge?) error means the kernel does not perform the reduction in an acceptable way.

It’s not, technically speaking, unacceptable; it’s a side effect of doing parallel reduction in this manner (it causes the ordering of the floating-point additions to differ from straight serial addition). If you want it to “match” what a naive addition would do, the only “acceptable” way would literally be just that (pull all the values back to the host and add them there). Then it has a chance of matching up.

> it calculates something else

I’m not sure this really is calculating something else. It is, for what it’s worth, adding all the values together, then dividing them by another value (or the reverse in this case: dividing within each thread, then adding the results up). Again, the issue here is what kind of floating-point-induced error we are willing to accept.
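(To make the ordering point concrete, here is a three-number float32 case where grouping alone changes the rounded sum:)

```cpp
// Tiny demonstration that float addition is not associative: the grouping
// (serial vs. tree order) changes the rounded result.
#include <cstdio>

int main()
{
    float a = 1e8f, b = -1e8f, c = 1.0f;
    float serial = (a + b) + c; // 1: the large terms cancel first, then c survives
    float tree   = a + (b + c); // 0: c is absorbed into b before the cancellation
    std::printf("serial=%g tree=%g\n", serial, tree);
}
```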

> Increase the error tolerance, if that is acceptable. We are comparing against the mathematical meaning of the algorithm.

Probably this one, but I think it would be slightly difficult to come up with a number that would both cover all cases and not be comically huge.

— o2buzzle, Aug 09 '24

>> Such a huge (really huge?) error means the kernel does not perform the reduction in an acceptable way.
>
> It’s not, technically speaking, unacceptable; it’s a side effect of doing parallel reduction in this manner (it causes the ordering of the floating-point additions to differ from straight serial addition). If you want it to “match” what a naive addition would do, the only “acceptable” way would literally be just that (pull all the values back to the host and add them there). Then it has a chance of matching up.

I'm perfectly aware of the pitfalls of parallel reduction and fp operations. Luckily we are not using atomics here. But again: the verification algorithm must be algorithm-agnostic and as generic as possible.

> Probably this one, but I think it would be slightly difficult to come up with a number that would both cover all cases and not be comically huge.

If there is no acceptable tolerance for some input data, it just means the algorithm is not applicable to that case, and this problem must not be hidden by a perfectly tuned verification algorithm.

We can implement another version of MSELoss, and we will use the same verification algorithm for it, because it is still MSELoss. That is also how to compare precision and stability between algorithms: against the same naive implementation, close to the mathematical meaning, probably with even higher-precision accumulators.

— CAHEK7, Aug 09 '24

I'm also working on integrating @long10024070's MIOpenReduceSum into the reduction part and removing that duplicated code. Due to some reorganization, though, please expect some delays on that.

— o2buzzle, Aug 22 '24

@CAHEK7 please take a look at the latest changes, and comment if you have any concerns.

— iq136boy, Sep 24 '24

The git tree became unreadable on the last merge attempt, so I think I will just squash + rebase everything. That makes final reviews easier.

— o2buzzle, Sep 30 '24

MIOpen is moving to the new monorepo setup and all older unmerged PRs are being closed. Please re-open this as part of the new repo if these changes are still needed.

— BradPepersAMD, Jul 14 '25