AMDMIGraphX Layernorm Perf improvement: increase wg size

Increases layernorm performance for the unet model by ~33%. The tensor {float, 2, 32, 81920} was evaluated as the test case.

Jun 19 '24 22:06 lakhinderwalia

What happens to the performance for sizes like {float, 2, 32, 512} and {float, 2, 32, 1024}?

Jun 19 '24 22:06 pfultz2

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.05%. Comparing base (f694f25) to head (0c1b308). Report is 165 commits behind head on develop.

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #3202   +/-   ##
========================================
  Coverage    92.05%   92.05%           
========================================
  Files          506      506           
  Lines        20837    20837           
========================================
  Hits         19181    19181           
  Misses        1656     1656


Jun 19 '24 23:06 codecov[bot]

> What happens to the performance for sizes like {float, 2, 32, 512} and {float, 2, 32, 1024}?

I see a smaller perf improvement there. Roughly 10% is the number I can quote for the smaller kernel, but run-to-run variation is an important caveat, so it is hard to be precise. Thanks.

Jun 20 '24 01:06 lakhinderwalia

> What happens to the performance for sizes like {float, 2, 32, 512} and {float, 2, 32, 1024}?

Actually, these sizes are too small: they still use a block size of 256.

I ran this across different sizes to compare where it would pick a different block size:

develop:
gpu::code_object[code_object=10256,symbol_name=layernorm_kernel,global=2097152,local=256,] -> float_type, {256, 32, 2096}, {67072, 2096, 1}: 0.110351ms
gpu::code_object[code_object=11024,symbol_name=layernorm_kernel,global=2097152,local=256,] -> float_type, {256, 32, 8192}, {262144, 8192, 1}: 0.428912ms
gpu::code_object[code_object=25936,symbol_name=layernorm_kernel,global=2097152,local=256,] -> float_type, {256, 32, 81920}, {2621440, 81920, 1}: 4.54644ms
gpu::code_object[code_object=9744,symbol_name=layernorm_kernel,global=2097152,local=256,] -> float_type, {256, 32, 327680}, {10485760, 327680, 1}: 32.6882ms
PR:
gpu::code_object[code_object=10000,symbol_name=layernorm_kernel,global=4194304,local=512,] -> float_type, {256, 32, 2096}, {67072, 2096, 1}: 0.107564ms
gpu::code_object[code_object=10256,symbol_name=layernorm_kernel,global=4194304,local=512,] -> float_type, {256, 32, 8192}, {262144, 8192, 1}: 0.42505ms
gpu::code_object[code_object=16976,symbol_name=layernorm_kernel,global=4194304,local=512,] -> float_type, {256, 32, 81920}, {2621440, 81920, 1}: 4.69303ms
gpu::code_object[code_object=60432,symbol_name=layernorm_kernel,global=4194304,local=512,] -> float_type, {256, 32, 327680}, {10485760, 327680, 1}: 53.006ms
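
The `global` and `local` values in these logs appear consistent with one workgroup per normalized row, i.e. `global = rows * local` with `rows = 256 * 32`. A minimal sketch of that arithmetic, assuming the last axis is the one being reduced (the helper name `launch_geometry` is hypothetical and not part of MIGraphX):

```python
from math import prod

def launch_geometry(shape, block_size):
    """Return (workgroups, global_size), assuming one workgroup per row.

    Assumption: the last axis is the normalized axis, so the number of
    workgroups equals the product of the leading dimensions.
    """
    rows = prod(shape[:-1])
    return rows, rows * block_size

shape = (256, 32, 81920)
develop = launch_geometry(shape, 256)  # block size used on develop
pr = launch_geometry(shape, 512)       # block size used in this PR

# Elements of one row that each work-item has to cover.
per_item_develop = shape[-1] / 256
per_item_pr = shape[-1] / 512
```

Doubling the block size doubles `global` while halving the number of elements each work-item covers in a row.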

With smaller sizes like 2096 it's about 2.5% faster by the timings above, whereas for 8192 there is almost no difference. However, once it gets larger it gets slower: 81920 is only 3% slower, but 327680 is about 60% slower. I don't think this is a good heuristic.
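
The relative differences can be recomputed directly from the timings listed above. A small sketch (the dictionary layout and the name `percent_change` are just for illustration):

```python
# (develop_ms, pr_ms) per last-axis length, copied from the logs above.
timings_ms = {
    2096: (0.110351, 0.107564),
    8192: (0.428912, 0.42505),
    81920: (4.54644, 4.69303),
    327680: (32.6882, 53.006),
}

def percent_change(develop_ms, pr_ms):
    """Percent change of the PR time relative to develop; positive means slower."""
    return (pr_ms - develop_ms) / develop_ms * 100.0

changes = {n: percent_change(*t) for n, t in timings_ms.items()}

for n, pct in changes.items():
    print(f"{n:>7}: {pct:+.1f}%")
```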

Without sampling a bunch of sizes to get a better formula for the block size, I think it would be better to just make the block size a tuning parameter so we can time it for each size we see.
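
As a rough illustration of what treating the block size as a tuning parameter could look like: time every candidate for each shape and cache the winner. This is only a sketch, not MIGraphX's tuning interface. `BlockSizeTuner` and `replay` are hypothetical names, and the measurement function here just replays the timings recorded above instead of launching a kernel.

```python
class BlockSizeTuner:
    """Pick the fastest block size per shape by measuring each candidate once."""

    def __init__(self, measure_ms, candidates=(256, 512)):
        self.measure_ms = measure_ms          # callable: (shape, block_size) -> ms
        self.candidates = tuple(candidates)
        self.cache = {}                       # shape -> best block size

    def best_for(self, shape):
        shape = tuple(shape)
        if shape not in self.cache:
            timings = {bs: self.measure_ms(shape, bs) for bs in self.candidates}
            self.cache[shape] = min(timings, key=timings.get)
        return self.cache[shape]

# Stand-in for a real benchmark: replay the timings from the logs above.
RECORDED_MS = {
    ((256, 32, 2096), 256): 0.110351,
    ((256, 32, 2096), 512): 0.107564,
    ((256, 32, 81920), 256): 4.54644,
    ((256, 32, 81920), 512): 4.69303,
    ((256, 32, 327680), 256): 32.6882,
    ((256, 32, 327680), 512): 53.006,
}

def replay(shape, block_size):
    return RECORDED_MS[(shape, block_size)]

tuner = BlockSizeTuner(replay)
small = tuner.best_for((256, 32, 2096))
medium = tuner.best_for((256, 32, 81920))
large = tuner.best_for((256, 32, 327680))
```

With the recorded numbers, the small shape would pick 512 and the two larger shapes would stay on 256, which a single fixed heuristic cannot express.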

Jun 20 '24 02:06 pfultz2

I am a strong proponent of making the block size a tuning parameter, based on my detailed testing results posted elsewhere. So I agree on that future direction. My guess is that it would apply not just to this operator, because there are likely many sizes where other operators that use a fixed block size today would show a degradation.

But bs=512 for layernorm didn't degrade in my tests compared to bs=256, and that is why I recommend it in the PR. I couldn't test {256, 32, 327680}, which failed to run. For the large size {256, 32, 81920}, I saw an approx. 15% improvement with bs=512 compared to bs=256.

(BTW, I use modified sizes in test_layernorm_test, which is actually a layernorm+mul+add kernel.)

Jun 20 '24 02:06 lakhinderwalia

Test	Batch	Rate new f7d855	Rate old 2b4d32	Diff	Compare
torchvision-resnet50	64	1,734.93	1,745.12	-0.58%	:white_check_mark:
torchvision-resnet50_fp16	64	4,031.06	4,046.92	-0.39%	:white_check_mark:
torchvision-densenet121	32	1,462.16	1,462.03	0.01%	:white_check_mark:
torchvision-densenet121_fp16	32	2,509.21	2,523.62	-0.57%	:white_check_mark:
torchvision-inceptionv3	32	873.41	878.17	-0.54%	:white_check_mark:
torchvision-inceptionv3_fp16	32	1,480.13	1,485.88	-0.39%	:white_check_mark:
cadene-inceptionv4	16	405.37	407.23	-0.46%	:white_check_mark:
cadene-resnext64x4	16	417.04	419.32	-0.54%	:white_check_mark:
slim-mobilenet	64	4,071.40	4,093.30	-0.54%	:white_check_mark:
slim-nasnetalarge	64	100.64	101.13	-0.49%	:white_check_mark:
slim-resnet50v2	64	1,676.56	1,685.74	-0.54%	:white_check_mark:
bert-mrpc-onnx	8	614.71	616.23	-0.25%	:white_check_mark:
bert-mrpc-tf	1	278.70	280.28	-0.56%	:white_check_mark:
pytorch-examples-wlang-gru	1	318.92	319.61	-0.22%	:white_check_mark:
pytorch-examples-wlang-lstm	1	294.38	289.43	1.71%	:white_check_mark:
torchvision-resnet50_1	1	470.40	470.04	0.08%	:white_check_mark:
cadene-dpn92_1	1	248.92	246.50	0.98%	:white_check_mark:
cadene-resnext101_1	1	204.32	205.62	-0.63%	:white_check_mark:
onnx-taau-downsample	1	203.99	204.80	-0.39%	:white_check_mark:
dlrm-criteoterabyte	1	22.82	22.91	-0.36%	:white_check_mark:
dlrm-criteoterabyte_fp16	1	42.59	42.68	-0.22%	:white_check_mark:
agentmodel	1	6,266.06	7,833.18	-20.01%	:red_circle:
unet_fp16	2	34.46	34.24	0.64%	:white_check_mark:
resnet50v1_fp16	1	583.97	583.34	0.11%	:white_check_mark:
resnet50v1_int8	1	585.32	581.62	0.64%	:white_check_mark:
bert_base_cased_fp16	64	642.51	646.23	-0.58%	:white_check_mark:
bert_large_uncased_fp16	32	197.81	199.02	-0.61%	:white_check_mark:
bert_large_fp16	1	117.42	117.58	-0.14%	:white_check_mark:
distilgpt2_fp16	16	1,204.26	1,211.17	-0.57%	:white_check_mark:
yolov5s	1	300.23	301.48	-0.41%	:white_check_mark:
tinyllama	1	23.21	23.32	-0.47%	:white_check_mark:
vicuna-fastchat	1	133.27	133.07	0.15%	:white_check_mark:
whisper-tiny-encoder	1	243.21	244.29	-0.44%	:white_check_mark:
whisper-tiny-decoder	1	255.47	256.34	-0.34%	:white_check_mark:

This build is not recommended to merge :red_circle:
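
For reference, the Diff column above matches `(new - old) / old` on the two rates. A minimal sketch of that calculation follows. The flagging threshold is a made-up placeholder, since the bot's actual cutoff is not stated anywhere in this thread.

```python
def rate_diff_percent(rate_new, rate_old):
    """Percent change in throughput of the new build relative to the old one."""
    return (rate_new - rate_old) / rate_old * 100.0

def flag(diff_percent, threshold=-5.0):
    """Hypothetical rule: mark a regression worse than `threshold` percent.

    The -5.0 default is an assumption for illustration only.
    """
    return ":red_circle:" if diff_percent < threshold else ":white_check_mark:"

agentmodel = rate_diff_percent(6266.06, 7833.18)
resnet50 = rate_diff_percent(1734.93, 1745.12)
lstm = rate_diff_percent(294.38, 289.43)
```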

Jun 21 '24 03:06 migraphx-bot

:white_check_mark: bert-mrpc-onnx: PASSED: MIGraphX meets tolerance

:white_check_mark: bert-mrpc-tf: PASSED: MIGraphX meets tolerance

:white_check_mark: pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance

:white_check_mark: pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance

:white_check_mark: torchvision-resnet50_1: PASSED: MIGraphX meets tolerance

:white_check_mark: cadene-dpn92_1: PASSED: MIGraphX meets tolerance

:white_check_mark: cadene-resnext101_1: PASSED: MIGraphX meets tolerance

:white_check_mark: dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance

:white_check_mark: agentmodel: PASSED: MIGraphX meets tolerance

:white_check_mark: unet: PASSED: MIGraphX meets tolerance

:white_check_mark: resnet50v1: PASSED: MIGraphX meets tolerance

:white_check_mark: bert_base_cased_fp16: PASSED: MIGraphX meets tolerance

:red_circle: bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output

:white_check_mark: bert_large: PASSED: MIGraphX meets tolerance

:white_check_mark: yolov5s: PASSED: MIGraphX meets tolerance

:white_check_mark: tinyllama: PASSED: MIGraphX meets tolerance

:white_check_mark: vicuna-fastchat: PASSED: MIGraphX meets tolerance

:white_check_mark: whisper-tiny-encoder: PASSED: MIGraphX meets tolerance

:white_check_mark: whisper-tiny-decoder: PASSED: MIGraphX meets tolerance

:white_check_mark: distilgpt2_fp16: PASSED: MIGraphX meets tolerance

Jun 21 '24 03:06 migraphx-bot