Layernorm Perf improvement: increase wg size
Increases layernorm performance for the unet model by ~33%. The tensor {float, 2, 32, 81920} was evaluated as the test case.
What happens to the performance for sizes like {float, 2, 32, 512} and {float, 2, 32, 1024}?
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 92.05%. Comparing base (f694f25) to head (0c1b308). Report is 165 commits behind head on develop.
Additional details and impacted files
@@ Coverage Diff @@
## develop #3202 +/- ##
========================================
Coverage 92.05% 92.05%
========================================
Files 506 506
Lines 20837 20837
========================================
Hits 19181 19181
Misses 1656 1656
> What happens to the performance for sizes like {float, 2, 32, 512} and {float, 2, 32, 1024}?
I see a smaller perf improvement for those sizes. Roughly 10% is a number I can quote for the smaller kernels, but the run-to-run variation is an important caveat, so it is hard to be precise. Thanks.
> What happens to the performance for sizes like {float, 2, 32, 512} and {float, 2, 32, 1024}?

Actually, these sizes are too small to be affected, as they still use a block size of 256.
I ran this across different sizes to compare where it would pick a different block size:
develop:
gpu::code_object[code_object=10256,symbol_name=layernorm_kernel,global=2097152,local=256,] -> float_type, {256, 32, 2096}, {67072, 2096, 1}: 0.110351ms
gpu::code_object[code_object=11024,symbol_name=layernorm_kernel,global=2097152,local=256,] -> float_type, {256, 32, 8192}, {262144, 8192, 1}: 0.428912ms
gpu::code_object[code_object=25936,symbol_name=layernorm_kernel,global=2097152,local=256,] -> float_type, {256, 32, 81920}, {2621440, 81920, 1}: 4.54644ms
gpu::code_object[code_object=9744,symbol_name=layernorm_kernel,global=2097152,local=256,] -> float_type, {256, 32, 327680}, {10485760, 327680, 1}: 32.6882ms
PR:
gpu::code_object[code_object=10000,symbol_name=layernorm_kernel,global=4194304,local=512,] -> float_type, {256, 32, 2096}, {67072, 2096, 1}: 0.107564ms
gpu::code_object[code_object=10256,symbol_name=layernorm_kernel,global=4194304,local=512,] -> float_type, {256, 32, 8192}, {262144, 8192, 1}: 0.42505ms
gpu::code_object[code_object=16976,symbol_name=layernorm_kernel,global=4194304,local=512,] -> float_type, {256, 32, 81920}, {2621440, 81920, 1}: 4.69303ms
gpu::code_object[code_object=60432,symbol_name=layernorm_kernel,global=4194304,local=512,] -> float_type, {256, 32, 327680}, {10485760, 327680, 1}: 53.006ms
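As a sanity check on the launch configurations in these logs, dividing `global` by `local` gives the number of workgroups per launch. The sketch below does only that arithmetic; the names `launches` and `workgroup_count` are illustrative and not part of MIGraphX. Reading the result as one workgroup per normalized row is an assumption, not something the logs state.

```python
# Launch configurations copied from the logs above: (global, local).
launches = {
    "develop": (2097152, 256),
    "PR": (4194304, 512),
}


def workgroup_count(global_size: int, local_size: int) -> int:
    """Number of workgroups for a 1-D launch where global is a multiple of local."""
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of local size")
    return global_size // local_size


# Product of the two leading dimensions of the {256, 32, N} test tensors.
rows = 256 * 32

for name, (global_size, local_size) in launches.items():
    count = workgroup_count(global_size, local_size)
    print(f"{name}: {count} workgroups of {local_size} threads (rows = {rows})")
```

Both configurations launch the same number of workgroups, so the PR changes the threads per workgroup rather than the number of workgroups.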
With smaller sizes like 2096 it's about 2.5% faster (0.1076 ms vs. 0.1104 ms), whereas at 8192 there is almost no difference. However, once the size gets larger it gets slower: 81920 is only 3% slower, but 327680 is about 60% slower. I don't think this is a good heuristic.
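The relative differences can be recomputed directly from the timings posted above. This is a minimal sketch: the timings are copied from the logs, while the names `timings_ms` and `percent_change` are illustrative.

```python
# (develop_ms, pr_ms) per last-dimension size, copied from the logs above.
timings_ms = {
    2096: (0.110351, 0.107564),
    8192: (0.428912, 0.42505),
    81920: (4.54644, 4.69303),
    327680: (32.6882, 53.006),
}


def percent_change(old_ms: float, new_ms: float) -> float:
    """Relative change in run time; negative means the new run is faster."""
    return (new_ms - old_ms) / old_ms * 100.0


changes = {n: percent_change(old, new) for n, (old, new) in timings_ms.items()}

for n, pct in changes.items():
    print(f"{n:>7}: {pct:+.1f}%")
```

With these numbers the change in run time is roughly -2.5%, -0.9%, +3.2% and +62.2% for 2096, 8192, 81920 and 327680 respectively.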
Without sampling a bunch of sizes to get a better formula for the block size, I think it would be better to just make the block size a tuning parameter so we can time it for each size we see.
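A rough sketch of what "time it for each size" could look like: try each candidate block size and keep the fastest. This is not MIGraphX's tuning API; `tune_block_size` and `measured_ms` are hypothetical names, and the timing callback here just looks up the measurements posted above instead of running a kernel.

```python
from typing import Callable, Iterable


def tune_block_size(candidates: Iterable[int], time_ms: Callable[[int], float]) -> int:
    """Return the candidate block size with the lowest measured time."""
    return min(candidates, key=time_ms)


# Stand-in for real benchmarking: timings (ms) from the logs above,
# keyed by last-dimension size and then by block size.
measured_ms = {
    2096: {256: 0.110351, 512: 0.107564},
    8192: {256: 0.428912, 512: 0.42505},
    81920: {256: 4.54644, 512: 4.69303},
    327680: {256: 32.6882, 512: 53.006},
}

# Pick the best block size independently for every size we have seen.
best = {
    size: tune_block_size(by_block.keys(), by_block.__getitem__)
    for size, by_block in measured_ms.items()
}

for size, block in best.items():
    print(f"size {size}: block size {block}")
```

On the timings above this selects 512 for 2096 and 8192, and 256 for 81920 and 327680. In a real tuner the callback would launch and time the kernel, ideally over several runs, since small differences may be within measurement noise.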
I am a strong proponent of making the block size a tuning parameter, based on the detailed test results I posted elsewhere, so I agree on that future direction. My guess is that it would apply to more than this operator, because other operators that use a fixed block size today would probably show a degradation at many sizes as well.
But bs=512 for layernorm didn't degrade in my tests compared to bs=256, which is why I recommend it in this PR. I couldn't test {256, 32, 327680}; it failed to run. The large size {256, 32, 81920} showed an approx. 15% improvement with bs=512 compared to bs=256.
(BTW, I use modified sizes in test_layernorm_test, which is actually a layernorm+mul+add kernel.)
| Test | Batch | Rate new f7d855 | Rate old 2b4d32 | Diff | Compare |
|---|---|---|---|---|---|
| torchvision-resnet50 | 64 | 1,734.93 | 1,745.12 | -0.58% | :white_check_mark: |
| torchvision-resnet50_fp16 | 64 | 4,031.06 | 4,046.92 | -0.39% | :white_check_mark: |
| torchvision-densenet121 | 32 | 1,462.16 | 1,462.03 | 0.01% | :white_check_mark: |
| torchvision-densenet121_fp16 | 32 | 2,509.21 | 2,523.62 | -0.57% | :white_check_mark: |
| torchvision-inceptionv3 | 32 | 873.41 | 878.17 | -0.54% | :white_check_mark: |
| torchvision-inceptionv3_fp16 | 32 | 1,480.13 | 1,485.88 | -0.39% | :white_check_mark: |
| cadene-inceptionv4 | 16 | 405.37 | 407.23 | -0.46% | :white_check_mark: |
| cadene-resnext64x4 | 16 | 417.04 | 419.32 | -0.54% | :white_check_mark: |
| slim-mobilenet | 64 | 4,071.40 | 4,093.30 | -0.54% | :white_check_mark: |
| slim-nasnetalarge | 64 | 100.64 | 101.13 | -0.49% | :white_check_mark: |
| slim-resnet50v2 | 64 | 1,676.56 | 1,685.74 | -0.54% | :white_check_mark: |
| bert-mrpc-onnx | 8 | 614.71 | 616.23 | -0.25% | :white_check_mark: |
| bert-mrpc-tf | 1 | 278.70 | 280.28 | -0.56% | :white_check_mark: |
| pytorch-examples-wlang-gru | 1 | 318.92 | 319.61 | -0.22% | :white_check_mark: |
| pytorch-examples-wlang-lstm | 1 | 294.38 | 289.43 | 1.71% | :white_check_mark: |
| torchvision-resnet50_1 | 1 | 470.40 | 470.04 | 0.08% | :white_check_mark: |
| cadene-dpn92_1 | 1 | 248.92 | 246.50 | 0.98% | :white_check_mark: |
| cadene-resnext101_1 | 1 | 204.32 | 205.62 | -0.63% | :white_check_mark: |
| onnx-taau-downsample | 1 | 203.99 | 204.80 | -0.39% | :white_check_mark: |
| dlrm-criteoterabyte | 1 | 22.82 | 22.91 | -0.36% | :white_check_mark: |
| dlrm-criteoterabyte_fp16 | 1 | 42.59 | 42.68 | -0.22% | :white_check_mark: |
| agentmodel | 1 | 6,266.06 | 7,833.18 | -20.01% | :red_circle: |
| unet_fp16 | 2 | 34.46 | 34.24 | 0.64% | :white_check_mark: |
| resnet50v1_fp16 | 1 | 583.97 | 583.34 | 0.11% | :white_check_mark: |
| resnet50v1_int8 | 1 | 585.32 | 581.62 | 0.64% | :white_check_mark: |
| bert_base_cased_fp16 | 64 | 642.51 | 646.23 | -0.58% | :white_check_mark: |
| bert_large_uncased_fp16 | 32 | 197.81 | 199.02 | -0.61% | :white_check_mark: |
| bert_large_fp16 | 1 | 117.42 | 117.58 | -0.14% | :white_check_mark: |
| distilgpt2_fp16 | 16 | 1,204.26 | 1,211.17 | -0.57% | :white_check_mark: |
| yolov5s | 1 | 300.23 | 301.48 | -0.41% | :white_check_mark: |
| tinyllama | 1 | 23.21 | 23.32 | -0.47% | :white_check_mark: |
| vicuna-fastchat | 1 | 133.27 | 133.07 | 0.15% | :white_check_mark: |
| whisper-tiny-encoder | 1 | 243.21 | 244.29 | -0.44% | :white_check_mark: |
| whisper-tiny-decoder | 1 | 255.47 | 256.34 | -0.34% | :white_check_mark: |
This build is not recommended to merge :red_circle:
:red_circle: bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output