gfx900 on ROCm 3.3 unexpectedly slower than P100 + CUDNN on DAWN benchmark
🐛 Bug
The baseline code in this Colab provides a PyTorch example running on a V100 GPU with cuDNN that trains a CIFAR10 classifier in 75 seconds.
epoch lr train time train loss train acc test time test loss test acc total time
24 0.0000 2.9813 0.0730 0.9793 0.2113 0.1852 0.9395 74.7050
I have 3 gfx900 GPUs (AMD Radeon VEGA Frontier Edition) and I was able to run the benchmark code provided in the Colab on a single GPU. However, the total_time required to train for 24 epochs is 898 seconds, over 10x the time required on the V100.
epoch train time train loss train acc valid time valid loss valid acc total time
1 35.9557 1.6445 0.4101 2.7075 1.3906 0.5057 37.6222
2 36.8833 0.9688 0.6527 2.7220 1.5379 0.4873 74.5055
3 37.1940 0.7708 0.7308 2.7407 1.1057 0.6209 111.6995
4 37.3202 0.6809 0.7644 2.7262 1.0873 0.6543 149.0197
5 37.3730 0.6303 0.7814 2.7267 0.7662 0.7439 186.3927
6 37.4254 0.5707 0.8033 2.7505 0.7760 0.7335 223.8180
7 37.4219 0.5135 0.8257 2.7333 1.0418 0.6868 261.2400
8 37.4235 0.4754 0.8374 2.7201 0.6852 0.7746 298.6635
9 37.4825 0.4468 0.8474 2.7391 0.5049 0.8242 336.1460
10 37.4648 0.4212 0.8541 2.7393 0.5091 0.8254 373.6108
11 37.4690 0.4038 0.8635 2.7577 0.5648 0.8098 411.0799
12 37.4701 0.3878 0.8683 2.7151 0.4781 0.8400 448.5499
13 37.4728 0.3604 0.8780 2.7244 0.4779 0.8344 486.0227
14 37.5206 0.3438 0.8831 2.7219 0.4146 0.8596 523.5433
15 37.4972 0.3303 0.8871 2.7291 0.3838 0.8686 561.0405
16 37.4629 0.3147 0.8932 2.7349 0.4036 0.8626 598.5034
17 37.4848 0.2917 0.9008 2.7495 0.4279 0.8572 635.9882
18 37.4828 0.2748 0.9071 2.7263 0.3441 0.8874 673.4710
19 37.5095 0.2476 0.9156 2.7240 0.3567 0.8813 710.9805
20 37.4723 0.2292 0.9246 2.7507 0.3055 0.8962 748.4528
21 37.4847 0.2036 0.9326 2.7579 0.3267 0.8899 785.9375
22 37.5244 0.1793 0.9410 2.7299 0.2591 0.9140 823.4618
23 37.5084 0.1536 0.9500 2.7428 0.2312 0.9214 860.9702
24 37.4976 0.1309 0.9579 2.7424 0.2077 0.9316 898.4678
The most comparable NVIDIA GPU to the gfx900 would probably be a P100, which should have about 80% of the FP32 FLOPS of a V100. So I would expect training on a gfx900 AMD GPU with ROCm to deliver about 80% of the performance reported in that table (74.7 seconds to train for 24 epochs).
However, my numbers are far worse than that: as you can see in the table above, training is over 10x slower.
To Reproduce
Steps to reproduce the behavior:
- Follow the installation instructions here and spin up the PyTorch Docker image: https://rocmdocs.amd.com/en/latest/Deep_learning/Deep-learning.html
sudo docker run -it -v $HOME:/data --privileged --rm --device=/dev/kfd --device=/dev/dri --group-add video rocm/pytorch:rocm3.3_ubuntu16.04_py3.6_pytorch
git clone -q https://github.com/davidcpage/cifar10-fast.git
cd cifar10-fast && python -m dawn --data_dir=~/data
- The code runs just fine but cudnn_warmup takes a long time and the training speed is slow.
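Before timing anything, a quick sanity check that the ROCm build actually sees the gfx900 devices can be done from Python. This is a minimal sketch for context (the comments describe what this setup should report, not captured output):
import torch

# Sanity-check sketch: ROCm/HIP devices are exposed through PyTorch's cuda API.
print(torch.__version__)              # e.g. 1.6.0a0+6d24f8f for this ROCm build
print(torch.cuda.is_available())      # should be True inside the rocm/pytorch container
print(torch.cuda.device_count())      # 3 for the three Vega Frontier Edition cards
print(torch.cuda.get_device_name(0))  # name of the gfx900 board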
Expected behavior
I expect the model to train in anywhere from 74-100 seconds, not 800+ seconds.
Environment
Attached is the output of running rocminfo: [rocminfo-stdout.txt](https://github.com/ROCmSoftwarePlatform/pytorch/files/4647360/rocminfo-stdout.txt)
- PyTorch Version (e.g., 1.0): 1.6.0a0+6d24f8f
- OS (e.g., Linux): Ubuntu Server 16.04
- How you installed PyTorch (conda, pip, source): HIPified source with ROCm 3.3
- Python version: 3.6
- CUDA/cuDNN version: N/A
- GPU models and configuration: See rocminfo-stdout.txt
- Any other relevant information:
Output of rocm-smi:
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 84.0c 160.0W 1348Mhz 945Mhz 40.0% auto 220.0W N/A 100%
1 65.0c 16.0W 1269Mhz 167Mhz 12.94% auto 220.0W N/A 0%
2 80.0c 12.0W 1269Mhz 945Mhz 20.0% auto 220.0W N/A 0%
Additional context
I was able to get access to a P100 machine on GCP; here are the benchmark numbers for the same code:
epoch train time train loss train acc valid time valid loss valid acc total time
1 9.2695 1.6323 0.4145 0.6062 1.5770 0.4603 12.9949
2 9.1892 0.9406 0.6636 0.6652 1.1614 0.6261 22.1842
3 9.1264 0.7288 0.7444 0.6381 0.7462 0.7452 31.3106
4 9.1475 0.6315 0.7801 0.6135 0.6586 0.7673 40.4581
5 9.1714 0.5591 0.8059 0.6035 0.6903 0.7708 49.6294
6 9.1669 0.4992 0.8275 0.5939 0.6249 0.7886 58.7963
7 9.1465 0.4476 0.8467 0.5916 0.5170 0.8260 67.9428
8 9.2538 0.4135 0.8576 0.5932 0.5194 0.8246 77.1966
9 9.1499 0.3811 0.8706 0.5921 0.4883 0.8364 86.3465
10 9.2431 0.3597 0.8776 0.6017 0.5184 0.8214 95.5896
11 9.2662 0.3474 0.8809 0.6009 0.3747 0.8752 104.8558
12 9.1181 0.3257 0.8891 0.5937 0.4581 0.8426 113.9739
13 9.1877 0.3098 0.8945 0.5955 0.5257 0.8233 123.1616
14 9.1525 0.2915 0.9013 0.5928 0.3652 0.8785 132.3141
15 9.1384 0.2692 0.9085 0.6055 0.4213 0.8616 141.4525
16 9.1531 0.2504 0.9152 0.5913 0.4404 0.8577 150.6056
17 9.1315 0.2302 0.9225 0.6321 0.3360 0.8869 159.7370
18 9.1520 0.2106 0.9294 0.5989 0.3159 0.8903 168.8891
19 9.1911 0.1871 0.9378 0.6079 0.3756 0.8756 178.0802
20 9.1137 0.1657 0.9456 0.6154 0.2744 0.9080 187.1939
21 9.1059 0.1414 0.9543 0.5929 0.2333 0.9224 196.2998
22 9.1593 0.1177 0.9629 0.6155 0.2091 0.9310 205.4591
23 9.1310 0.0950 0.9711 0.6292 0.1926 0.9362 214.5901
24 9.1834 0.0747 0.9789 0.5944 0.1803 0.9404 223.7735
I ran the torch bottleneck profiler. Here are the slowest ops on ROCm:
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
17427493 function calls (17155957 primitive calls) in 1003.873 seconds
Ordered by: internal time
List reduced from 2474 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
5665 895.829 0.158 895.829 0.158 {method 'to' of 'torch._C._TensorBase' objects}
2330 20.051 0.009 20.051 0.009 {method 'run_backward' of 'torch._C._EngineBase' objects}
22480 18.957 0.001 18.957 0.001 {built-in method conv2d}
2819 11.176 0.004 11.176 0.004 {method 'half' of 'torch._C._TensorBase' objects}
96 10.680 0.111 10.680 0.111 {method 'cpu' of 'torch._C._TensorBase' objects}
2808 6.260 0.002 6.260 0.002 {built-in method stack}
1788430 6.177 0.000 6.177 0.000 {method 'copy' of 'numpy.ndarray' objects}
1431936 4.349 0.000 4.349 0.000 {built-in method as_tensor}
Here are the corresponding P100 times:
16803213 function calls (16539458 primitive calls) in 246.693 seconds
Ordered by: internal time
List reduced from 3007 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
5665 159.468 0.028 159.468 0.028 {method 'to' of 'torch._C._TensorBase' objects}
1788351 12.893 0.000 12.893 0.000 {method 'copy' of 'numpy.ndarray' objects}
2808 10.097 0.004 10.097 0.004 {built-in method stack}
2330 7.985 0.003 7.985 0.003 {method 'run_backward' of 'torch._C._EngineBase' objects}
1431936 6.803 0.000 6.803 0.000 {built-in method as_tensor}
1191936 6.430 0.000 26.186 0.000 cifar10-fast/core.py:146(__getitem__)
5616 4.171 0.001 4.171 0.001 {method 'pin_memory' of 'torch._C._TensorBase' objects}
22480 3.103 0.000 3.103 0.000 {built-in method conv2d}
A few things stand out to me that don't seem attributable to differences in CPU (I forgot to mention that I'm running an AMD Ryzen Threadripper 1900X 8-Core Processor):
- {built-in method conv2d} is much slower
- {method 'to' of 'torch._C._TensorBase' objects} is much slower
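One caveat when reading these cProfile numbers: GPU kernels launch asynchronously, so the host-side wall time of a blocking call like Tensor.to() can absorb time from convolutions queued earlier. A minimal sketch of collecting per-op GPU times with the autograd profiler instead (assuming use_cuda=True works on this ROCm build; the conv shape is illustrative, not taken from the repo):
import torch
import torch.autograd.profiler as profiler

# Sketch: record per-op GPU time rather than host wall time.
model = torch.nn.Conv2d(64, 128, 3, padding=1).cuda().half()
x = torch.randn(512, 64, 32, 32, device="cuda", dtype=torch.float16)

with profiler.profile(use_cuda=True) as prof:
    for _ in range(10):
        model(x)
    torch.cuda.synchronize()  # make sure all queued kernels are captured

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))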
Friendly ping, any updates on this? Is there any information I can provide on my end that would be helpful?
Hi @ericjang, since you are presumably using ROCm 3.3, can you run RPT profiling on the workload and provide the logs? https://scchan.github.io/hcc/md__home_scchan_code_hcc_doc_markdown_hcc_profile.html
Attached is my prof.out file. It's about 97 MB, so I've uploaded it to Google Drive here: https://drive.google.com/file/d/18yP9tBZj4bN1Da5dEsaio1J4Og4O1eGs/view?usp=sharing
Here's the rpt summary output:
/opt/rocm/hcc/bin/rpt ~/cifar10-fast/prof.out
ROI_START: GPU0 0.000000: +0.00 kernel #0.0.1 5: _ZN2at6native6legacy18elementwise_kernelILi128ELi4EZNS0_15gpu_kernel_implIZZZNS0_21copy_device_to_deviceERNS_14TensorIteratorEbENK3$_0clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS5_RKT_EUliE2_EEviT1_
ROI_STOP : GPU0 484220.262591: +0.00 barrier #0.0.428092 428089: depcnt=0,acq=sys,rel=sys
ROI_TIME= 484.220 secs
Resource=GPU0 Showing 20/116 records 89.84% busy
Total(%) Time(us) Calls Avg(us) Min(us) Max(us) Name
65.81% 318678057.6 17802 17901.3 0.0 57786.7 MIOpenConvUni.kd
9.25% 44798797.4 14 3199914.1 153476.5 17637303.4 gap >=100000us
8.36% 40486481.2 4878 8299.8 0.0 231855.1 MIOpenCvBwdWrW.kd
4.05% 19592963.6 3888 5039.3 0.0 14955.4 _ZN2at6native12_GLOBAL__N_122max_pool_backward_nchwIN3c104HalfEfEEviPKT_PKliiiiiiiiiiiiiiPS5_
2.26% 10964965.3 20 548248.3 25905.3 2850992.1 gridwise_convolution_implicit_gemm_v4r1_nchw_kcyx_nkhw_lds_double_buffer.kd
1.90% 9194758.7 30 306492.0 176083.4 398877.5 _ZN2at6native6legacy18elementwise_kernelILi128ELi4EZNS0_16gpu_index_kernelIZNS0_17index_kernel_implINS0_10OpaqueTypeILi2EEEEEvRNS_14TensorIteratorEN3c108ArrayRefIlEESB_EUlPcSC_lE_EEvS8_SB_SB_RKT_EUliE_EEviT1_
1.13% 5475873.9 988 5542.4 0.0 9468.8 Cijk_Alik_Bljk_HBH_MT16x16x24_SE_APM1_AF0EM2_AF1EM1_AMAS3_ASBE01_ASEM2_BL1_DTL0_EPS1_FL0_GRVW2_GSU1_ISA900_IU1_K1_KLA_LPA0_LPB0_LDL1_NLCA3_NLCB3_ONLL1_PBD0_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT2_2_USFGRO1_VAW2_VW2_WG8_8_1_WGM8
1.09% 5281351.7 7776 679.2 0.0 9497.1 MIOpenBatchNormBwdSpatial.kd
0.64% 3081981.8 7776 396.3 0.0 8593.8 MIOpenBatchNormFwdTrainSpatial.kd
0.58% 2806455.3 5488 511.4 0.0 1959.7 _ZN2at6native12_GLOBAL__N_121max_pool_forward_nchwIN3c104HalfES4_EEviPKT_iiiiiiiiiiiiiiPS5_Pl
0.47% 2271842.6 7776 292.2 0.0 8910.5 _ZN2at6native6modern18elementwise_kernelIZZZZNS0_19elu_backward_kernelERNS_14TensorIteratorEN3c106ScalarES6_S6_ENK4$_11clEvENKUlvE1_clEvENKUlvE_clEvEUlNS5_4HalfESA_E_NS_6detail5ArrayIPcLi3EEEEEviT_T0_
0.42% 2030909.2 10976 185.0 0.0 8733.6 _ZN2at6native6modern18elementwise_kernelIZZZZNS0_10elu_kernelERNS_14TensorIteratorEN3c106ScalarES6_S6_ENK4$_10clEvENKUlvE1_clEvENKUlvE_clEvEUlNS5_4HalfEE_NS_6detail5ArrayIPcLi2EEEEEviT_T0_
0.40% 1954777.0 1956 999.4 0.0 1634.8 Cijk_Alik_Bljk_HBH_MT32x32x16_SE_APM1_AF0EM2_AF1EM1_AMAS3_ASBE01_ASEM2_BL1_DTL0_EPS1_FL0_GRVW4_GSU1_ISA900_IU1_K1_KLA_LPA0_LPB0_LDL1_NLCA1_NLCB1_ONLL1_PBD0_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT4_4_USFGRO0_VAW2_VW4_WG8_8_1_WGM8
0.40% 1950543.3 2927 666.4 0.0 6033.5 miopenGcnAsmWinogradXformFilter_3_3_4_4.kd
0.40% 1935987.1 52482 36.9 0.0 8328.3 _ZN2at6native6modern18elementwise_kernelIZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENK3$_0clEvENKUlvE8_clEvEUlNS5_4HalfES9_E_NS_6detail5ArrayIPcLi3EEEEEviT_T0_
0.38% 1861786.6 210 8865.7 962.1 181621.8 _ZN2at6native28kernel_pointwise_flip_apply2IN3c104HalfElEEvNS_4cuda6detail10TensorInfoIT_T0_EES9_S8_iS8_
0.37% 1770109.8 69 25653.8 10200.0 96879.2 gap 10000us-100000us
0.35% 1692178.8 410453 4.1 0.0 10.0 gap <10us
0.30% 1432763.5 2927 489.5 0.0 9418.5 miopenGcnAsmWinogradXformData_3_3_4_4.kd
0.16% 762718.7 1376 554.3 275.1 1543.1 Cijk_Ailk_Bljk_HBH_MT64x64x4_SE_APM1_AF0EM1_AF1EM1_AMAS2_ASBE01_ASEM1_BL0_DTL0_EPS0_FL0_GRVW2_GSU1_ISA000_IU1_K1_KLS_LPA0_LPB0_LDL1_NLCA1_NLCB1_ONLL1_PBD0_PK0_PGR0_PLR0_RK0_SU32_SNLL0_TT4_4_USFGRO0_VAW2_VW2_WG16_16_1_WGM8
Resource=DATA Showing 20/28 records 0.01% busy
Total(%) Time(us) Calls Avg(us) Min(us) Max(us) Name
89.39% 432827972.8 1371 315702.4 127628.8 8065649.0 gap >=100000us
0.34% 1668421.8 33 50558.2 10779.0 74737.4 gap 10000us-100000us
0.01% 58906.2 18 3272.6 1052.0 8928.7 gap 1000us-10000us
0.01% 43082.5 1540 28.0 23.8 49.0 DeviceToDevice_async_fast_3145728_bytes
0.01% 36768.3 208 176.8 104.7 973.3 gap 100us-1000us
0.00% 1865.1 249 7.5 6.7 12.7 HostToDevice_async_fast_24576_bytes
0.00% 1695.4 195 8.7 7.7 9.9 gap <10us
0.00% 576.6 18 32.0 24.0 38.9 gap 20us-50us
0.00% 532.7 30 17.8 15.6 20.7 DeviceToDevice_async_fast_1671168_bytes
0.00% 506.0 6 84.3 52.5 99.8 gap 50us-100us
0.00% 340.0 2 170.0 169.8 170.2 DeviceToDevice_async_fast_4718592_bytes
0.00% 282.4 19 14.9 10.4 18.5 gap 10us-20us
0.00% 86.1 1 86.1 86.1 86.1 DeviceToDevice_async_fast_2359296_bytes
0.00% 67.2 6 11.2 9.5 12.2 DeviceToDevice_async_fast_32768_bytes
0.00% 65.4 8 8.2 7.9 8.6 DeviceToDevice_async_fast_8_bytes
0.00% 56.8 6 9.5 8.2 10.2 DeviceToDevice_async_fast_2048_bytes
0.00% 50.4 6 8.4 8.2 8.6 DeviceToDevice_async_fast_8192_bytes
0.00% 48.8 6 8.1 8.0 8.3 DeviceToDevice_async_fast_512_bytes
0.00% 30.2 2 15.1 15.1 15.1 DeviceToDevice_async_fast_294912_bytes
0.00% 24.9 1 24.9 24.9 24.9 DeviceToDevice_async_fast_589824_bytes
There also seems to be a bug with prof.json, as Chrome complains about its formatting:
SyntaxError: Unexpected token : in JSON at position 119283293
at JSON.parse (<anonymous>)
at new TraceEventImporter (chrome://tracing/tracing.js:6177:19)
at Import.createImporter_ (chrome://tracing/tracing.js:2020:8)
at chrome://tracing/tracing.js:2014:167
at Task.run (chrome://tracing/tracing.js:3247:95)
at runAnother (chrome://tracing/tracing.js:3250:371)
at runTask (chrome://tracing/tracing.js:3015:57)
at processIdleWork (chrome://tracing/tracing.js:3020:116)
at window.requestIdleCallback.timeout (chrome://tracing/tracing.js:3013:81)
Thanks @ericjang, from the RPT profiling results the GPU utilization rate looks reasonable (~90% over the training session). Most of the time was spent in MIOpenConvUni.kd. @daniellowell can you comment on what that kernel represents in ROCm 3.3?
Any tips on what I should do to speed things up? I'm training a fairly standard convnet setup, so I expect this will be a significant issue for PyTorch users when comparing AMD and NVIDIA hardware.
Friendly ping - any updates? I use AMD gpus for hobby deep learning projects at home and it's demoralizing that I can't train my NNs very fast (not to mention multi-gpu issues, but I wanted to bring up one issue at a time).
Hi @ericjang, I'm able to reproduce your observations locally and am looking into the problem; I will keep you posted soon.
The workload in this issue uses FP16 precision, for which MIOpen doesn't have optimized convolution kernels on GFX900 GPUs. Here is the list of convolution configs involved:
./bin/MIOpenDriver convfp16 -n 272 -c 128 -H 16 -W 16 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 128 -H 16 -W 16 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 128 -H 16 -W 16 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 128 -H 16 -W 16 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 128 -H 16 -W 16 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 128 -H 16 -W 16 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 256 -H 8 -W 8 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 256 -H 8 -W 8 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 256 -H 8 -W 8 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 3 -H 32 -W 32 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 3 -H 32 -W 32 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 512 -H 4 -W 4 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 512 -H 4 -W 4 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 512 -H 4 -W 4 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 64 -H 32 -W 32 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 64 -H 32 -W 32 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 64 -H 32 -W 32 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 128 -H 16 -W 16 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 128 -H 16 -W 16 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 128 -H 16 -W 16 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 128 -H 16 -W 16 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 128 -H 16 -W 16 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 128 -H 16 -W 16 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 256 -H 8 -W 8 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 256 -H 8 -W 8 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 256 -H 8 -W 8 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 3 -H 32 -W 32 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 3 -H 32 -W 32 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 512 -H 4 -W 4 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 512 -H 4 -W 4 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 512 -H 4 -W 4 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 64 -H 32 -W 32 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 64 -H 32 -W 32 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 64 -H 32 -W 32 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
I can try to tune the configs and see if there are any improvements. However, the recommendation is to use FP32 workloads on GFX900 GPUs.
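As a rough cross-check from the PyTorch side, one of the configs above can be replayed in both precisions and compared directly. A sketch only (it reuses the -n 512 -c 64 -H 32 -W 32 -k 128 3x3 shape from the list; the warm-up iterations keep MIOpen's find/compile phase out of the measurement):
import time
import torch

def time_conv(dtype, iters=20):
    # Replays the -n 512 -c 64 -H 32 -W 32 -k 128 3x3 config in the given precision
    # and returns the average forward time in milliseconds.
    x = torch.randn(512, 64, 32, 32, device="cuda", dtype=dtype)
    conv = torch.nn.Conv2d(64, 128, 3, padding=1, bias=False).to(device="cuda", dtype=dtype)
    for _ in range(5):            # warm-up: exclude MIOpen's find/compile phase
        conv(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        conv(x)
    torch.cuda.synchronize()      # kernels are async; wait before reading the clock
    return (time.time() - start) / iters * 1e3

print(f"fp16: {time_conv(torch.float16):.2f} ms")
print(f"fp32: {time_conv(torch.float32):.2f} ms")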
I've tried to tune the involved FP16 configs with MIOpen. In the ROCm 3.5 docker container, before tuning:
epoch train time train loss train acc valid time valid loss valid acc total time
1 37.1966 1.6711 0.3951 20.2976 1.6296 0.4529 38.8472
2 37.3670 0.9893 0.6476 2.8158 2.1502 0.4324 76.2142
After tuning:
epoch train time train loss train acc valid time valid loss valid acc total time
1 29.4429 1.6631 0.3956 1.5667 1.2477 0.5471 31.0917
2 29.5801 0.9720 0.6527 1.5812 1.9774 0.4506 60.6717
There is some improvement, but not that huge. If you'd like to try it, set the following environment variable and execute the MIOpenDriver commands I posted in the previous comment from the /opt/rocm/miopen folder:
export MIOPEN_FIND_ENFORCE=4
The retuned perfDB will be saved under the ~/.config/miopen folder after re-tuning.
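The same variable should also take effect when the convolutions are triggered from PyTorch itself, since MIOpen reads it in whatever client process runs the find step. A hedged sketch (the conv shape is one of the configs listed earlier; everything else is illustrative):
import os

# MIOPEN_FIND_ENFORCE has to be set before MIOpen's first find() call, i.e. before
# the first convolution runs, so set it before any GPU work.
os.environ["MIOPEN_FIND_ENFORCE"] = "4"

import torch

# Running the workload's conv shapes once in this mode should trigger the exhaustive
# search and update the per-user perfDB under ~/.config/miopen.
x = torch.randn(512, 64, 32, 32, device="cuda", dtype=torch.float16)
conv = torch.nn.Conv2d(64, 128, 3, padding=1, bias=False).cuda().half()
conv(x)
torch.cuda.synchronize()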
Hi, thanks all for the recommendations. Switching to fp32 does indeed seem to speed things up. Dawn.py (21 sec / step):
epoch train time train loss train acc valid time valid loss valid acc total time
1 20.5358 1.6424 0.4103 11.0456 1.2769 0.5449 21.9777
2 20.6112 0.9497 0.6602 0.8756 1.0673 0.6319 42.5889
3 21.3357 0.7417 0.7379 0.9663 0.8248 0.7225 63.9246
4 21.9944 0.6333 0.7788 1.1072 0.7426 0.7515 85.9190
5 22.0416 0.5572 0.8077 1.1083 0.6171 0.7865 107.9606
6 22.1669 0.5049 0.8271 1.1044 0.5599 0.8052 130.1274
7 22.0657 0.4467 0.8460 1.1175 0.5430 0.8118 152.1931
8 22.0669 0.4083 0.8599 1.1080 0.4409 0.8470 174.2600
9 22.1657 0.3880 0.8681 1.1199 0.5545 0.8077 196.4257
10 22.1372 0.3619 0.8751 1.1281 0.4469 0.8509 218.5629
11 22.0678 0.3465 0.8827 1.1284 0.5636 0.8094 240.6307
12 22.1248 0.3281 0.8883 1.1155 0.3705 0.8764 262.7555
13 22.1635 0.3088 0.8947 1.1380 0.3938 0.8673 284.9190
14 22.1789 0.2901 0.9017 1.1258 0.3813 0.8675 307.0979
15 22.1483 0.2778 0.9060 1.1149 0.4508 0.8555 329.2462
16 22.0802 0.2504 0.9160 1.1180 0.3513 0.8813 351.3264
17 22.1499 0.2349 0.9208 1.1190 0.3639 0.8768 373.4763
18 22.1257 0.2092 0.9307 1.1353 0.3258 0.8863 395.6020
19 22.1024 0.1891 0.9364 1.1229 0.3602 0.8813 417.7044
20 22.1497 0.1647 0.9457 1.1333 0.2721 0.9102 439.8542
21 22.1744 0.1432 0.9529 1.1308 0.2555 0.9187 462.0285
22 22.1492 0.1162 0.9632 1.1272 0.2222 0.9261 484.1777
23 22.2179 0.0963 0.9715 1.1313 0.1984 0.9350 506.3957
24 22.1468 0.0774 0.9783 1.1300 0.1819 0.9385 528.5425
New_dawn.py (15 sec / step):
epoch train time train loss train acc valid time valid loss valid acc total time
1 15.5448 1.7815 0.4852 1.4130 2.2316 0.2021 15.9537
2 14.7322 1.3284 0.7671 1.4118 1.5720 0.7206 30.6859
3 14.8322 1.2203 0.8266 1.4755 1.1691 0.8712 45.5182
4 15.7673 1.1612 0.8572 1.4851 1.0926 0.8959 61.2855
5 15.9183 1.1168 0.8814 1.7515 1.0578 0.9112 77.2038
6 16.5861 1.0786 0.9017 1.8752 1.0345 0.9212 93.7899
7 16.6969 1.0445 0.9203 1.8199 1.0193 0.9262 110.4869
8 16.7539 1.0023 0.9431 1.8484 1.0055 0.9315 127.2407
9 16.7829 0.9681 0.9612 1.8855 0.9964 0.9368 144.0236
10 16.8777 0.9547 0.9683 1.8347 0.9902 0.9400 160.9012
A pleasant surprise was that fp32 seems to converge faster than fp16 toward the 94% top-1 accuracy target.
Will there eventually be support for faster fp16 miopen kernels on gfx900?
Is there any progress?