gfx900 on ROCm 3.3 unexpectedly slower than P100 + CUDNN on DAWN benchmark
🐛 Bug
The baseline code in this Colab provides a PyTorch example running on a V100 GPU with cuDNN that trains a CIFAR10 classifier in 75 seconds.
epoch lr train time train loss train acc test time test loss test acc total time
24 0.0000 2.9813 0.0730 0.9793 0.2113 0.1852 0.9395 74.7050
I have 3 gfx900 GPUs (AMD Radeon VEGA Frontier Edition) and I was able to run the benchmark code provided in the Colab on a single GPU. However, the total_time required to train for 24 epochs is 898 seconds, over 10x the time required on the V100.
epoch train time train loss train acc valid time valid loss valid acc total time
1 35.9557 1.6445 0.4101 2.7075 1.3906 0.5057 37.6222
2 36.8833 0.9688 0.6527 2.7220 1.5379 0.4873 74.5055
3 37.1940 0.7708 0.7308 2.7407 1.1057 0.6209 111.6995
4 37.3202 0.6809 0.7644 2.7262 1.0873 0.6543 149.0197
5 37.3730 0.6303 0.7814 2.7267 0.7662 0.7439 186.3927
6 37.4254 0.5707 0.8033 2.7505 0.7760 0.7335 223.8180
7 37.4219 0.5135 0.8257 2.7333 1.0418 0.6868 261.2400
8 37.4235 0.4754 0.8374 2.7201 0.6852 0.7746 298.6635
9 37.4825 0.4468 0.8474 2.7391 0.5049 0.8242 336.1460
10 37.4648 0.4212 0.8541 2.7393 0.5091 0.8254 373.6108
11 37.4690 0.4038 0.8635 2.7577 0.5648 0.8098 411.0799
12 37.4701 0.3878 0.8683 2.7151 0.4781 0.8400 448.5499
13 37.4728 0.3604 0.8780 2.7244 0.4779 0.8344 486.0227
14 37.5206 0.3438 0.8831 2.7219 0.4146 0.8596 523.5433
15 37.4972 0.3303 0.8871 2.7291 0.3838 0.8686 561.0405
16 37.4629 0.3147 0.8932 2.7349 0.4036 0.8626 598.5034
17 37.4848 0.2917 0.9008 2.7495 0.4279 0.8572 635.9882
18 37.4828 0.2748 0.9071 2.7263 0.3441 0.8874 673.4710
19 37.5095 0.2476 0.9156 2.7240 0.3567 0.8813 710.9805
20 37.4723 0.2292 0.9246 2.7507 0.3055 0.8962 748.4528
21 37.4847 0.2036 0.9326 2.7579 0.3267 0.8899 785.9375
22 37.5244 0.1793 0.9410 2.7299 0.2591 0.9140 823.4618
23 37.5084 0.1536 0.9500 2.7428 0.2312 0.9214 860.9702
24 37.4976 0.1309 0.9579 2.7424 0.2077 0.9316 898.4678
The most comparable NVIDIA GPU to the gfx900 would probably be a P100, which should have about 80% of the FP32 FLOPS of a V100. So I would expect training on a gfx900 AMD GPU with ROCm to deliver about 80% of the performance reported in that table (74.7 seconds to train for 24 epochs).
However, my numbers are far worse than that: as you can see in the table above, training is over 10x slower.
To Reproduce
Steps to reproduce the behavior:
- Follow the installation instructions here and spin up the PyTorch Docker image: https://rocmdocs.amd.com/en/latest/Deep_learning/Deep-learning.html
sudo docker run -it -v $HOME:/data --privileged --rm --device=/dev/kfd --device=/dev/dri --group-add video rocm/pytorch:rocm3.3_ubuntu16.04_py3.6_pytorch
git clone -q https://github.com/davidcpage/cifar10-fast.git
cd cifar10-fast && python -m dawn --data_dir=~/data
- The code runs just fine but cudnn_warmup takes a long time and the training speed is slow.
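Before timing anything, a quick sanity check that the ROCm build actually sees the gfx900 devices can be done from Python. This is a minimal sketch for context (the comments describe what this setup should report, not captured output):
import torch

# Sanity-check sketch: ROCm/HIP devices are exposed through PyTorch's cuda API.
print(torch.__version__)              # e.g. 1.6.0a0+6d24f8f for this ROCm build
print(torch.cuda.is_available())      # should be True inside the rocm/pytorch container
print(torch.cuda.device_count())      # 3 for the three Vega Frontier Edition cards
print(torch.cuda.get_device_name(0))  # name of the gfx900 board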
Expected behavior
I expect the model to train in anywhere from 74-100 seconds, not 800+ seconds.
Environment
Attached is the output of running rocminfo: [rocminfo-stdout.txt](https://github.com/ROCmSoftwarePlatform/pytorch/files/4647360/rocminfo-stdout.txt)
- PyTorch Version (e.g., 1.0): 1.6.0a0+6d24f8f
- OS (e.g., Linux): Ubuntu Server 16.04
- How you installed PyTorch (conda, pip, source): HIPified source with ROCm 3.3
- Python version: 3.6
- CUDA/cuDNN version: N/A
- GPU models and configuration: See rocminfo-stdout.txt
- Any other relevant information:
Output of rocm-smi:
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 84.0c 160.0W 1348Mhz 945Mhz 40.0% auto 220.0W N/A 100%
1 65.0c 16.0W 1269Mhz 167Mhz 12.94% auto 220.0W N/A 0%
2 80.0c 12.0W 1269Mhz 945Mhz 20.0% auto 220.0W N/A 0%
Additional context
I was able to get access to a P100 machine on GCP; here are the benchmark numbers for the same code:
epoch train time train loss train acc valid time valid loss valid acc total time
1 9.2695 1.6323 0.4145 0.6062 1.5770 0.4603 12.9949
2 9.1892 0.9406 0.6636 0.6652 1.1614 0.6261 22.1842
3 9.1264 0.7288 0.7444 0.6381 0.7462 0.7452 31.3106
4 9.1475 0.6315 0.7801 0.6135 0.6586 0.7673 40.4581
5 9.1714 0.5591 0.8059 0.6035 0.6903 0.7708 49.6294
6 9.1669 0.4992 0.8275 0.5939 0.6249 0.7886 58.7963
7 9.1465 0.4476 0.8467 0.5916 0.5170 0.8260 67.9428
8 9.2538 0.4135 0.8576 0.5932 0.5194 0.8246 77.1966
9 9.1499 0.3811 0.8706 0.5921 0.4883 0.8364 86.3465
10 9.2431 0.3597 0.8776 0.6017 0.5184 0.8214 95.5896
11 9.2662 0.3474 0.8809 0.6009 0.3747 0.8752 104.8558
12 9.1181 0.3257 0.8891 0.5937 0.4581 0.8426 113.9739
13 9.1877 0.3098 0.8945 0.5955 0.5257 0.8233 123.1616
14 9.1525 0.2915 0.9013 0.5928 0.3652 0.8785 132.3141
15 9.1384 0.2692 0.9085 0.6055 0.4213 0.8616 141.4525
16 9.1531 0.2504 0.9152 0.5913 0.4404 0.8577 150.6056
17 9.1315 0.2302 0.9225 0.6321 0.3360 0.8869 159.7370
18 9.1520 0.2106 0.9294 0.5989 0.3159 0.8903 168.8891
19 9.1911 0.1871 0.9378 0.6079 0.3756 0.8756 178.0802
20 9.1137 0.1657 0.9456 0.6154 0.2744 0.9080 187.1939
21 9.1059 0.1414 0.9543 0.5929 0.2333 0.9224 196.2998
22 9.1593 0.1177 0.9629 0.6155 0.2091 0.9310 205.4591
23 9.1310 0.0950 0.9711 0.6292 0.1926 0.9362 214.5901
24 9.1834 0.0747 0.9789 0.5944 0.1803 0.9404 223.7735
I ran the torch bottleneck profiler. Here are the slowest ops on ROCm:
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
17427493 function calls (17155957 primitive calls) in 1003.873 seconds
Ordered by: internal time
List reduced from 2474 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
5665 895.829 0.158 895.829 0.158 {method 'to' of 'torch._C._TensorBase' objects}
2330 20.051 0.009 20.051 0.009 {method 'run_backward' of 'torch._C._EngineBase' objects}
22480 18.957 0.001 18.957 0.001 {built-in method conv2d}
2819 11.176 0.004 11.176 0.004 {method 'half' of 'torch._C._TensorBase' objects}
96 10.680 0.111 10.680 0.111 {method 'cpu' of 'torch._C._TensorBase' objects}
2808 6.260 0.002 6.260 0.002 {built-in method stack}
1788430 6.177 0.000 6.177 0.000 {method 'copy' of 'numpy.ndarray' objects}
1431936 4.349 0.000 4.349 0.000 {built-in method as_tensor}
Here are the corresponding P100 times:
16803213 function calls (16539458 primitive calls) in 246.693 seconds
Ordered by: internal time
List reduced from 3007 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
5665 159.468 0.028 159.468 0.028 {method 'to' of 'torch._C._TensorBase' objects}
1788351 12.893 0.000 12.893 0.000 {method 'copy' of 'numpy.ndarray' objects}
2808 10.097 0.004 10.097 0.004 {built-in method stack}
2330 7.985 0.003 7.985 0.003 {method 'run_backward' of 'torch._C._EngineBase' objects}
1431936 6.803 0.000 6.803 0.000 {built-in method as_tensor}
1191936 6.430 0.000 26.186 0.000 cifar10-fast/core.py:146(__getitem__)
5616 4.171 0.001 4.171 0.001 {method 'pin_memory' of 'torch._C._TensorBase' objects}
22480 3.103 0.000 3.103 0.000 {built-in method conv2d}
A few things stand out to me that don't seem attributable to differences in CPU (I forgot to mention that I'm running an AMD Ryzen Threadripper 1900X 8-Core Processor):
- {built-in method conv2d} is much slower
- {method 'to' of 'torch._C._TensorBase' objects} is much slower
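One caveat when reading these cProfile numbers: GPU kernels launch asynchronously, so the host-side wall time of a blocking call like Tensor.to() can absorb time from convolutions queued earlier. A minimal sketch of collecting per-op GPU times with the autograd profiler instead (assuming use_cuda=True works on this ROCm build; the conv shape is illustrative, not taken from the repo):
import torch
import torch.autograd.profiler as profiler

# Sketch: record per-op GPU time rather than host wall time.
model = torch.nn.Conv2d(64, 128, 3, padding=1).cuda().half()
x = torch.randn(512, 64, 32, 32, device="cuda", dtype=torch.float16)

with profiler.profile(use_cuda=True) as prof:
    for _ in range(10):
        model(x)
    torch.cuda.synchronize()  # make sure all queued kernels are captured

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))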
Friendly ping, any updates on this? Is there any information I can provide on my end that would be helpful?
Hi @ericjang, since you are presumably using ROCm 3.3, can you run RPT profiling on the workload and provide the logs? https://scchan.github.io/hcc/md__home_scchan_code_hcc_doc_markdown_hcc_profile.html
Attached is my prof.out file. It's about 97 MB, so I've uploaded it to Google Drive here: https://drive.google.com/file/d/18yP9tBZj4bN1Da5dEsaio1J4Og4O1eGs/view?usp=sharing
Here's the rpt summary output:
/opt/rocm/hcc/bin/rpt ~/cifar10-fast/prof.out
ROI_START: GPU0 0.000000: +0.00 kernel #0.0.1 5: _ZN2at6native6legacy18elementwise_kernelILi128ELi4EZNS0_15gpu_kernel_implIZZZNS0_21copy_device_to_deviceERNS_14TensorIteratorEbENK3$_0clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS5_RKT_EUliE2_EEviT1_
ROI_STOP : GPU0 484220.262591: +0.00 barrier #0.0.428092 428089: depcnt=0,acq=sys,rel=sys
ROI_TIME= 484.220 secs
Resource=GPU0 Showing 20/116 records 89.84% busy
Total(%) Time(us) Calls Avg(us) Min(us) Max(us) Name
65.81% 318678057.6 17802 17901.3 0.0 57786.7 MIOpenConvUni.kd
9.25% 44798797.4 14 3199914.1 153476.5 17637303.4 gap >=100000us
8.36% 40486481.2 4878 8299.8 0.0 231855.1 MIOpenCvBwdWrW.kd
4.05% 19592963.6 3888 5039.3 0.0 14955.4 _ZN2at6native12_GLOBAL__N_122max_pool_backward_nchwIN3c104HalfEfEEviPKT_PKliiiiiiiiiiiiiiPS5_
2.26% 10964965.3 20 548248.3 25905.3 2850992.1 gridwise_convolution_implicit_gemm_v4r1_nchw_kcyx_nkhw_lds_double_buffer.kd
1.90% 9194758.7 30 306492.0 176083.4 398877.5 _ZN2at6native6legacy18elementwise_kernelILi128ELi4EZNS0_16gpu_index_kernelIZNS0_17index_kernel_implINS0_10OpaqueTypeILi2EEEEEvRNS_14TensorIteratorEN3c108ArrayRefIlEESB_EUlPcSC_lE_EEvS8_SB_SB_RKT_EUliE_EEviT1_
1.13% 5475873.9 988 5542.4 0.0 9468.8 Cijk_Alik_Bljk_HBH_MT16x16x24_SE_APM1_AF0EM2_AF1EM1_AMAS3_ASBE01_ASEM2_BL1_DTL0_EPS1_FL0_GRVW2_GSU1_ISA900_IU1_K1_KLA_LPA0_LPB0_LDL1_NLCA3_NLCB3_ONLL1_PBD0_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT2_2_USFGRO1_VAW2_VW2_WG8_8_1_WGM8
1.09% 5281351.7 7776 679.2 0.0 9497.1 MIOpenBatchNormBwdSpatial.kd
0.64% 3081981.8 7776 396.3 0.0 8593.8 MIOpenBatchNormFwdTrainSpatial.kd
0.58% 2806455.3 5488 511.4 0.0 1959.7 _ZN2at6native12_GLOBAL__N_121max_pool_forward_nchwIN3c104HalfES4_EEviPKT_iiiiiiiiiiiiiiPS5_Pl
0.47% 2271842.6 7776 292.2 0.0 8910.5 _ZN2at6native6modern18elementwise_kernelIZZZZNS0_19elu_backward_kernelERNS_14TensorIteratorEN3c106ScalarES6_S6_ENK4$_11clEvENKUlvE1_clEvENKUlvE_clEvEUlNS5_4HalfESA_E_NS_6detail5ArrayIPcLi3EEEEEviT_T0_
0.42% 2030909.2 10976 185.0 0.0 8733.6 _ZN2at6native6modern18elementwise_kernelIZZZZNS0_10elu_kernelERNS_14TensorIteratorEN3c106ScalarES6_S6_ENK4$_10clEvENKUlvE1_clEvENKUlvE_clEvEUlNS5_4HalfEE_NS_6detail5ArrayIPcLi2EEEEEviT_T0_
0.40% 1954777.0 1956 999.4 0.0 1634.8 Cijk_Alik_Bljk_HBH_MT32x32x16_SE_APM1_AF0EM2_AF1EM1_AMAS3_ASBE01_ASEM2_BL1_DTL0_EPS1_FL0_GRVW4_GSU1_ISA900_IU1_K1_KLA_LPA0_LPB0_LDL1_NLCA1_NLCB1_ONLL1_PBD0_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT4_4_USFGRO0_VAW2_VW4_WG8_8_1_WGM8
0.40% 1950543.3 2927 666.4 0.0 6033.5 miopenGcnAsmWinogradXformFilter_3_3_4_4.kd
0.40% 1935987.1 52482 36.9 0.0 8328.3 _ZN2at6native6modern18elementwise_kernelIZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENK3$_0clEvENKUlvE8_clEvEUlNS5_4HalfES9_E_NS_6detail5ArrayIPcLi3EEEEEviT_T0_
0.38% 1861786.6 210 8865.7 962.1 181621.8 _ZN2at6native28kernel_pointwise_flip_apply2IN3c104HalfElEEvNS_4cuda6detail10TensorInfoIT_T0_EES9_S8_iS8_
0.37% 1770109.8 69 25653.8 10200.0 96879.2 gap 10000us-100000us
0.35% 1692178.8 410453 4.1 0.0 10.0 gap <10us
0.30% 1432763.5 2927 489.5 0.0 9418.5 miopenGcnAsmWinogradXformData_3_3_4_4.kd
0.16% 762718.7 1376 554.3 275.1 1543.1 Cijk_Ailk_Bljk_HBH_MT64x64x4_SE_APM1_AF0EM1_AF1EM1_AMAS2_ASBE01_ASEM1_BL0_DTL0_EPS0_FL0_GRVW2_GSU1_ISA000_IU1_K1_KLS_LPA0_LPB0_LDL1_NLCA1_NLCB1_ONLL1_PBD0_PK0_PGR0_PLR0_RK0_SU32_SNLL0_TT4_4_USFGRO0_VAW2_VW2_WG16_16_1_WGM8
Resource=DATA Showing 20/28 records 0.01% busy
Total(%) Time(us) Calls Avg(us) Min(us) Max(us) Name
89.39% 432827972.8 1371 315702.4 127628.8 8065649.0 gap >=100000us
0.34% 1668421.8 33 50558.2 10779.0 74737.4 gap 10000us-100000us
0.01% 58906.2 18 3272.6 1052.0 8928.7 gap 1000us-10000us
0.01% 43082.5 1540 28.0 23.8 49.0 DeviceToDevice_async_fast_3145728_bytes
0.01% 36768.3 208 176.8 104.7 973.3 gap 100us-1000us
0.00% 1865.1 249 7.5 6.7 12.7 HostToDevice_async_fast_24576_bytes
0.00% 1695.4 195 8.7 7.7 9.9 gap <10us
0.00% 576.6 18 32.0 24.0 38.9 gap 20us-50us
0.00% 532.7 30 17.8 15.6 20.7 DeviceToDevice_async_fast_1671168_bytes
0.00% 506.0 6 84.3 52.5 99.8 gap 50us-100us
0.00% 340.0 2 170.0 169.8 170.2 DeviceToDevice_async_fast_4718592_bytes
0.00% 282.4 19 14.9 10.4 18.5 gap 10us-20us
0.00% 86.1 1 86.1 86.1 86.1 DeviceToDevice_async_fast_2359296_bytes
0.00% 67.2 6 11.2 9.5 12.2 DeviceToDevice_async_fast_32768_bytes
0.00% 65.4 8 8.2 7.9 8.6 DeviceToDevice_async_fast_8_bytes
0.00% 56.8 6 9.5 8.2 10.2 DeviceToDevice_async_fast_2048_bytes
0.00% 50.4 6 8.4 8.2 8.6 DeviceToDevice_async_fast_8192_bytes
0.00% 48.8 6 8.1 8.0 8.3 DeviceToDevice_async_fast_512_bytes
0.00% 30.2 2 15.1 15.1 15.1 DeviceToDevice_async_fast_294912_bytes
0.00% 24.9 1 24.9 24.9 24.9 DeviceToDevice_async_fast_589824_bytes
There also seems to be a bug with prof.json, as Chrome complains about its formatting:
SyntaxError: Unexpected token : in JSON at position 119283293
at JSON.parse (<anonymous>)
at new TraceEventImporter (chrome://tracing/tracing.js:6177:19)
at Import.createImporter_ (chrome://tracing/tracing.js:2020:8)
at chrome://tracing/tracing.js:2014:167
at Task.run (chrome://tracing/tracing.js:3247:95)
at runAnother (chrome://tracing/tracing.js:3250:371)
at runTask (chrome://tracing/tracing.js:3015:57)
at processIdleWork (chrome://tracing/tracing.js:3020:116)
at window.requestIdleCallback.timeout (chrome://tracing/tracing.js:3013:81)
Thanks @ericjang, from the RPT profiling results the GPU utilization rate looks reasonable (~90% over the training session). Most of the time was spent in MIOpenConvUni.kd. @daniellowell can you comment on what that kernel represents in ROCm 3.3?
Any tips on what I should do to speed things up? I'm training a fairly standard convnet setup, so I expect this will be a significant issue for PyTorch users when comparing AMD and NVIDIA hardware.
Friendly ping - any updates? I use AMD gpus for hobby deep learning projects at home and it's demoralizing that I can't train my NNs very fast (not to mention multi-gpu issues, but I wanted to bring up one issue at a time).
Hi @ericjang, I'm able to reproduce your observations locally and am looking into the problem; I will keep you posted soon.
The workload in this issue uses FP16 precision, for which MIOpen doesn't have optimized convolution kernels on GFX900 GPUs. Here is the list of convolution configs involved:
./bin/MIOpenDriver convfp16 -n 272 -c 128 -H 16 -W 16 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 128 -H 16 -W 16 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 128 -H 16 -W 16 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 128 -H 16 -W 16 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 128 -H 16 -W 16 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 128 -H 16 -W 16 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 256 -H 8 -W 8 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 256 -H 8 -W 8 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 256 -H 8 -W 8 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 3 -H 32 -W 32 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 3 -H 32 -W 32 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 512 -H 4 -W 4 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 512 -H 4 -W 4 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 512 -H 4 -W 4 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 64 -H 32 -W 32 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 64 -H 32 -W 32 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 272 -c 64 -H 32 -W 32 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 128 -H 16 -W 16 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 128 -H 16 -W 16 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 128 -H 16 -W 16 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 128 -H 16 -W 16 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 128 -H 16 -W 16 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 128 -H 16 -W 16 -k 256 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 256 -H 8 -W 8 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 256 -H 8 -W 8 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 256 -H 8 -W 8 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 3 -H 32 -W 32 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 3 -H 32 -W 32 -k 64 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 512 -H 4 -W 4 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 512 -H 4 -W 4 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 512 -H 4 -W 4 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 64 -H 32 -W 32 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 64 -H 32 -W 32 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
./bin/MIOpenDriver convfp16 -n 512 -c 64 -H 32 -W 32 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
I can try to tune the configs and see if there are any improvements. However, the recommendation is to use FP32 workloads on GFX900 GPUs.
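As a rough cross-check from the PyTorch side, one of the configs above can be replayed in both precisions and compared directly. A sketch only (it reuses the -n 512 -c 64 -H 32 -W 32 -k 128 3x3 shape from the list; the warm-up iterations keep MIOpen's find/compile phase out of the measurement):
import time
import torch

def time_conv(dtype, iters=20):
    # Replays the -n 512 -c 64 -H 32 -W 32 -k 128 3x3 config in the given precision
    # and returns the average forward time in milliseconds.
    x = torch.randn(512, 64, 32, 32, device="cuda", dtype=dtype)
    conv = torch.nn.Conv2d(64, 128, 3, padding=1, bias=False).to(device="cuda", dtype=dtype)
    for _ in range(5):            # warm-up: exclude MIOpen's find/compile phase
        conv(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        conv(x)
    torch.cuda.synchronize()      # kernels are async; wait before reading the clock
    return (time.time() - start) / iters * 1e3

print(f"fp16: {time_conv(torch.float16):.2f} ms")
print(f"fp32: {time_conv(torch.float32):.2f} ms")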
I've tried to tune the involved FP16 configs with MIOpen. In the ROCm 3.5 docker container, before tuning:
epoch train time train loss train acc valid time valid loss valid acc total time
1 37.1966 1.6711 0.3951 20.2976 1.6296 0.4529 38.8472
2 37.3670 0.9893 0.6476 2.8158 2.1502 0.4324 76.2142
After tuning:
epoch train time train loss train acc valid time valid loss valid acc total time
1 29.4429 1.6631 0.3956 1.5667 1.2477 0.5471 31.0917
2 29.5801 0.9720 0.6527 1.5812 1.9774 0.4506 60.6717
There is some improvement, but not that huge. If you'd like to try it, set the following environment variable and execute the MIOpenDriver commands I posted in the previous comment from the /opt/rocm/miopen folder:
export MIOPEN_FIND_ENFORCE=4
The retuned perfDB will be saved under the ~/.config/miopen folder after re-tuning.
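The same variable should also take effect when the convolutions are triggered from PyTorch itself, since MIOpen reads it in whatever client process runs the find step. A hedged sketch (the conv shape is one of the configs listed earlier; everything else is illustrative):
import os

# MIOPEN_FIND_ENFORCE has to be set before MIOpen's first find() call, i.e. before
# the first convolution runs, so set it before any GPU work.
os.environ["MIOPEN_FIND_ENFORCE"] = "4"

import torch

# Running the workload's conv shapes once in this mode should trigger the exhaustive
# search and update the per-user perfDB under ~/.config/miopen.
x = torch.randn(512, 64, 32, 32, device="cuda", dtype=torch.float16)
conv = torch.nn.Conv2d(64, 128, 3, padding=1, bias=False).cuda().half()
conv(x)
torch.cuda.synchronize()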
Hi, thanks all for the recommendations. Switching to fp32 does indeed seem to speed things up. Dawn.py (21 sec / step):
epoch train time train loss train acc valid time valid loss valid acc total time
1 20.5358 1.6424 0.4103 11.0456 1.2769 0.5449 21.9777
2 20.6112 0.9497 0.6602 0.8756 1.0673 0.6319 42.5889
3 21.3357 0.7417 0.7379 0.9663 0.8248 0.7225 63.9246
4 21.9944 0.6333 0.7788 1.1072 0.7426 0.7515 85.9190
5 22.0416 0.5572 0.8077 1.1083 0.6171 0.7865 107.9606
6 22.1669 0.5049 0.8271 1.1044 0.5599 0.8052 130.1274
7 22.0657 0.4467 0.8460 1.1175 0.5430 0.8118 152.1931
8 22.0669 0.4083 0.8599 1.1080 0.4409 0.8470 174.2600
9 22.1657 0.3880 0.8681 1.1199 0.5545 0.8077 196.4257
10 22.1372 0.3619 0.8751 1.1281 0.4469 0.8509 218.5629
11 22.0678 0.3465 0.8827 1.1284 0.5636 0.8094 240.6307
12 22.1248 0.3281 0.8883 1.1155 0.3705 0.8764 262.7555
13 22.1635 0.3088 0.8947 1.1380 0.3938 0.8673 284.9190
14 22.1789 0.2901 0.9017 1.1258 0.3813 0.8675 307.0979
15 22.1483 0.2778 0.9060 1.1149 0.4508 0.8555 329.2462
16 22.0802 0.2504 0.9160 1.1180 0.3513 0.8813 351.3264
17 22.1499 0.2349 0.9208 1.1190 0.3639 0.8768 373.4763
18 22.1257 0.2092 0.9307 1.1353 0.3258 0.8863 395.6020
19 22.1024 0.1891 0.9364 1.1229 0.3602 0.8813 417.7044
20 22.1497 0.1647 0.9457 1.1333 0.2721 0.9102 439.8542
21 22.1744 0.1432 0.9529 1.1308 0.2555 0.9187 462.0285
22 22.1492 0.1162 0.9632 1.1272 0.2222 0.9261 484.1777
23 22.2179 0.0963 0.9715 1.1313 0.1984 0.9350 506.3957
24 22.1468 0.0774 0.9783 1.1300 0.1819 0.9385 528.5425
New_dawn.py (15 sec / step):
epoch train time train loss train acc valid time valid loss valid acc total time
1 15.5448 1.7815 0.4852 1.4130 2.2316 0.2021 15.9537
2 14.7322 1.3284 0.7671 1.4118 1.5720 0.7206 30.6859
3 14.8322 1.2203 0.8266 1.4755 1.1691 0.8712 45.5182
4 15.7673 1.1612 0.8572 1.4851 1.0926 0.8959 61.2855
5 15.9183 1.1168 0.8814 1.7515 1.0578 0.9112 77.2038
6 16.5861 1.0786 0.9017 1.8752 1.0345 0.9212 93.7899
7 16.6969 1.0445 0.9203 1.8199 1.0193 0.9262 110.4869
8 16.7539 1.0023 0.9431 1.8484 1.0055 0.9315 127.2407
9 16.7829 0.9681 0.9612 1.8855 0.9964 0.9368 144.0236
10 16.8777 0.9547 0.9683 1.8347 0.9902 0.9400 160.9012
A pleasant surprise was that fp32 seems to converge faster than fp16 toward the 94% top-1 accuracy target.
Will there eventually be support for faster fp16 miopen kernels on gfx900?
Is there any progress?