
OpenCL counterpart of cuDNN

Open dagamayank opened this issue 8 years ago • 53 comments

I came across your post on the TensorFlow thread saying that you are developing an OpenCL counterpart to cuDNN. I would like to help/contribute to that project. Let me know where and how I can help. I have extensive OpenCL programming experience and am currently focused on ML activities at AMD.

dagamayank avatar May 25 '16 03:05 dagamayank

@dagamayank Thank you, help is very welcome, especially from AMD :) To start, you can have a look at how the kernels are generated and the public interface of the cuDNN replacement: https://github.com/naibaf7/caffe/blob/master/src/caffe/greentea/libdnn.cpp https://github.com/naibaf7/caffe/blob/master/include/caffe/greentea/libdnn.hpp

I can also provide you example kernel strings if you don't want to look at that part of the code and are only interested in providing help on optimizing the kernels for AMD GPUs, which would also be very welcome.

naibaf7 avatar May 25 '16 07:05 naibaf7

@naibaf7 Have you seen the latest updates on the TensorFlow thread?

bhack avatar May 25 '16 07:05 bhack

@bhack Yes, why? :)

naibaf7 avatar May 25 '16 07:05 naibaf7

Because I think your work could fit well in https://docs.google.com/spreadsheets/d/1YbHn7dAFPPG_PgTtgCJlWhMGorUPYsF681TsZ4Y4LP0/edit?usp=sharing

bhack avatar May 25 '16 07:05 bhack

@naibaf7 Kernel strings would be great to have. Also, if you can provide some steps on how to get started that would be great.

dagamayank avatar May 26 '16 03:05 dagamayank

@dagamayank OK, the easiest way to get started is to compile Caffe with USE_LIBDNN turned on in Makefile.config (https://github.com/naibaf7/caffe/blob/master/Makefile.config.example#L15). Then, if you want to get a kernel string to examine for optimization purposes, uncomment this line (the commented-out std::cout at the end of the snippet below):

  ss << generate_bw_defs();
  ss << generate_bw_kernels("conv_backward");
  ss << generate_wg_defs();
  ss << generate_wg_kernels("conv_weights");

  // Write complete kernel string
  kernel_ = ss.str();

  // std::cout << kernel_ << std::endl;
}

(it's line https://github.com/naibaf7/caffe/blob/master/src/caffe/greentea/libdnn.cpp#L1588)

This will print the kernel string to std::cout so you can examine it, for example in AMD's GPUOpen CodeXL. Every kernel string consists of 3 main kernels: conv_forward, conv_backward and conv_weights. For conv_backward and conv_weights, there are 2 different algorithms each that can be selected:

typedef enum {
  // Stack the batch update into one GEMM block
  // (deterministic, 1 kernel call)
  // Serializes the batch and may therefore underutilize
  // the GPU's compute units.
  LIBDNN_CONVOLUTION_WG_ALGO_DIRECT        = 0,
  // Use multiple GEMM blocks in parallel and update weights atomically
  // (non-deterministic, 1 kernel call, not supported on all devices)
  // Parallelizes the batch and therefore achieves higher GPU usage.
  LIBDNN_CONVOLUTION_WG_ALGO_ATOMIC        = 1,
  // Use multiple GEMM blocks and an intermediate buffer
  // to reduce weight updates
  // (deterministic, >= 2 kernel calls)
  // Parallelizes the batch and therefore achieves higher GPU usage.
  // NOT IMPLEMENTED YET
  LIBDNN_CONVOLUTION_WG_ALGO_REDUCTION     = 2
} libdnnConvolutionWeightAlgo_t;

typedef enum {
  // Transform data before GEMM (load, im2col, gemm, store)
  // This method is suitable for convolutions with similar
  // spatial input == output sizes, but can become inefficient
  // if input >> output (with large strides and kernels).
  LIBDNN_CONVOLUTION_BW_ALGO_IM2COL        = 0,
  // Transform data after GEMM (load, gemm, col2im, store)
  // Sometimes faster than the im2col method, but uses
  // atomic operations and is not deterministic.
  LIBDNN_CONVOLUTION_BW_ALGO_COL2IM_ATOMIC = 1
} libdnnConvolutionBackwardAlgo_t;

Which algorithm is used can be changed here: https://github.com/naibaf7/caffe/blob/master/src/caffe/layers/libdnn_conv_layer.cpp#L63

Finally, you need to run a network in order to instantiate the layers and get some kernel strings. The recommended starting point for that is using the following command:

./build/tools/caffe time -model models/bvlc_alexnet/benchmark64.prototxt -gpu=0 -iterations=5

Together with the instructions above, this lets you dump the kernel strings to a text file and look for optimization possibilities. Note that every convolution layer gets its own set of kernels, so the above command will give you many different ones.
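For example (assuming the std::cout line above has been uncommented), redirecting standard output captures all the dumped kernel sources in one file; Caffe's glog messages typically go to stderr, so they should not end up in the file:

./build/tools/caffe time -model models/bvlc_alexnet/benchmark64.prototxt -gpu=0 -iterations=5 > libdnn_kernels.cl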

naibaf7 avatar May 26 '16 22:05 naibaf7

@naibaf7 Thanks a lot for these instructions. I will give them a try and report back.

dagamayank avatar May 27 '16 13:05 dagamayank

I get test failures when running "make runtest" on the code in the master branch of your repo. Is this expected? Two of the failures are from libDNN. My development environment is an AMD W9100 on Ubuntu 14.04.

[----------] Global test environment tear-down
[==========] 2028 tests from 274 test cases ran. (3614992 ms total)
[ PASSED ] 2013 tests.
[ FAILED ] 15 tests, listed below:
[ FAILED ] NetTest/0.TestSharedWeightsUpdate, where TypeParam = caffe::CPUDevice
[ FAILED ] LibDNNComparativeTest/0.TestBackward, where TypeParam = float
[ FAILED ] LibDNNComparativeTest/1.TestBackward, where TypeParam = double
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial11x11x1x2_caffenet_Conv1, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3x1_caffenet_Conv4, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestGradient_Spatial, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3x1_caffenet_Conv3, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3x2_caffenet_Conv5, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial5x5x1x2_caffenet_Conv2, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.Test1x1Convolution_Spatial, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.Test1x1Gradient_Spatial, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3xPad1, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial5x5, where TypeParam = caffe::GPUDevice

dagamayank avatar May 31 '16 19:05 dagamayank

@dagamayank TestSharedWeightsUpdate seems to fail by being off by a small margin. This is weird but can be ignored and is not relevant for this implementation.

The _Spatial failures are from Intel's convolution implementation. I think the fix here is to use the latest ViennaCL development branch: https://github.com/viennacl/viennacl-dev instead of what Ubuntu supplies.

As for libDNN, this test should definitely not fail. Here it would be helpful to get the failure message from the runtest itself (i.e. where the libDNN runtest aborted). You can test this in detail by using:

./build/test/test_all.testbin --gtest_filter=*LibDNN*Comparative*Backward* 0

naibaf7 avatar May 31 '16 22:05 naibaf7

@naibaf7 Well, I do not clearly understand the output; there are a bunch of lines with values, but the last few lines are:

Error count: 134841/159600
Difference: 3.17333e+06 (value: 2.30564e+06 vs 2.2954e+06)
src/caffe/test/test_libdnn_conv.cpp:1064: Failure
Value of: false
Expected: failure
Which is: true
[ FAILED ] LibDNNComparativeTest/1.TestBackward, where TypeParam = double (11638 ms)
[----------] 1 test from LibDNNComparativeTest/1 (11638 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test cases ran. (37154 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 2 tests, listed below:
[ FAILED ] LibDNNComparativeTest/0.TestBackward, where TypeParam = float
[ FAILED ] LibDNNComparativeTest/1.TestBackward, where TypeParam = double

dagamayank avatar Jun 01 '16 02:06 dagamayank

@dagamayank I just verified on my W9100 that the backward pass is fine. What driver are you using? I'm using 15.302 (Crimson Edition 15.12 Linux 64 bit). I had problems with the old FirePro driver, so I switched to the Radeon driver.

Do you have any other OpenCL device to check if the backward pass passes the test?

naibaf7 avatar Jun 01 '16 07:06 naibaf7

@naibaf7 Yes, it is probably the old FirePro driver. If it works on your end with the newer driver, I think we can call it a non-issue for now.

I am going through the kernels right now. Can you explain the reason for the seemingly arbitrary values of the #defines? It will take some time for me to understand what you are doing there.

dagamayank avatar Jun 01 '16 13:06 dagamayank

@dagamayank The defines provide constants for the kernel, such as padding (v_p), stride (v_s), dilation (v_d) and image sizes (v_imsi, v_imso) in each dimension. Other defines configure the GEMM core (such as TSK, TSM, TSN, WPTM, WPTN, ...).

I put the values into defines rather than directly into the kernel string for better readability of the kernel itself (i.e. it is easier to see where a constant is used and why). As for documentation, all the values are explained in https://github.com/naibaf7/caffe/blob/master/src/caffe/greentea/libdnn.cpp (look for add_def, which is the C++ method I use to declare new kernel #defines).
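Purely for illustration (this is not actual LibDNN output; the exact macro names, per-dimension suffixes and values are produced by the add_def calls in libdnn.cpp and depend on the layer), the preamble of a generated kernel is a block of defines along these lines:

#define v_p 1      // padding
#define v_s 1      // stride
#define v_d 1      // dilation
#define v_imsi 27  // input spatial size
#define v_imso 27  // output spatial size
#define TSM 64     // GEMM tile size in M
#define TSN 64     // GEMM tile size in N
#define TSK 8      // GEMM tile size in K
#define WPTM 4     // work (output elements) per thread in M
#define WPTN 4     // work (output elements) per thread in N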

naibaf7 avatar Jun 01 '16 14:06 naibaf7

@naibaf7

Are you using autotuning to generate the values of those constants? In other words, will the constants be the same for different kernels and for different networks?

dagamayank avatar Jun 01 '16 17:06 dagamayank

@dagamayank Some of the values can be autotuned (such as WPTM, WPTN); others are defined by the convolution settings (such as v_p, v_s, v_d). However, the autotuner can't store the tuning results yet, so that's experimental. That means values such as WPTM and WPTN will be the same for every kernel/network at the moment, while v_p, v_s and v_d depend on what kind of convolution you choose (3x3 unpadded, 11x11 with stride, etc.). The image input/output sizes (v_imsi, v_imso) obviously depend on how big the image/feature maps are in the network.

I hope that helps.

naibaf7 avatar Jun 02 '16 00:06 naibaf7

@dagamayank Have you made any progress on this or is something too complicated?

naibaf7 avatar Jun 03 '16 14:06 naibaf7

@naibaf7 I have not had a chance to work on it yet. I am putting out some internal fires right now, but I will get to it soon. Auto-generated kernels are not the simplest to understand :)

dagamayank avatar Jun 03 '16 14:06 dagamayank

@dagamayank I understand. I will work on the project this weekend and hopefully have some improvements by Monday. One interesting thing I found is that I'm better off targeting TLP (thread-level parallelism) instead of ILP (instruction-level parallelism) on the AMD W9100, i.e. taking care not to use too many VGPRs on the AMD card (to get >= 4 waves in flight). On the nVidia card (GTX 980) it was better to push for high ILP (use more #pragma unroll) and relax on occupancy/TLP. I would be interested in your opinion on this, and whether these assumptions are right...

Using vectors of size 4 and 16x16 thread blocks (64x64xTSK shared memory tiling) seems to work best on both cards so far though.
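(A note on how those numbers relate, assuming WPTM and WPTN denote the outputs computed per thread: a 16x16 work-group with WPTM = WPTN = 4 covers 16 * 4 = 64 results in each dimension, which is where the 64x64xTSK shared memory tile comes from.)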

naibaf7 avatar Jun 03 '16 14:06 naibaf7

@naibaf7 In my experience, using fewer registers is generally a better choice on AMD GPUs. This improves occupancy and lets the compiler generate better code.

One question I had: do I have to run the entire AlexNet, or can I just run the 1st convolution layer using CIFAR-10? What kind of performance are you seeing right now?

dagamayank avatar Jun 03 '16 14:06 dagamayank

@dagamayank You can remove the layers after the 1st convolution in the prototxt file, or start with any other convolution as long as you have the input data defined and connected correctly. However, the first convolution is usually not the most interesting one, as it has only a few input feature maps. Performance-wise, on the AlexNet forward pass I see these numbers (batch size 64; all untuned in the default configuration, so there should be plenty of headroom):

  • GTX 980 cuDNN forward: 34ms
  • GTX 980 libDNN forward (CUDA): 70ms
  • GTX 980 libDNN forward (OpenCL): 90ms
  • W9100 libDNN forward (OpenCL): 100ms (although here you may see 130ms on the code that you have, I improved the memory access pattern since then. I get this performance at 5 waves in flight.).
  • GTX 980 cuBLAS forward: 110ms
  • GTX 980 clBLAS forward: 184ms
  • W9100 clBLAS forward: 275ms

The clBLAS forward performance in particular is extremely poor, which was my main motivation to create libDNN. At this stage, libDNN beats cuBLAS-based implementations. The goal is to get within 70-80% of cuDNN.

naibaf7 avatar Jun 03 '16 15:06 naibaf7

@dagamayank LibDNN is now available as a standalone library: https://github.com/naibaf7/libdnn

naibaf7 avatar Jun 25 '16 02:06 naibaf7

@naibaf7 I am very interested in LibDNN. It performs well. Since I am not familiar with OpenCL, I have only glanced over LibDNN, and it seems that it also uses matrix multiplication. If possible, could you tell me whether it works on the same principle as cuDNN? Or could you point me to references such as a paper or documentation? Thank you.

zazd avatar Jul 02 '16 03:07 zazd

@zazd Yes, it uses a local-memory and register-level GEMM. It is similar to cuDNN; you can read more here: https://arxiv.org/pdf/1410.0759.pdf
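To illustrate the principle (this is a minimal CPU-only sketch, not LibDNN code): the convolution is lowered to a matrix multiplication via an im2col transform. LibDNN and cuDNN follow the same idea on the GPU, but fuse the transform into a tiled GEMM that stages data in local memory and registers.

#include <cstdio>
#include <vector>

int main() {
  const int C = 1, H = 4, W = 4;                    // input channels, height, width
  const int K = 3, OH = H - K + 1, OW = W - K + 1;  // 3x3 kernel, no padding, stride 1
  std::vector<float> image(C * H * W), filter(C * K * K, 1.0f / 9.0f);
  for (int i = 0; i < C * H * W; ++i) image[i] = static_cast<float>(i);

  // im2col: each output pixel becomes one column holding its receptive field.
  std::vector<float> cols(C * K * K * OH * OW);
  for (int c = 0; c < C; ++c)
    for (int ky = 0; ky < K; ++ky)
      for (int kx = 0; kx < K; ++kx)
        for (int oy = 0; oy < OH; ++oy)
          for (int ox = 0; ox < OW; ++ox)
            cols[(((c * K + ky) * K + kx) * OH + oy) * OW + ox] =
                image[(c * H + oy + ky) * W + ox + kx];

  // GEMM: a (1 x C*K*K) filter row times the (C*K*K x OH*OW) column matrix.
  std::vector<float> out(OH * OW, 0.0f);
  for (int k = 0; k < C * K * K; ++k)
    for (int p = 0; p < OH * OW; ++p)
      out[p] += filter[k] * cols[k * OH * OW + p];

  for (int p = 0; p < OH * OW; ++p) printf("%.2f ", out[p]);  // 5.00 6.00 9.00 10.00
  printf("\n");
  return 0;
}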

naibaf7 avatar Jul 02 '16 06:07 naibaf7

@bhack @gstoner Good news for the RX 480: the performance issues and thermal-envelope crashes have been completely fixed since the Linux kernel 4.8 AMDGPU drivers. It is now possible to use the RX 480 for deep learning without limitations on any Linux distribution :)

With LibDNN on both the GTX 1080 and the RX 480, the RX 480 performs exactly half as fast as the GTX 1080, just as expected.

naibaf7 avatar Oct 19 '16 12:10 naibaf7

Do you have v2 kernels?

bhack avatar Oct 19 '16 12:10 bhack

@bhack I have not ported them to the external library yet... I'm quite busy with a new project regarding sparse RNNs at the moment :) Let me know if you need something though. This was just a heads-up because the RX 480 did not work well at all for the past 3 months.

naibaf7 avatar Oct 19 '16 13:10 naibaf7

@naibaf7 It is hard to talk about this topic... We are actually the only ones that use libdnn as upstream :wink:. It would be nice if Caffe could use libdnn as upstream naturally, instead of keeping libdnn downstream. /cc @edgarriba

bhack avatar Oct 19 '16 13:10 bhack

@bhack Yeah, last week Codeplay's CEO contacted me regarding some stuff in OpenCL TensorFlow. If he expresses interest as well, I will definitely refocus more on the standalone libdnn. But I haven't heard back (yet).

naibaf7 avatar Oct 19 '16 13:10 naibaf7

I also think that @hughperkins could be interested in the standalone upstream.

bhack avatar Oct 19 '16 13:10 bhack

@naibaf7 do you have Winograd kernels in libDNN?

dagamayank avatar Oct 19 '16 14:10 dagamayank