libdnn
integration to tiny-cnn
@naibaf7 @bhack
I'm opening this ticket to discuss ideas for integrating libdnn into tiny-cnn.
Currently, I implemented a small interface to get native OpenCL context from tiny-cnn: https://github.com/edgarriba/tiny-cnn/blob/f4d9e1d4f45ad8ac46824b5c007f5507fe8925eb/tiny_cnn/core/session.h
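For reference, handing a "native OpenCL context" to libDNN ultimately means the standard platform/device/context setup from the OpenCL C API. A minimal sketch of that setup (independent of the actual session.h code; the function name is illustrative only):

```cpp
// Minimal sketch: creating a native OpenCL context that a wrapper such as
// tiny-cnn's session could pass on to libDNN. Error handling is reduced to
// asserts; a real implementation would report failures properly.
#include <CL/cl.h>
#include <cassert>

cl_context create_native_context(cl_device_id* out_device) {
  cl_platform_id platform = nullptr;
  cl_uint num_platforms = 0;
  cl_int err = clGetPlatformIDs(1, &platform, &num_platforms);
  assert(err == CL_SUCCESS && num_platforms > 0);

  cl_device_id device = nullptr;
  cl_uint num_devices = 0;
  err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, &num_devices);
  assert(err == CL_SUCCESS && num_devices > 0);

  cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
  assert(err == CL_SUCCESS);

  *out_device = device;
  return context;  // the caller releases it with clReleaseContext()
}
```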
Things I think that are needed:
- Implement a module for data transfer between devices
- Discuss the shape of a simplified libdnn interface, if one is needed.
BTW, @naibaf7 @hughperkins notice that we are planning to migrate tiny-cnn to an organization account and rename the library itself, since it is now more a general DNN lib than just a CNN one. Maybe you are interested in getting more involved in the development. https://github.com/nyanp/tiny-cnn/issues/235
@edgarriba Ok, I'll look into it. What is your idea about the data transfer and simplified interface?
@naibaf7 For memory syncing I was thinking of an approach similar to TensorFlow's. What do you think is the best way to start? https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/copy_tensor.cc#L46
Regarding the simplified interface, it's just to make clear what types of data the libdnn interfaces will need.
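For the data-transfer part, a TF-style copy staged through host memory is probably the simplest portable starting point when the two buffers live in different OpenCL contexts or devices. A minimal sketch (names are illustrative only, not an existing tiny-cnn or libDNN API):

```cpp
// Sketch of a device-to-device buffer copy staged through the host, in the
// spirit of TensorFlow's copy_tensor.cc. Assumes both buffers hold at least
// `bytes` bytes; the two queues may belong to different contexts/devices.
#include <CL/cl.h>
#include <vector>

cl_int copy_buffer_via_host(cl_command_queue src_queue, cl_mem src,
                            cl_command_queue dst_queue, cl_mem dst,
                            size_t bytes) {
  std::vector<unsigned char> staging(bytes);
  // Blocking read from the source device...
  cl_int err = clEnqueueReadBuffer(src_queue, src, CL_TRUE, 0, bytes,
                                   staging.data(), 0, nullptr, nullptr);
  if (err != CL_SUCCESS) return err;
  // ...followed by a blocking write to the destination device.
  return clEnqueueWriteBuffer(dst_queue, dst, CL_TRUE, 0, bytes,
                              staging.data(), 0, nullptr, nullptr);
}
```

A zero-copy or peer-to-peer path could be added later for buffers that share a context, but the host-staged version is the easy-to-verify baseline.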
@naibaf7 What do you think of this design?
@bhack Not bad, but essentially no different from functions already existing in either the ViennaCL or the Caffe code. It's really unfortunate that tiny-cnn can't handle any GPU-related management itself at the moment, as both cuDNN and libDNN expect proper GPU & memory initialization beforehand (and I oriented the libDNN interface closely to the cuDNN one).
What do you want to be our responsibilities? Device, memory and context? Kernel launch?
@bhack If you want compatibility with typical GPU libraries, then optimally device, memory and context. Kernel launches are handled internally by both cuDNN and libDNN, because they require tricky parameter selection.
But if you can wait two days, I will update libDNN with simplifications for the device, memory and context part (mostly copied over from Caffe): essentially just wrappers around CUDA and ViennaCL for memory allocation, device listing & initialization, and memory transfer.
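To make the split of responsibilities concrete, the surface being discussed is roughly the following (a purely hypothetical interface; none of these names exist in libDNN or Caffe, they only sketch what "device, memory and context" wrappers would cover):

```cpp
// Hypothetical sketch of the device/memory/context responsibilities under
// discussion. Kernel launches are deliberately absent: cuDNN and libDNN
// handle launch-parameter selection internally.
#include <cstddef>
#include <vector>

class DeviceWrapper {  // hypothetical name, for illustration only
 public:
  virtual ~DeviceWrapper() = default;
  // Device listing & initialization.
  virtual std::vector<int> list_devices() = 0;
  virtual void init_device(int id) = 0;
  // Memory allocation and host<->device transfer.
  virtual void* malloc_device(std::size_t bytes) = 0;
  virtual void free_device(void* ptr) = 0;
  virtual void copy_to_device(void* dst, const void* src, std::size_t bytes) = 0;
  virtual void copy_to_host(void* dst, const void* src, std::size_t bytes) = 0;
};
```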
We have no problem handling device, memory and context. If you think it would be useful to have these features here, fine; if not, we will implement them in tiny-cnn. We need to think about the largest audience for this library, so if you think callers generally benefit from handling context, device and memory themselves for other kinds of operations, it is fine for us to implement that in tiny-cnn and go ahead with porting coverage of the other kernels here.
@naibaf7 I leave here the error report I get: http://pastebin.com/yv6rmu21
@edgarriba What steps do you use to test the integration (just so I can replicate it)? The error looks familiar from early OpenCL Caffe tests, so it should be something I can fix.
with this branch: https://github.com/edgarriba/tiny-cnn/tree/libdnn
cmake -DBUILD_TESTS=ON -DUSE_OPENCL=ON -DUSE_LIBDNN=ON -DUSE_TBB=ON ..
make && ./test/tiny_cnn_test
important routines: https://github.com/edgarriba/tiny-cnn/blob/libdnn/tiny_cnn/core/backend_dnn.h#L76 https://github.com/edgarriba/tiny-cnn/blob/libdnn/tiny_cnn/core/kernels/libdnn_conv2d_kernel.h#L54
@edgarriba Ok, I have gained a good understanding of what you are trying to do and what issues the code has. I filed a pull request on your tiny-cnn libdnn branch which fixes some of the issues and explains some details of the problems that still exist.
If you need more assistance fixing these, or want to discuss it in more detail, we can schedule a Skype conversation with screen sharing and review the code that way.
By the way, a random observation, not sure if this is useful or not: if you use the clew abstraction layer, you can build for OpenCL without libOpenCL.so being present, and you can even run your program without libOpenCL.so being present. Your program can make a runtime decision about whether to try binding to OpenCL or not. Actually, I think it can even attempt to call clewInit(), and thus detect whether libOpenCL.so is present, and deal with it accordingly. (Not sure about this last sentence; it would be easy to fix if it's not actually the case.)
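A minimal sketch of that runtime decision, assuming the no-argument clewInit() of hughperkins/clew and that it returns 0 when libOpenCL.so could be loaded (both assumptions; check clew.h for the exact contract):

```cpp
// Sketch: deciding at runtime whether to enable the OpenCL backend.
// Assumption: clewInit() returns 0 when an OpenCL library was found and its
// symbols were bound; any other value means "no usable OpenCL runtime".
#include "clew.h"
#include <cstdio>

int main() {
  if (clewInit() == 0) {
    std::printf("OpenCL runtime found, enabling the GPU backend\n");
    // ...create a context, hand it to libDNN, etc.
  } else {
    std::printf("No libOpenCL.so found, falling back to the CPU backend\n");
  }
  return 0;
}
```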
@hughperkins Intriguing! Does that also work with CUDA? Because it would be really useful if there could be a universal binary that has no fixed dependencies beyond C++11 :) What about multi-location search for libOpenCL.so and nvrtc.so/cuda.so? Can it figure out these different locations?
Does that also work with CUDA?
No
Because it would be really useful if there could be a universal binary that has no fixed dependencies beyond C++11 :)
:-) Well, I guess someone could create something analogous for CUDA. Maybe... you? :-)
What about multi-location search for libOpenCL.so and nvrtc.so/cuda.so? Can it figure out these different locations?
For libOpenCL.so, it searches in a couple of hard-coded locations:
https://github.com/hughperkins/clew/blob/master/src/clew.c#L180-L184
```c
module = CLEW_DYNLIB_OPEN("OpenCL.dll");
if(module == 0) module = CLEW_DYNLIB_OPEN("/Library/Frameworks/OpenCL.framework/OpenCL");
if(module == 0) module = CLEW_DYNLIB_OPEN("libOpenCL.so");
if(module == 0) module = CLEW_DYNLIB_OPEN("libOpenCL.so.1");
if(module == 0) module = CLEW_DYNLIB_OPEN("/usr/lib/libOpenCL.so");
if(module == 0) module = CLEW_DYNLIB_OPEN("/usr/lib/libOpenCL.so.1");
```
You can see that there is no reason why this couldn't be extended arbitrarily and/or read from some config file. But this covers almost all (perhaps all?) of the cases I've seen, I think.
@hughperkins Yeah, I see... yes, maybe you could add possible locations for SDKs, such as AMD's APP SDK libOpenCL.so and NVIDIA's libOpenCL.so within CUDA :) This also works fine with the OpenCL ICD loader, I assume?
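Concretely, those additions would just be a few more entries in the chain quoted above; the exact paths vary per distribution and SDK version, so treat these as illustrative only:

```c
/* Illustrative additions only; actual install paths differ per system. */
if(module == 0) module = CLEW_DYNLIB_OPEN("/usr/lib/x86_64-linux-gnu/libOpenCL.so.1"); /* Debian/Ubuntu */
if(module == 0) module = CLEW_DYNLIB_OPEN("/opt/AMDAPP/lib/x86_64/libOpenCL.so");      /* AMD APP SDK   */
if(module == 0) module = CLEW_DYNLIB_OPEN("/usr/local/cuda/lib64/libOpenCL.so");       /* CUDA toolkit  */
```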
https://github.com/CudaWrangler/cuew
Wow, you are just a mine of useful information, bhack. It's amazing. Are you an AI? :-P
@bhack @hughperkins More importantly, how does bhack know that much and find the time to respond to everything? :) Amazing... but yeah, oh well, cuew and clew are two more items to integrate into libDNN then... an endless list of work :)
This also works fine with the OpenCL ICD loader I assume?
yes. the sequence is:
clew => loads libOpenCL.so => reads the ICD => loads vendor drivers => loads device information, etc.
@edgarriba Can you post the rendered UML image of the integration proposal?
I don't think Caffe is the right testbed for libdnn, because libdnn was in some sense designed around Caffe. So if you want to give some feedback on Edgar's design...
http://uml.mvnsearch.org/gist/df8ae6e567cb85ea62e7d20a8081f89a
feedback is welcome!
Another interesting roadmap to monitor is https://phabricator.pmoreau.org/w/mesa/opencl_through_spirv_current_status/
@CNugteren I think that you could be interested in the last messages.
As you can see, we have started to use CLCudaAPI for tiny-cnn while also integrating libdnn for the convolution kernels. @CNugteren said he was interested in contributing to libdnn. Through this GSoC integration experiment I see a little bit of feature duplication: libdnn maintains its own tuning class while Cedric has CLTune; libdnn uses ViennaCL as its OpenCL/CUDA helper while Cedric has CLCudaAPI; libdnn takes some algebra from ViennaCL while Cedric takes it from CLBlast. Is there a bit of duplicated effort here?
@bhack Yes, there certainly is a little bit of duplication, but I don't think that's an issue. Inside Caffe I even use the duplication to my advantage and offer clBLAS, CLBlast and ViennaCL for the linear algebra part. Especially for the algebra, different libraries perform well or poorly on different devices.
Can't we share the tuning component? Also, I think CLCudaAPI is neutral enough to use both ViennaCL and CLBlast algebra.
@bhack No, the way code generation works in libDNN is too different from the CLTune approach, I think. And yes, when you use CLCudaAPI you can of course also make use of multiple BLAS libraries.
I think naibaf7 is right: libDNN generates its OpenCL kernels from C++ code, whereas CLTune assumes that you have already written a complete and functioning OpenCL kernel with its tuning parameters exposed through the preprocessor (#define and #if / #else / #endif). Rewriting libDNN in that style is of course possible, but if the tuning already works as it is now, why bother?
Using CLCudaAPI is orthogonal to using CLTune, of course. It is just a way to make using OpenCL from C++ a lot friendlier, with the added bonus of making a port to CUDA as easy as changing a header.
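To illustrate the difference: a kernel written "the CLTune way" is already complete and correct, and the tuner only flips preprocessor symbols when it recompiles it, whereas libDNN assembles the kernel string in C++ at runtime. A hypothetical example of the former (the symbols WPT and USE_VECTOR_LOADS are made up for this sketch):

```c
// Hypothetical OpenCL kernel in the CLTune style: the tuning knobs WPT
// (work per thread) and USE_VECTOR_LOADS are plain preprocessor symbols
// that a tuner can redefine between compilations.
#ifndef WPT
#define WPT 1
#endif

__kernel void scale(__global const float* x, __global float* y, const float alpha) {
  const int base = get_global_id(0) * WPT;
#if USE_VECTOR_LOADS == 1 && WPT == 4
  // Tuned variant: one float4 load/store per work-item.
  vstore4(alpha * vload4(0, x + base), 0, y + base);
#else
  // Generic variant: WPT scalar operations per work-item.
  for (int i = 0; i < WPT; ++i) {
    y[base + i] = alpha * x[base + i];
  }
#endif
}
```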