libdnn
integration to tiny-cnn
@naibaf7 @bhack
I'm opening this ticket to discuss ideas for integrating libdnn into tiny-cnn.
Currently, I implemented a small interface to get native OpenCL context from tiny-cnn: https://github.com/edgarriba/tiny-cnn/blob/f4d9e1d4f45ad8ac46824b5c007f5507fe8925eb/tiny_cnn/core/session.h
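For reference, handing a "native OpenCL context" to libDNN ultimately means the standard platform/device/context setup from the OpenCL C API. A minimal sketch of that setup (independent of the actual session.h code; the function name is illustrative only):

```cpp
// Minimal sketch: creating a native OpenCL context that a wrapper such as
// tiny-cnn's session could pass on to libDNN. Error handling is reduced to
// asserts; a real implementation would report failures properly.
#include <CL/cl.h>
#include <cassert>

cl_context create_native_context(cl_device_id* out_device) {
  cl_platform_id platform = nullptr;
  cl_uint num_platforms = 0;
  cl_int err = clGetPlatformIDs(1, &platform, &num_platforms);
  assert(err == CL_SUCCESS && num_platforms > 0);

  cl_device_id device = nullptr;
  cl_uint num_devices = 0;
  err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, &num_devices);
  assert(err == CL_SUCCESS && num_devices > 0);

  cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
  assert(err == CL_SUCCESS);

  *out_device = device;
  return context;  // the caller releases it with clReleaseContext()
}
```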
Things I think that are needed:
- Implement a module for data transfer between devices
- Discuss the shape of a simplified libdnn interface, if one is needed.
BTW, @naibaf7 @hughperkins notice that we are planning to migrate tiny-cnn to an organization account and rename the library itself, since it is now more a general DNN lib than just a CNN one. Maybe you are interested in getting more involved in the development. https://github.com/nyanp/tiny-cnn/issues/235
@edgarriba Ok, I'll look into it. What is your idea about the data transfer and simplified interface?
@naibaf7 For memory syncing I was thinking of an approach similar to TensorFlow's. What do you think is the best way to start? https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/copy_tensor.cc#L46
Regarding the simplified interface, it's just to make clear what types of data the libdnn interfaces will need.
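For the data-transfer part, a TF-style copy staged through host memory is probably the simplest portable starting point when the two buffers live in different OpenCL contexts or devices. A minimal sketch (names are illustrative only, not an existing tiny-cnn or libDNN API):

```cpp
// Sketch of a device-to-device buffer copy staged through the host, in the
// spirit of TensorFlow's copy_tensor.cc. Assumes both buffers hold at least
// `bytes` bytes; the two queues may belong to different contexts/devices.
#include <CL/cl.h>
#include <vector>

cl_int copy_buffer_via_host(cl_command_queue src_queue, cl_mem src,
                            cl_command_queue dst_queue, cl_mem dst,
                            size_t bytes) {
  std::vector<unsigned char> staging(bytes);
  // Blocking read from the source device...
  cl_int err = clEnqueueReadBuffer(src_queue, src, CL_TRUE, 0, bytes,
                                   staging.data(), 0, nullptr, nullptr);
  if (err != CL_SUCCESS) return err;
  // ...followed by a blocking write to the destination device.
  return clEnqueueWriteBuffer(dst_queue, dst, CL_TRUE, 0, bytes,
                              staging.data(), 0, nullptr, nullptr);
}
```

A zero-copy or peer-to-peer path could be added later for buffers that share a context, but the host-staged version is the easy-to-verify baseline.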
@naibaf7 What do you think of this design?
@bhack Not bad, but essentially no different from functions already existing in either the ViennaCL or the Caffe code. It's really unfortunate that tiny-cnn can't handle any GPU-related management itself at the moment, as both cuDNN and libDNN expect proper GPU & memory initialization beforehand (and I oriented the libDNN interface closely to the cuDNN one).
What do you want to be our responsibilities? Device, memory and context? Kernel launch?
@bhack If you want compatibility with typical GPU libraries, then optimally device, memory and context. Kernel launches are handled internally by both cuDNN and libDNN, because they require tricky parameter selection.
But if you can wait two days, I will update libDNN with simplifications for the device, memory and context part (mostly copied over from Caffe): essentially just wrappers around CUDA and ViennaCL for memory allocation, device listing & initialization, and memory transfer.
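To make the split of responsibilities concrete, the surface being discussed is roughly the following (a purely hypothetical interface; none of these names exist in libDNN or Caffe, they only sketch what "device, memory and context" wrappers would cover):

```cpp
// Hypothetical sketch of the device/memory/context responsibilities under
// discussion. Kernel launches are deliberately absent: cuDNN and libDNN
// handle launch-parameter selection internally.
#include <cstddef>
#include <vector>

class DeviceWrapper {  // hypothetical name, for illustration only
 public:
  virtual ~DeviceWrapper() = default;
  // Device listing & initialization.
  virtual std::vector<int> list_devices() = 0;
  virtual void init_device(int id) = 0;
  // Memory allocation and host<->device transfer.
  virtual void* malloc_device(std::size_t bytes) = 0;
  virtual void free_device(void* ptr) = 0;
  virtual void copy_to_device(void* dst, const void* src, std::size_t bytes) = 0;
  virtual void copy_to_host(void* dst, const void* src, std::size_t bytes) = 0;
};
```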
We have no problem handling device, memory and context. If you think it would be useful to have these features here, fine; if not, we will implement them in tiny-cnn. We need to think about the largest audience for this library, so if you think callers generally benefit from handling context, device and memory themselves for other kinds of operations, it is fine for us to implement that in tiny-cnn and go ahead with porting coverage of the other kernels here.
@naibaf7 I leave here the error report I get: http://pastebin.com/yv6rmu21
@edgarriba What steps do you use to test the integration (just so I can replicate it)? The error looks familiar from early OpenCL Caffe tests, so it should be something I can fix.
with this branch: https://github.com/edgarriba/tiny-cnn/tree/libdnn
cmake -DBUILD_TESTS=ON -DUSE_OPENCL=ON -DUSE_LIBDNN=ON -DUSE_TBB=ON ..
make && ./test/tiny_cnn_test
important routines: https://github.com/edgarriba/tiny-cnn/blob/libdnn/tiny_cnn/core/backend_dnn.h#L76 https://github.com/edgarriba/tiny-cnn/blob/libdnn/tiny_cnn/core/kernels/libdnn_conv2d_kernel.h#L54
@edgarriba Ok, I have gained a good understanding of what you are trying to do and what issues the code has. I filed a pull request on your tiny-cnn libdnn branch which fixes some of the issues and explains some details of the problems that still exist.
If you need more assistance fixing these, or want to discuss it in more detail, we can schedule a Skype conversation with screen sharing and review the code that way.
By the way, a random observation, not sure if this is useful or not: if you use the clew abstraction layer, you can build for OpenCL without libOpenCL.so being present, and you can even run your program without libOpenCL.so being present. Your program can make a runtime decision about whether to try binding to OpenCL or not. Actually, I think it can even attempt to call clewInit(), and thus detect whether libOpenCL.so is present, and deal with it accordingly. (Not sure about this last sentence; it would be easy to fix if it's not actually the case.)
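A minimal sketch of that runtime decision, assuming the no-argument clewInit() of hughperkins/clew and that it returns 0 when libOpenCL.so could be loaded (both assumptions; check clew.h for the exact contract):

```cpp
// Sketch: deciding at runtime whether to enable the OpenCL backend.
// Assumption: clewInit() returns 0 when an OpenCL library was found and its
// symbols were bound; any other value means "no usable OpenCL runtime".
#include "clew.h"
#include <cstdio>

int main() {
  if (clewInit() == 0) {
    std::printf("OpenCL runtime found, enabling the GPU backend\n");
    // ...create a context, hand it to libDNN, etc.
  } else {
    std::printf("No libOpenCL.so found, falling back to the CPU backend\n");
  }
  return 0;
}
```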
@hughperkins Intriguing! Does that also work with CUDA? Because it would be really useful if there could be a universal binary that has no fixed dependencies beyond C++11 :) What about multi-location search for libOpenCL.so and nvrtc.so/cuda.so? Can it figure out these different locations?
Does that also work with CUDA?
No
Because it would be really useful if there could be a universal binary that has no fixed dependencies beyond C++11 :)
:-) Well, I guess someone could create something analogous for CUDA. Maybe... you? :-)
What about multi-location search for libOpenCL.so and nvrtc.so/cuda.so? Can it figure out these different locations?
For libOpenCL.so, it searches in a couple of hard-coded locations:
https://github.com/hughperkins/clew/blob/master/src/clew.c#L180-L184
```c
module = CLEW_DYNLIB_OPEN("OpenCL.dll");
if(module == 0) module = CLEW_DYNLIB_OPEN("/Library/Frameworks/OpenCL.framework/OpenCL");
if(module == 0) module = CLEW_DYNLIB_OPEN("libOpenCL.so");
if(module == 0) module = CLEW_DYNLIB_OPEN("libOpenCL.so.1");
if(module == 0) module = CLEW_DYNLIB_OPEN("/usr/lib/libOpenCL.so");
if(module == 0) module = CLEW_DYNLIB_OPEN("/usr/lib/libOpenCL.so.1");
```
You can see that there is no reason why this couldn't be extended arbitrarily and/or read from some config file. But this covers almost all (perhaps all?) of the cases I've seen, I think.
@hughperkins Yeah, I see... yes, maybe you could add possible locations for SDKs, such as AMD's APP SDK libOpenCL.so and NVIDIA's libOpenCL.so within CUDA :) This also works fine with the OpenCL ICD loader, I assume?
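Concretely, those additions would just be a few more entries in the chain quoted above; the exact paths vary per distribution and SDK version, so treat these as illustrative only:

```c
/* Illustrative additions only; actual install paths differ per system. */
if(module == 0) module = CLEW_DYNLIB_OPEN("/usr/lib/x86_64-linux-gnu/libOpenCL.so.1"); /* Debian/Ubuntu */
if(module == 0) module = CLEW_DYNLIB_OPEN("/opt/AMDAPP/lib/x86_64/libOpenCL.so");      /* AMD APP SDK   */
if(module == 0) module = CLEW_DYNLIB_OPEN("/usr/local/cuda/lib64/libOpenCL.so");       /* CUDA toolkit  */
```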
https://github.com/CudaWrangler/cuew
Wow, you are just a mine of useful information, bhack. It's amazing. Are you an AI? :-P
@bhack @hughperkins More importantly, how does bhack know that much and find the time to respond to everything? :) Amazing... but yeah, oh well, cuew and clew are two more items to integrate into libDNN then... an endless list of work :)
This also works fine with the OpenCL ICD loader I assume?
yes. the sequence is:
clew => loads libOpenCL.so => reads the ICD => loads vendor drivers => loads device information, etc.
@edgarriba Can you post the rendered UML image of the integration proposal?
I don't think Caffe is the right testbed for libdnn, because libdnn was in some sense designed around Caffe. So if you want to give some feedback on Edgar's design...
http://uml.mvnsearch.org/gist/df8ae6e567cb85ea62e7d20a8081f89a
feedback is welcome!
Another interesting roadmap to monitor is https://phabricator.pmoreau.org/w/mesa/opencl_through_spirv_current_status/
@CNugteren I think that you could be interested in the last messages.
As you can see, we have started to use CLCudaAPI for tiny-cnn while also integrating libdnn for the convolution kernels. @CNugteren said he was interested in contributing to libdnn. Through this GSoC integration experiment I see a little bit of feature duplication: libdnn maintains its own tuning class while Cedric has CLTune; libdnn uses ViennaCL as its OpenCL/CUDA helper while Cedric has CLCudaAPI; libdnn takes some algebra from ViennaCL while Cedric takes it from CLBlast. Is there a bit of duplicated effort here?
@bhack Yes, there certainly is a little bit of duplication, but I don't think that's an issue. Inside Caffe I even use the duplication to my advantage and offer clBLAS, CLBlast and ViennaCL for the linear algebra part. Especially for the algebra, different libraries perform well or poorly on different devices.
Can't we share the tuning component? Also, I think CLCudaAPI is neutral enough to use both ViennaCL and CLBlast algebra.
@bhack No, the way code generation works in libDNN is too different from the CLTune approach, I think. And yes, when you use CLCudaAPI you can of course also make use of multiple BLAS libraries.
I think naibaf7 is right: libDNN generates its OpenCL kernels from C++ code, whereas CLTune assumes that you have already written a complete and functioning OpenCL kernel with its tuning parameters exposed through the preprocessor (#define and #if / #else / #endif). Rewriting libDNN in that style is of course possible, but if the tuning already works as it is now, why bother?
Using CLCudaAPI is orthogonal to using CLTune, of course. It is just a way to make using OpenCL from C++ a lot friendlier, with the added bonus of making a port to CUDA as easy as changing a header.
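To illustrate the difference: a kernel written "the CLTune way" is already complete and correct, and the tuner only flips preprocessor symbols when it recompiles it, whereas libDNN assembles the kernel string in C++ at runtime. A hypothetical example of the former (the symbols WPT and USE_VECTOR_LOADS are made up for this sketch):

```c
// Hypothetical OpenCL kernel in the CLTune style: the tuning knobs WPT
// (work per thread) and USE_VECTOR_LOADS are plain preprocessor symbols
// that a tuner can redefine between compilations.
#ifndef WPT
#define WPT 1
#endif

__kernel void scale(__global const float* x, __global float* y, const float alpha) {
  const int base = get_global_id(0) * WPT;
#if USE_VECTOR_LOADS == 1 && WPT == 4
  // Tuned variant: one float4 load/store per work-item.
  vstore4(alpha * vload4(0, x + base), 0, y + base);
#else
  // Generic variant: WPT scalar operations per work-item.
  for (int i = 0; i < WPT; ++i) {
    y[base + i] = alpha * x[base + i];
  }
#endif
}
```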