javacpp-presets
Merging with JCuda and JOpenCL projects for better quality cuda interfaces
@saudet Hi buddy, an idea came to me over the last few weeks: what about merging the CUDA and OpenCL stuff here with the work of the guys from the JCuda and JOpenCL projects? I understand there are some fundamental differences, but having more skilled devs on a single project could improve project quality as well.
The guys from JCuda opened discussion on my request here: https://forum.byte-welt.net/t/about-jcuda-and-javacpp/19538
So, if you think it could bring more value as well, you are free to join the discussion.
That would be nice, but the problem is that people expect Oracle to come up with a better solution than JavaCPP, even though they are not working on anything at the moment. As far as I can tell, the developers of Project Panama have given up on any generic solution for C++; no one knows how to make something better than JavaCPP. Still, they hope and believe, and wait, mostly. If you could help convince them that nothing better is going to happen, explaining and re-explaining over and over again how JavaCPP could get better, that would be the first thing that needs to be done.
Just for reference: https://github.com/jcuda/jcuda/issues/12#issuecomment-335010118
@saudet I am reading about it, and it actually started a long time ago (the Panama project). What is that project based on, JNI or something new?
Anyway, I registered on the project mailing list, but to be honest, I went through some links on the project site: some repository links are broken, and the blogs of the main creators/devs have not been updated in a long time... I will check more and read about this.
There is a lot happening in Panama (the project...) right now. Admittedly, although I'm registered to the mailing list, too much to follow it all in detail. However, if they manage to achieve the goals that they stated at the project site, http://openjdk.java.net/projects/panama/ , this would certainly compete with JavaCPP.
Of course, development there happens at a different pace. We all know that a "single-developer project" often can be far more agile than a company-driven project, where specifications and sustainability play a completely different role. Panama also approaches topics that go far beyond what can be accomplished by JavaCPP or JNI in general. They are really going down to the guts, and the work there is interwoven with the topics of Value Types, Vectorization and other HotSpot internals.
So I agree with saudet that it does not make sense to (inactively) "wait for a better solution". JavaCPP is an existing solution for (many of, but by no means all of) the goals that are addressed in Panama.
More generally speaking, the problem of fragmentation (in terms of different JNI bindings for the same library) has occurred quite frequently. One of the first "large" cases was OpenGL, with JOGL basically competing with LWJGL. For CUDA, there have been some very basic approaches, but none of them (except for JCuda) have really been maintained. When OpenCL popped up, a handful of Java bindings quickly appeared (some of them listed at jocl.org and in this stackoverflow answer), but I'm not sure how actively each of them is still used and maintained.
(OT: It has been a bit quiet around OpenCL in general recently. Maybe due to Vulkan, which also supports GPU computations? When Vulkan was published, I registered jvulkan.org, but the statement "Coming soon" is not true any more: There already is a vulkan binding in LWJGL, and the API is too complex to create manual bindings. There doesn't seem to be a Vulkan preset for JavaCPP, or did I overlook it?)
For me, as the maintainer of jcuda.org and jocl.org, one of the main questions about "merging" projects would be how this can be done "smoothly", without just abandoning one project in favor of the other. I have always tried to be backward compatible and "reliable", in that sense. Quite a while ago, I talked to one of the maintainers of Jogamp-JOCL about merging the Jogamp-JOCL and the jocl.org-JOCL. One basic idea there had been to reshape one of the libraries so that it could serve as some sort of "layer" placed over the other, but this idea has not been pursued any further.
I'm curious to hear other thoughts and ideas about how such a "merge" might actually be accomplished, considering that the projects are built on very different infrastructures.
I am also registered to the list, but I'm not seeing anything happen. Could you point me to where, for example, they demonstrate creating an instance of a class template? I would very much like to see it. Thanks
Yes, JCuda, etc could be rebased on JavaCPP, that's the idea IMO. There are no bindings for OpenCL or Vulkan just because I don't have the time to do everything, that's all.
@jcuda @saudet A little off-topic, but related: I am very interested in JNR, but to be honest, I wasn't able to find any kind of benchmarking or even a detailed comparison. Previously we had JNA and JNI: JNA was slow but easy to use, while for high-performance work we went with JNI where possible, right? That is also the way of JavaCPP and JCuda. Could you guys post some reference document comparing JNR to JNI from a performance perspective? I would love to understand the internal architecture of JNR, especially to see its performance benefits over JNI. I am aware it goes far beyond performance alone, but when you run a 200-node CPU/GPU cluster, performance (throughput and latency) matters. The complexity of adoption can always be handled :-)
I know about these links for JNR: http://www.oracle.com/technetwork/java/jvmls2013nutter-2013526.pdf https://github.com/bytedeco/javacpp/issues/70
@saudet thanks buddy,
I also suggest moving the discussion about JCuda vs JavaCPP to Marco's thread, as he requested: https://forum.byte-welt.net/t/about-jcuda-and-javacpp/19538/3
NOTE: To move beyond theoretical discussion, since performance is the top priority, I suggest that you, @saudet, create a new GitHub project under JavaCPP where we can develop a real benchmark for JCuda- and JavaCPP-based CUDA (as Vulkan and OpenCL are not available at the moment), so we can analyze code syntax differences/similarities as well as performance in a unified way.
I also suggest deciding which benchmark framework should be used to build this stuff:
- JMH
- http://www.spf4j.org/
- http://labs.carrotsearch.com/junit-benchmarks.html
Or here: https://stackoverflow.com/questions/7146207/what-is-the-best-macro-benchmarking-tool-framework-to-measure-a-single-threade
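Whichever framework is chosen, a sketch of the warmup + measurement structure such a tool automates may help frame the discussion. This is a minimal plain-Java skeleton; all class and method names here are hypothetical placeholders, and the dummy workload merely stands in for a JCuda or JavaCPP kernel launch:

```java
// Minimal microbenchmark skeleton in plain Java, only to illustrate the
// warmup + measurement loop that a framework like JMH handles properly.
public class BenchSkeleton {
    interface Workload { long run(); }

    // Returns average nanoseconds per call after warming up the JIT.
    static double measure(Workload w, int warmup, int iterations) {
        long sink = 0;
        for (int i = 0; i < warmup; i++) sink += w.run();   // warmup: let the JIT compile
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) sink += w.run();
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.println(sink);           // consume result to avoid dead-code elimination
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        // Dummy workload standing in for a wrapped CUDA driver call.
        double nsPerCall = measure(() -> System.nanoTime(), 10_000, 100_000);
        System.out.println("avg ns/call = " + nsPerCall);
    }
}
```

A real benchmark would of course need more care (forked JVMs, statistical iterations, blackholes), which is an argument for JMH over a hand-rolled harness.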
Sure, but who will take the time to do it? I keep telling everyone I don't have the time to do everything by myself...
I will create the initial project and adapt a few basic CUDA algorithms to be implemented in JCuda and JavaCPP. I hope we can find more users from the other side (JCuda) to participate as well.
Ok, cool, thanks! Can we name the repo "benchmarks"? or would there be a better name?
I think keeping it generic is best, so "benchmarks" sounds good. Beyond this, I would also like to later (time permitting) test JavaCPP vs JNR with some simple dummy uuid function call tests (uuid_generate etc. from libuuid), as a kind of template:
```c
#include <uuid/uuid.h>

void uuid_generate(uuid_t out);
void uuid_generate_random(uuid_t out);
void uuid_generate_time(uuid_t out);
int uuid_generate_time_safe(uuid_t out);
```
I am also registered to the list, but I'm not seeing anything happen. Could you point me to where, for example, they demonstrate creating an instance of a class template? I would very much like to see it. Thanks
Again, I'm not so deeply involved there, but their primary goal is (to my understanding) not something that is based on accessing libraries via their definitions in header files. My comment mainly referred to the high-level project goals (i.e. accessing native libraries, basically regardless of which language they have been written in), together with the low-level efforts in the JVM. At least, there are some interesting threads in the mailing list, and the repo at http://hg.openjdk.java.net/panama/panama/jdk/shortlog/d83170db025b seems rather active.
Regarding the benchmarks: As I also mentioned in the forum, creating a sensible benchmark may be difficult. Even more so if it is supposed to cover the point that is becoming increasingly important, namely multithreading. But setting up a basic skeleton with basic sample code could certainly help to figure out what can be measured, and how it can be measured sensibly.
(As for the topic of merging libraries, the API differences might actually be more important, but this repo would automatically serve this purpose, to some extent - namely, by showing how the same task is accomplished with the different libraries)
@jcuda
Thanks for your comments. Actually, based on the presentation, it even looks like they have added more processing layers than JNI has :-))), but I will need to investigate the whole story further. Thanks for the link.
Regarding the benchmark: that is the point, establishing a kind of skeleton. By multithreading, do you mean CPU multithreading? I think it will be good, along with the template definition, to discuss possible algorithms to be implemented and their general specification. Good point.
That is exactly the point, because I also do not know at the moment how big the differences are, i.e., how big a breakthrough we are talking about.
@archenroot I created the repository and gave you admin access: https://github.com/bytedeco/benchmarks Feel free to arrange it as you see fit and let me know if you need anything else! Thanks
@saudet Good starting point. I will try to do as discussed: prepare a common benchmark structure/template and a list of interesting algorithms (including, of course, multi-threaded ones from the client perspective).
In some cases I am also thinking of providing an existing C/C++ implementation, if available, to compare native performance, but I will focus on JCuda vs JavaCPP at first.
Thanks again.
By multithreading you mean CPU multithreading?
Yes. CUDA offers streams and some synchronization methods that are basically orchestrated from client side. (This may involve stream callbacks, which only have been introduced in JCuda recently, an example is at https://github.com/jcuda/jcuda-samples/blob/master/JCudaSamples/src/main/java/jcuda/driver/samples/JCudaDriverStreamCallbacks.java )
As for the other "benchmarks": Some simple matrix multiplication could be one that creates a real workload. Others might be more artificial, in order to more easily tune the possible parameters. Just a rough example: One could create a kernel that just operates on a set of vector elements. Then one could create a vector with 1 million entries, and try different configurations - namely, copying Y elements and launching a kernel with grid size Y, doing this 1000000/Y times. This would mean
- process 100-element blocks, using 10000 copies/launches
- process 1000-element blocks, using 1000 copies/launches
- process 10000-element blocks, using 100 copies/launches
- process 100000-element blocks, using 10 copies/launches
(the kernel itself could then also be "trivial", or create a real workload by throwing in some useless sin(cos(tan(sin(cos(tan(x)))))) computations...)
Again, this is just a vague idea.
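The parameter sweep sketched above can be written down in a few lines. This is just an illustration of the block-size/launch-count arithmetic; the class and method names are hypothetical, and the "launch" is a stand-in for real cuMemcpyHtoD/cuLaunchKernel calls via JCuda or the JavaCPP CUDA presets:

```java
// Sketch of the sweep described above: process a 1,000,000-element vector
// in blocks of Y elements, which requires 1000000 / Y copy + launch pairs.
public class BlockSweep {
    static final int N = 1_000_000;

    // Number of copies/launches needed to cover the whole vector.
    static int launchesFor(int blockSize) {
        return N / blockSize;
    }

    public static void main(String[] args) {
        for (int blockSize : new int[] {100, 1_000, 10_000, 100_000}) {
            // A real benchmark would time a host-to-device copy plus a
            // kernel launch of grid size `blockSize` at this point.
            System.out.printf("process %d-element blocks, using %d copies/launches%n",
                    blockSize, launchesFor(blockSize));
        }
    }
}
```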
FWIW, being able to compile CUDA kernels in Java is something we can do easily with JavaCPP as well. To get a prettier interface, we only need to finish what @cypof has started in https://github.com/bytedeco/javacpp/pull/138.
@archenroot @jcuda May I add that the actual computation time of the GPU kernels is not that important for the benchmarks. What we need to measure here is the overhead over plain C/C++ CUDA driver calls.
So, let's say that enqueuing the "dummy" kernel costs X time, while the Java wrapper needs k * X time. We are interested in knowing k1 (JCuda) and k2 (JavaCPP CUDA), that is, k1*X/X, k2*X/X, and/or (k1*X)/(k2*X).
In my opinion, (k1*X)/(k2*X) is the easiest of those to measure.
Compiling CUDA kernels at runtime is already possible with NVRTC (a runtime compiler). An example is at https://github.com/jcuda/jcuda-samples/blob/master/JCudaSamples/src/main/java/jcuda/nvrtc/samples/JNvrtcVectorAdd.java . (Of course, one could add some convenience layer around this. But regarding performance, the compilation of kernels is not relevant in most use cases.) I'll have a look at the linked PR, though.
@jcuda Oh, interesting. It's nice to be able to do this with C++ in general and not only CUDA though.
In fact, the other sample at https://github.com/jcuda/jcuda-samples/blob/master/JCudaSamples/src/main/java/jcuda/nvrtc/samples/JNvrtcLoweredNames.java shows that this also supports "true" C++, with namespace, templates etc.
(The sample does not really "do" anything, it only shows how the mangled names may be accessed afterwards).
The NVRTC was introduced only recently, and before it was introduced, one problem indeed was the lack of proper C++ support for kernels in JCuda: It was possible to compile kernels that contained templates by using the offline CUDA compiler (which is backed by a C++ compiler like that of Visual Studio). The result was a PTX file with one function for each template instance. But of course, with oddly mangled names that had to be accessed directly via strings from Java. With the NVRTC, this problem is at least alleviated.
But it doesn't help for C++ code running on the host, right? So, if I understand correctly, NVRTC doesn't help for something like Thrust: https://github.com/bytedeco/javacpp/wiki/Interface-Thrust-and-CUDA
That's right. And the question has already been asked occasionally, aiming at something like "JThrust". But I think that the API of Thrust (which on some level is rather template-heavy) does not map sooo well to Java. I think a library with functionality similar to that of Thrust, but designed in a more Java-idiomatic way, would be preferable.
(A while ago I considered at least creating some bindings for https://nvlabs.github.io/cub/ , as asked for in https://github.com/jcuda/jcuda-main/issues/11 , but I'm hesitant to commit to another project - I'm running out of spare time....)
@jcuda @archenroot @blueberry FYI, wrapper overhead might become more important since kernel launch overhead has apparently been dramatically reduced with CUDA 9.1:
- Launch kernels up to 12x faster with new core optimizations
https://developer.nvidia.com/cuda-toolkit/whatsnew
They don't give any details/baseline of what they compared. A dedicated benchmark or comparison with CUDA 9.0 and 9.1 might be worthwhile. (I haven't updated to 9.1 yet - currently, the Maven release of 9.0 is on its way...)
@archenroot Any updates on the benchmark repo?
In the meantime, I've released presets for CUDA 9.1 :) http://search.maven.org/#search%7Cga%7C1%7Cbytedeco%20cuda
@jcuda - I am unfortunately busy with other projects at the moment and preparing to relocate with my family in the next 2 months, so I don't see benchmark progress as feasible from my side in the next 2-3 months...
@saudet - you are a deadly warrior :-) Thanks for the update.
FYI, commit https://github.com/bytedeco/javacpp-presets/commit/916b06032ecd00970c1bd8d2c2ac6bc7ac05e665 reduces the JNI wrapper overhead even further.