Size constraint for cudaMallocHost
Not sure if this has been brought up before, but I ran into an issue while trying to allocate host memory with a size larger than Integer.MAX_VALUE. E.g. running:
JCuda.setExceptionsEnabled(true);
JCuda.cudaMallocHost(pointer, 2384040000L);
gives:
java.lang.IllegalArgumentException: capacity < 0: (-1910927296 < 0)
at java.base/java.nio.Buffer.createCapacityException(Buffer.java:256)
at java.base/java.nio.Buffer.<init>(Buffer.java:220)
at java.base/java.nio.ByteBuffer.<init>(ByteBuffer.java:281)
at java.base/java.nio.ByteBuffer.<init>(ByteBuffer.java:289)
at java.base/java.nio.MappedByteBuffer.<init>(MappedByteBuffer.java:90)
at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:158)
at jcuda.runtime.JCuda.cudaMallocHostNative(Native Method)
at jcuda.runtime.JCuda.cudaMallocHost(JCuda.java:4313)
After digging around, this seems to be caused by the fact that although cudaMallocHost takes a long size argument (which corresponds to the native CUDA size_t), Java Buffers only have an integer capacity. When it tries to create a Java DirectByteBuffer with the JNI NewDirectByteBuffer call, the size is cast back to an int. The capacity then ends up overflowing to a negative value, causing the error.
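The negative capacity in the exception message is exactly what that narrowing produces; a quick check, using the size from the call above:

long requested = 2384040000L;
int truncated = (int) requested;   // yields -1910927296, the value in the exception
System.out.println(truncated);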
Currently the workaround for me is to split the allocation into multiple chunks, so as to keep each one below the limit (a rough sketch of this follows the list below). Some options I can think of for dealing with this issue in JCuda are, in increasing order of difficulty:
- Update the documentation to list this as a known limitation
- Check for values larger than the limit and give a more descriptive error/warning
- Avoid using Java Buffers; instead use some custom class that wraps the pinned memory and supports indexing with a long value, etc.
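For reference, a minimal sketch of the chunking workaround mentioned above (the chunk size is arbitrary, and error handling and per-chunk usage are only hinted at):

import jcuda.Pointer;
import jcuda.runtime.JCuda;

public class ChunkedPinnedAlloc
{
    public static void main(String[] args)
    {
        JCuda.setExceptionsEnabled(true);

        long totalBytes = 2384040000L;   // the size from the report
        long chunkBytes = 1L << 30;      // 1 GiB per chunk, safely below Integer.MAX_VALUE
        int numChunks = (int) ((totalBytes + chunkBytes - 1) / chunkBytes);

        Pointer[] chunks = new Pointer[numChunks];
        for (int i = 0; i < numChunks; i++)
        {
            long size = Math.min(chunkBytes, totalBytes - (long) i * chunkBytes);
            chunks[i] = new Pointer();
            JCuda.cudaMallocHost(chunks[i], size);   // each chunk stays below the int limit
        }

        // ... use chunks[i].getByteBuffer() per chunk ...

        for (Pointer chunk : chunks)
        {
            JCuda.cudaFreeHost(chunk);
        }
    }
}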
Please let me know if I've just made a silly mistake somewhere. Thanks for your time, I appreciate all the work that goes into this :)
Edit: just found this reply to a forum post, indicating that you might be aware of the issue
I think the linked forum question mainly refers to cudaHostAlloc having some limits on the CUDA side. Even if it accepts a long value, and even if you have 32GB of RAM, this does not necessarily mean that you can allocate 16GB at once. (The exact limit is not specified, IIRC, but that thread is a few years old, and I might need a refresher here.)
Beyond that, I'm generally aware of the difficulties related to "memory sizes" and the different sizes of size_t and int. I usually try to handle that insofar as a size_t or long is usually translated to a Java long (which should cover most memory sizes). But there still are some caveats in the interoperation with Java, and what you just mentioned is such a caveat. The first options (adding documentation and doing a sanity check) are things that I'll certainly consider.
What you suggested as option 3 could be broken down into two steps:
- Treating the allocation like that of sun.misc.Unsafe#allocateMemory: it could return the long address from the internal allocation, and users could have fun with Unsafe and manipulate the data directly, as they want
- On top of that long address, there could be some sort of convenience class to access this memory
Now, one reason for me to have used the ByteBuffer was exactly that this is such a "convenience class". Currently, the main (or rather the only) way to access the data that is allocated with cudaMallocHost is by obtaining the byte buffer via Pointer#getByteBuffer. And I think that having the option to write
pointer.getByteBuffer().asFloatBuffer().put(myJavaFloatArray)
is important.
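For context, a minimal sketch of that pattern, assuming an allocation that stays below the int limit (the variable names are placeholders, and the explicit byte-order call is just defensive):

import java.nio.ByteOrder;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.runtime.JCuda;

// Pinned host allocation, filled in bulk through its ByteBuffer/FloatBuffer view
int numFloats = 1_000_000;
float[] myJavaFloatArray = new float[numFloats];

Pointer pointer = new Pointer();
JCuda.cudaMallocHost(pointer, (long) numFloats * Sizeof.FLOAT);
pointer.getByteBuffer()
    .order(ByteOrder.nativeOrder())
    .asFloatBuffer()
    .put(myJavaFloatArray);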
It is true that this limits the size of the host allocation, due to the limitation of ByteBuffer that you linked to in the JDK code, and in its interface in general. But every change that would avoid the use of ByteBuffer would require some larger restructuring here. (It might be possible to solve that in a backward-compatible way, but it may not be entirely trivial.)
I might consider exposing the "address" from cudaMallocHost more directly. (It could just be some public static long cudaMallocHostAndReturnAddress, FWIW...) But I would hesitate to start writing any sort of "convenience class" on top of that. It would either be "inconvenient" (similar to Unsafe), or an attempt to emulate the roughly 100 ...Buffer... classes in the java.nio package that offer all the sorts of "convenience" that users may want...
If you strongly need that 'raw address', I could probably add such a function in the next release. But for now, the workaround of breaking the allocation into chunks probably makes more sense.
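To make the "raw address plus Unsafe" idea a little more concrete, here is a hypothetical sketch. Note that cudaMallocHostAndReturnAddress is only the name floated above and does not exist in JCuda, and sun.misc.Unsafe has to be obtained reflectively:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class RawAddressSketch
{
    // The address would come from something like the (non-existent)
    // cudaMallocHostAndReturnAddress mentioned above.
    static void fillWithOnes(long address, long sizeInBytes) throws Exception
    {
        // Obtain the Unsafe instance reflectively (the usual workaround)
        Field field = Unsafe.class.getDeclaredField("theUnsafe");
        field.setAccessible(true);
        Unsafe unsafe = (Unsafe) field.get(null);

        // Reads and writes are indexed with long offsets, so the int
        // capacity limit of ByteBuffer does not apply here
        for (long offset = 0; offset + 4 <= sizeInBytes; offset += 4)
        {
            unsafe.putFloat(address + offset, 1.0f);
        }
    }
}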
Yep, I was mainly referring to your brief aside at the bottom of the reply that I linked, where you mentioned the fact that Java arrays etc. are indexed with integers.
But yes, I agree with what you're saying. I can see some merit in being able to access the address directly and using Unsafe from there; however, at this point I think I will just stick with chunking the allocation. ByteBuffer and its related classes are definitely convenient to use, and trying to reinvent them to work with long indexing would be a big task for rather limited gain.
Happy to stick with the workaround for now; I just thought I would point out the issue in case I was missing some easy way to fix it. Thanks again for your time!
Websearches indicate that there are some attempts to provide a BigByteBuffer with long indices, but there doesn't seem to be a standard solution. There's also the question of what one could do with such a buffer on the Java side. Iterating over it with a good old for-loop can hardly be the goal. So there would have to be some sort of "bulk" operations that either copy the contents into arrays or buffers that are then processed further, or (preferably) methods that offer views on the data with the standard ByteBuffer interface - as in
ByteBuffer chunk = bigByteBuffer.slice(longStartIndex, longEndIndex);
In any case, the point about checking the limit in cudaMallocHost and documenting it is a valid one. I'll try to do that for the next release.
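To make the "views" idea a bit more tangible, a rough, hypothetical sketch of such a wrapper built on top of chunked pinned allocations. Neither the class nor its methods are part of JCuda, slices are assumed not to cross chunk boundaries, and freeing is omitted:

import java.nio.ByteBuffer;
import jcuda.Pointer;
import jcuda.runtime.JCuda;

// Hypothetical long-indexed access over several pinned chunks
public class ChunkedPinnedBuffer
{
    private final ByteBuffer[] chunks;
    private final long chunkSize;

    public ChunkedPinnedBuffer(long totalBytes, long chunkSize)
    {
        this.chunkSize = chunkSize;
        int numChunks = (int) ((totalBytes + chunkSize - 1) / chunkSize);
        chunks = new ByteBuffer[numChunks];
        for (int i = 0; i < numChunks; i++)
        {
            long size = Math.min(chunkSize, totalBytes - (long) i * chunkSize);
            Pointer pointer = new Pointer();
            JCuda.cudaMallocHost(pointer, size);
            chunks[i] = pointer.getByteBuffer();
        }
    }

    // A standard ByteBuffer view of [offset, offset + length),
    // assuming the range lies entirely within a single chunk
    public ByteBuffer slice(long offset, int length)
    {
        int chunkIndex = (int) (offset / chunkSize);
        int localOffset = (int) (offset % chunkSize);
        ByteBuffer view = chunks[chunkIndex].duplicate();
        view.position(localOffset);
        view.limit(localOffset + length);
        return view.slice();
    }
}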
Yeah, it looks like there were some BigByteBuffer proposals in the past that didn't end up being brought to life, unfortunately.
For my current use case I am in fact just iterating over the buffer with a for loop to directly generate the data to send to the GPU, as I have no need to hold that data in a Java array first (so I'm saving memory space, and time by avoiding the extra memory copy). So the for-loop method of access is certainly useful. The bulk operations and views on the data that you mention would definitely make life easier in some cases, but I wouldn't say they are absolutely essential.
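For what it's worth, a small sketch of that fill-in-place pattern, under the same assumptions as before (an allocation below the int limit; computeValue stands in for whatever generates the data):

import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.runtime.JCuda;
import jcuda.runtime.cudaMemcpyKind;

// Generate the data directly in the pinned buffer (no intermediate float[]),
// then copy it to the device in a single transfer
int numFloats = 1_000_000;
long sizeInBytes = (long) numFloats * Sizeof.FLOAT;

Pointer hostPointer = new Pointer();
JCuda.cudaMallocHost(hostPointer, sizeInBytes);
FloatBuffer floats = hostPointer.getByteBuffer()
    .order(ByteOrder.nativeOrder()).asFloatBuffer();

for (int i = 0; i < numFloats; i++)
{
    floats.put(i, computeValue(i));   // computeValue is a placeholder
}

Pointer devicePointer = new Pointer();
JCuda.cudaMalloc(devicePointer, sizeInBytes);
JCuda.cudaMemcpy(devicePointer, hostPointer, sizeInBytes,
    cudaMemcpyKind.cudaMemcpyHostToDevice);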
But yes, for now the documentation/limit checking would be great, thanks :)