Size constraint for cudaMallocHost
Not sure if this has been brought up before, but I ran into an issue while trying to allocate host memory with a size larger than Integer.MAX_VALUE. E.g. running:
JCuda.setExceptionsEnabled(true);
JCuda.cudaMallocHost(pointer, 2384040000L);
gives:
java.lang.IllegalArgumentException: capacity < 0: (-1910927296 < 0)
at java.base/java.nio.Buffer.createCapacityException(Buffer.java:256)
at java.base/java.nio.Buffer.<init>(Buffer.java:220)
at java.base/java.nio.ByteBuffer.<init>(ByteBuffer.java:281)
at java.base/java.nio.ByteBuffer.<init>(ByteBuffer.java:289)
at java.base/java.nio.MappedByteBuffer.<init>(MappedByteBuffer.java:90)
at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:158)
at jcuda.runtime.JCuda.cudaMallocHostNative(Native Method)
at jcuda.runtime.JCuda.cudaMallocHost(JCuda.java:4313)
After digging around, this seems to be caused by the fact that although cudaMallocHost takes a long size argument (which corresponds to the native CUDA size_t), Java Buffers only have an integer capacity. When it tries to create a Java DirectByteBuffer with the JNI NewDirectByteBuffer call, the size is cast back to an int. The capacity then ends up overflowing to a negative value, causing the error.
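The negative capacity in the exception message is exactly what that narrowing produces; a quick check, using the size from the call above:

long requested = 2384040000L;
int truncated = (int) requested;   // yields -1910927296, the value in the exception
System.out.println(truncated);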
Currently the workaround for me is to split the allocation into multiple chunks, so as to keep each one below the limit (a rough sketch of this follows the list below). Some options I can think of for dealing with this issue in JCuda are, in increasing order of difficulty:
- Update the documentation to list this as a known limitation
- Check for values larger than the limit and give a more descriptive error/warning
- Avoid using Java Buffers; instead use some custom class that wraps the pinned memory and supports indexing with a long value, etc.
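For reference, a minimal sketch of the chunking workaround mentioned above (the chunk size is arbitrary, and error handling and per-chunk usage are only hinted at):

import jcuda.Pointer;
import jcuda.runtime.JCuda;

public class ChunkedPinnedAlloc
{
    public static void main(String[] args)
    {
        JCuda.setExceptionsEnabled(true);

        long totalBytes = 2384040000L;   // the size from the report
        long chunkBytes = 1L << 30;      // 1 GiB per chunk, safely below Integer.MAX_VALUE
        int numChunks = (int) ((totalBytes + chunkBytes - 1) / chunkBytes);

        Pointer[] chunks = new Pointer[numChunks];
        for (int i = 0; i < numChunks; i++)
        {
            long size = Math.min(chunkBytes, totalBytes - (long) i * chunkBytes);
            chunks[i] = new Pointer();
            JCuda.cudaMallocHost(chunks[i], size);   // each chunk stays below the int limit
        }

        // ... use chunks[i].getByteBuffer() per chunk ...

        for (Pointer chunk : chunks)
        {
            JCuda.cudaFreeHost(chunk);
        }
    }
}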
Please let me know if I've just made a silly mistake somewhere. Thanks for your time, I appreciate all the work that goes into this :)
Edit: just found this reply to a forum post, indicating that you might be aware of the issue
I think the linked forum question mainly refers to cudaHostAlloc having some limits on the CUDA side. Even if it accepts a long value, and even if you have 32GB of RAM, this does not necessarily mean that you can allocate 16GB at once. (The exact limit is not specified, IIRC, but that thread is a few years old, and I might need a refresher here.)
Beyond that, I'm generally aware of the difficulties related to "memory sizes" and the different sizes of size_t and int. I usually try to handle that insofar as a size_t or long is usually translated to a Java long (which should cover most memory sizes). But there still are some caveats in the interoperation with Java, and what you just mentioned is such a caveat. The first options (adding documentation and doing a sanity check) are things that I'll certainly consider.
What you suggested as option 3 could be broken down into two steps:
- Treating the allocation like that of sun.misc.Unsafe#allocateMemory: it could return the long address from the internal allocation, and users could have fun with Unsafe and manipulate the data directly, as they want
- On top of that long address, there could be some sort of convenience class to access this memory
Now, one reason for me to have used the ByteBuffer was exactly that this is such a "convenience class". Currently, the main (or rather the only) way to access the data that is allocated with cudaMallocHost is by obtaining the byte buffer via Pointer#getByteBuffer. And I think that having the option to write
pointer.getByteBuffer().asFloatBuffer().put(myJavaFloatArray)
is important.
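For context, a minimal sketch of that pattern, assuming an allocation that stays below the int limit (the variable names are placeholders, and the explicit byte-order call is just defensive):

import java.nio.ByteOrder;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.runtime.JCuda;

// Pinned host allocation, filled in bulk through its ByteBuffer/FloatBuffer view
int numFloats = 1_000_000;
float[] myJavaFloatArray = new float[numFloats];

Pointer pointer = new Pointer();
JCuda.cudaMallocHost(pointer, (long) numFloats * Sizeof.FLOAT);
pointer.getByteBuffer()
    .order(ByteOrder.nativeOrder())
    .asFloatBuffer()
    .put(myJavaFloatArray);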
It is true that this limits the size of the host allocation, due to the limitation of ByteBuffer that you linked to in the JDK code, and in its interface in general. But every change that would avoid the use of ByteBuffer would require some larger restructuring here. (It might be possible to solve that in a backward-compatible way, but it may not be entirely trivial.)
I might consider exposing the "address" from cudaMallocHost more directly. (It could just be some public static long cudaMallocHostAndReturnAddress, FWIW...) But I would hesitate to start writing any sort of "convenience class" on top of that. It would either be "inconvenient" (similar to Unsafe), or an attempt to emulate the roughly 100 ...Buffer... classes in the java.nio package that offer all the sorts of "convenience" that users may want...
If you strongly need that 'raw address', I could probably add such a function in the next release. But for now, the workaround of breaking the allocation into chunks probably makes more sense.
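To make the "raw address plus Unsafe" idea a little more concrete, here is a hypothetical sketch. Note that cudaMallocHostAndReturnAddress is only the name floated above and does not exist in JCuda, and sun.misc.Unsafe has to be obtained reflectively:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class RawAddressSketch
{
    // The address would come from something like the (non-existent)
    // cudaMallocHostAndReturnAddress mentioned above.
    static void fillWithOnes(long address, long sizeInBytes) throws Exception
    {
        // Obtain the Unsafe instance reflectively (the usual workaround)
        Field field = Unsafe.class.getDeclaredField("theUnsafe");
        field.setAccessible(true);
        Unsafe unsafe = (Unsafe) field.get(null);

        // Reads and writes are indexed with long offsets, so the int
        // capacity limit of ByteBuffer does not apply here
        for (long offset = 0; offset + 4 <= sizeInBytes; offset += 4)
        {
            unsafe.putFloat(address + offset, 1.0f);
        }
    }
}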
Yep, I was mainly referring to your brief aside at the bottom of the reply that I linked, where you mentioned the fact that Java arrays etc. are indexed with integers.
But yes, I agree with what you're saying. I can see some merit in being able to access the address directly and using Unsafe from there; however, at this point I think I will just stick with chunking the allocation. ByteBuffer and its related classes are definitely convenient to use, and trying to reinvent them to work with long indexing would be a big task for rather limited gain.
Happy to stick with the workaround for now; I just thought I would point out the issue in case I was missing some easy way to fix it. Thanks again for your time!
Websearches indicate that there are some attempts to provide a BigByteBuffer with long indices, but there doesn't seem to be a standard solution. There's also the question of what one could do with such a buffer on the Java side. Iterating over it with a good old for-loop can hardly be the goal. So there would have to be some sort of "bulk" operations that either copy the contents into arrays or buffers that are then processed further, or (preferably) methods that offer views on the data with the standard ByteBuffer interface - as in
ByteBuffer chunk = bigByteBuffer.slice(longStartIndex, longEndIndex);
In any case, the point about checking the limit in cudaMallocHost and documenting it is a valid one. I'll try to do that for the next release.
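To make the "views" idea a bit more tangible, a rough, hypothetical sketch of such a wrapper built on top of chunked pinned allocations. Neither the class nor its methods are part of JCuda, slices are assumed not to cross chunk boundaries, and freeing is omitted:

import java.nio.ByteBuffer;
import jcuda.Pointer;
import jcuda.runtime.JCuda;

// Hypothetical long-indexed access over several pinned chunks
public class ChunkedPinnedBuffer
{
    private final ByteBuffer[] chunks;
    private final long chunkSize;

    public ChunkedPinnedBuffer(long totalBytes, long chunkSize)
    {
        this.chunkSize = chunkSize;
        int numChunks = (int) ((totalBytes + chunkSize - 1) / chunkSize);
        chunks = new ByteBuffer[numChunks];
        for (int i = 0; i < numChunks; i++)
        {
            long size = Math.min(chunkSize, totalBytes - (long) i * chunkSize);
            Pointer pointer = new Pointer();
            JCuda.cudaMallocHost(pointer, size);
            chunks[i] = pointer.getByteBuffer();
        }
    }

    // A standard ByteBuffer view of [offset, offset + length),
    // assuming the range lies entirely within a single chunk
    public ByteBuffer slice(long offset, int length)
    {
        int chunkIndex = (int) (offset / chunkSize);
        int localOffset = (int) (offset % chunkSize);
        ByteBuffer view = chunks[chunkIndex].duplicate();
        view.position(localOffset);
        view.limit(localOffset + length);
        return view.slice();
    }
}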
Yeah, it looks like there were some BigByteBuffer proposals in the past that didn't end up being brought to life, unfortunately.
For my current use case I am in fact just iterating over the buffer with a for loop to directly generate the data to send to the GPU, as I have no need to hold that data in a Java array first (so I'm saving memory space, and time by avoiding the extra memory copy). So the for-loop method of access is certainly useful. The bulk operations and views on the data that you mention would definitely make life easier in some cases, but I wouldn't say they are absolutely essential.
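For what it's worth, a small sketch of that fill-in-place pattern, under the same assumptions as before (an allocation below the int limit; computeValue stands in for whatever generates the data):

import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.runtime.JCuda;
import jcuda.runtime.cudaMemcpyKind;

// Generate the data directly in the pinned buffer (no intermediate float[]),
// then copy it to the device in a single transfer
int numFloats = 1_000_000;
long sizeInBytes = (long) numFloats * Sizeof.FLOAT;

Pointer hostPointer = new Pointer();
JCuda.cudaMallocHost(hostPointer, sizeInBytes);
FloatBuffer floats = hostPointer.getByteBuffer()
    .order(ByteOrder.nativeOrder()).asFloatBuffer();

for (int i = 0; i < numFloats; i++)
{
    floats.put(i, computeValue(i));   // computeValue is a placeholder
}

Pointer devicePointer = new Pointer();
JCuda.cudaMalloc(devicePointer, sizeInBytes);
JCuda.cudaMemcpy(devicePointer, hostPointer, sizeInBytes,
    cudaMemcpyKind.cudaMemcpyHostToDevice);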
But yes, for now the documentation/limit checking would be great, thanks :)