
Correct usage of `cuda.core._memory.Buffer`?

Open carterbox opened this issue 6 months ago • 6 comments

I am trying to allocate workspace for cuBLASLt using `cuda.core`. First, I allocate a memory `Buffer` like so:

```python
from cuda.core.experimental import Device

device = Device()
device.set_current()
buffer = device.allocate(size=size, stream=stream)
raw_workspace_ptr: int = buffer.handle.getPtr()
```
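
For what it's worth, here is a sketch of how I could hold the allocation more defensively (not a confirmed diagnosis, and the helper name is just illustrative): since `device.allocate` is a stream-ordered allocation, I could synchronize the allocating stream before anything else touches the pointer, and keep the `Buffer` object referenced for as long as the raw pointer is in use.

```python
def allocate_workspace(size: int):
    """Sketch: stream-ordered allocation with explicit ordering and lifetime."""
    from cuda.core.experimental import Device  # deferred import, illustration only

    device = Device()
    device.set_current()
    stream = device.create_stream()
    buffer = device.allocate(size=size, stream=stream)
    # Make the allocation visible outside the allocating stream before use.
    stream.sync()
    # The int is only valid while `buffer` is alive, so return both together.
    return buffer, buffer.handle.getPtr()
```

I don't know whether either of these is actually required here, which is part of the question.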

Then later I pass this pointer to cuBLASLt via the nvmath-python bindings like so:

```python
cublaslt.matmul(
    self.handle,
    self.mm_desc,
    self.alpha.ctypes.data,
    a.data_ptr,
    self.a_layout_ptr,
    b.data_ptr,
    self.b_layout_ptr,
    self.beta.ctypes.data,
    c_ptr,
    self.c_layout_ptr,
    self.result.data_ptr,
    self.d_layout_ptr,
    algorithm_struct.ctypes.data,
    raw_workspace_ptr,  # pointer here
    self.workspace_size,  # same size used here as to allocate the buffer
    stream_holder.ptr,
)
```

The problem is that when I use this `Buffer` abstraction from `cuda.core`, I get errors from the CUDA runtime. For example, when running under compute-sanitizer:

```
========= Invalid __global__ write of size 4 bytes
=========     at void dot_kernel<float, (int)128, (int)0, cublasDotParams<cublasGemvTensorStridedBatched<const float>, cublasGemvTensorStridedBatched<float>>>(T4)+0xd70
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f1ac5d49420 is out of bounds
=========     and is 139697130345438 bytes after the nearest allocation at 0xd00000000 of size 67 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame:  [0x7ee3eb5] in libcublasLt.so.12
=========         Host Frame:  [0x7f4a3f7] in libcublasLt.so.12
=========         Host Frame:  [0x1b1ab14] in libcublasLt.so.12
=========         Host Frame:  [0x1b1c010] in libcublasLt.so.12
=========         Host Frame:  [0xf81c1d] in libcublasLt.so.12
=========         Host Frame:  [0x10c0b58] in libcublasLt.so.12
=========         Host Frame: cublasLtMatmul [0x10c4dcc] in libcublasLt.so.12
=========         Host Frame: __pyx_f_6nvmath_8bindings_10cycublasLt_cublasLtMatmul(void*, void*, void const*, void const*, void*, void const*, void*, void const*, void const*, void*, void*, void*, cublasLtMatmulAlgo_t const*, void*, unsigned long, CUstream_st*) [0x57b5] in cycublasLt.cpython-312-x86_64-linux-gnu.so
=========         Host Frame: __pyx_f_6nvmath_8bindings_8cublasLt_matmul(long, long, long, long, long, long, long, long, long, long, long, long, long, long, unsigned long, long, int) [0x5ca7d] in cublasLt.cpython-312-x86_64-linux-gnu.so
=========         Host Frame: __pyx_pw_6nvmath_8bindings_8cublasLt_13matmul(_object*, _object* const*, long, _object*) [0x78fae] in cublasLt.cpython-312-x86_64-linux-gnu.so
```

The sanitizer seems to be reporting that the workspace pointer is an invalid memory address. When I use the allocators provided by CuPy or PyTorch instead, there are no errors.

Looking for opinions on:

  • Whether I am allocating / using this `Buffer` in the expected manner
  • How I could create a reproducer or another memory-validation test that doesn't require setting up an entire matmul for cuBLASLt
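
As a strawman for the second point, this is the kind of standalone check I had in mind: memset the whole buffer on the device and copy it back, so an undersized or stale allocation should trip compute-sanitizer (or the round-trip assert) without involving cuBLASLt at all. The helper name and the choice of `cuda.bindings.runtime` here are just illustrative, not a claim about the right way to do it:

```python
import ctypes

def validate_buffer(ptr: int, size: int) -> None:
    """Write a byte pattern over [ptr, ptr + size) on the device and read it back."""
    from cuda.bindings import runtime as cudart  # deferred import, illustration only

    (err,) = cudart.cudaMemset(ptr, 0xAB, size)
    assert err == cudart.cudaError_t.cudaSuccess, err

    host = (ctypes.c_ubyte * size)()
    (err,) = cudart.cudaMemcpy(
        ctypes.addressof(host), ptr, size,
        cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost,
    )
    assert err == cudart.cudaError_t.cudaSuccess, err
    assert all(b == 0xAB for b in host), "buffer contents did not round-trip"
```

If something like this is a reasonable way to exercise the `Buffer`, I'm happy to turn it into a proper reproducer.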

carterbox avatar Apr 11 '25 01:04 carterbox