Matthew Nicely
There is a generic `__global__` kernel defined [here](https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/device_kernel.h). The operator type passed as its template argument is unique to each operation. All of the kernel arguments are packed into a...
@qingyunqu were you able to determine the issue?
Thanks @lebedov for the update. If there's anything we (NVIDIA) can do to help please don't hesitate to ask :smile:
@znmeb Do you mind setting `export CUDA_VISIBLE_DEVICES=0` and rerunning *build.sh*?
A much easier workaround would be to allocate with CuPy's managed memory allocator (https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.ManagedMemory.html#cupy.cuda.ManagedMemory & https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.malloc_managed.html). This will allow the driver to migrate data back and forth between system and device memory...
What would be the SDDMM use cases?
@rkindi has your issue been solved?
@yuxgis did you figure out your issue?
@zhanggefan were your questions resolved with @hwu36's response?
This is not a CUTLASS bug; it has been fixed in the latest CUDA release.