Rob Armstrong
`cudaDeviceSynchronize()` is not a device-side API, and it wouldn't make sense to call it from within a device function - by definition, if the function is waiting for a...
The `cuda_driver_api.h` header was removed from the CUDA Toolkit a long time ago, and its content was refactored into other headers - I'm not 100% certain which release offhand, but I...
I would ask in a forum relating to Ollama. CUDA hasn't provided that header in some time. They may have a similarly-named header, but I'm not able to speak to...
As a quick note: with the update to CMake, this should hopefully work better in the 12.8 release and newer - please let us know how it goes.
Given the age of this issue, I will close it as resolved.
Ok, will re-open and take a closer look.
Given the age of this issue, I will close it as resolved.
Sorry, I should have been clearer. I'm doing a cleanup pass through the repo and took note of the issue to get it fixed, but since the last note was...
Hi @Zeyu-W, thanks for reporting this issue. I agree that multiplying by K here is incorrect. But `SHMEM_STRIDE` is already defined as `N * BLOCK_ROW_TILES`; I think also multiplying by M...
Thanks for your reply - let me take another look at it. This isn't originally my code so I may have misread it when I was initially looking at it.