somoclu
Reimplement GPU kernels in CUDA to replace Thrust
The Thrust kernels are somewhat inefficient for the multidimensional data structures. Testing is in the cuda_opt branch.
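For reference, here is a minimal sketch of what a hand-written replacement could look like: a single CUDA kernel that computes squared Euclidean distances between every data vector and every codebook vector in one pass, instead of composing Thrust transforms over the flattened arrays. The kernel name, launch geometry, and row-major layout are illustrative assumptions, not necessarily what the cuda_opt branch does.

```cuda
#include <cuda_runtime.h>

// Sketch only: fills distances[i * nNodes + j] with ||x_i - w_j||^2 for every
// data vector i and every codebook (map node) vector j. Row-major layout and
// all names are assumptions for illustration.
__global__ void squaredDistanceKernel(const float *data,      // nPoints x dim
                                      const float *codebook,  // nNodes  x dim
                                      float *distances,       // nPoints x nNodes
                                      int nPoints, int nNodes, int dim) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // data vector index
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // map node index
    if (i >= nPoints || j >= nNodes) return;
    float acc = 0.0f;
    for (int k = 0; k < dim; ++k) {
        float diff = data[i * dim + k] - codebook[j * dim + k];
        acc += diff * diff;
    }
    distances[i * nNodes + j] = acc;
}

void launchSquaredDistance(const float *dData, const float *dCodebook,
                           float *dDistances, int nPoints, int nNodes, int dim) {
    // One thread per (point, node) pair; a tuned kernel would tile the dim
    // loop through shared memory instead of streaming global memory per thread.
    dim3 block(32, 8);
    dim3 grid((nNodes + block.x - 1) / block.x,
              (nPoints + block.y - 1) / block.y);
    squaredDistanceKernel<<<grid, block>>>(dData, dCodebook, dDistances,
                                           nPoints, nNodes, dim);
}
```

Finding each point's best matching unit is then a per-row argmin over the distance buffer, which can be a second small kernel or a segmented reduction.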
Is there a working Thrust-free version of the Somoclu Python package available? I am experiencing an issue similar to #89, where GPU training is slower than CPU training. While I would expect the GPU to benefit from larger workloads, if I try to use more than a few hundred points, it throws the following error:
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  device free failed: an illegal memory access was encountered
Aborted
I am currently doing this with a 512x512 map and a feature vector length of 956. I can train with around 256 points, but trying 512 fails. With 256 points the GPU kernel is only slightly (about 3 seconds) faster than the dense CPU kernel.
I am currently running Debian Buster with a Quadro M1000M GPU, using the NVIDIA 410.48 drivers and CUDA 10.0 installed via the cuda_10.0.130_410.48_linux.run file provided by NVIDIA.
Thanks for any help you might be able to give.
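As a rough sanity check on the numbers reported above: assuming float32 buffers and assuming the GPU path materializes a full points-by-nodes distance matrix (an assumption about the implementation, not verified against the somoclu source), the device footprint can be estimated like this:

```cpp
#include <cstdio>

// Back-of-envelope device memory estimate for a 512x512 map with 956-dim
// features, assuming float32 buffers and a dense points-by-nodes distance
// matrix (the latter is an assumption about the implementation).
int main() {
    const long long nodes = 512LL * 512LL;             // 512x512 map
    const long long dim   = 956LL;                     // feature vector length
    const long long codebookBytes = nodes * dim * 4;   // ~1.00 GB
    const long long pointCounts[] = {256, 512};
    for (long long points : pointCounts) {
        long long dataBytes = points * dim * 4;        // ~1-2 MB
        long long distBytes = points * nodes * 4;      // 0.27 / 0.54 GB
        std::printf("%lld points: codebook %.2f GB + distances %.2f GB "
                    "+ data %.3f GB ~= %.2f GB\n",
                    points, codebookBytes / 1e9, distBytes / 1e9, dataBytes / 1e9,
                    (codebookBytes + distBytes + dataBytes) / 1e9);
    }
    return 0;
}
```

That puts the 512-point case around 1.5 GB before Thrust's own temporary allocations, which is uncomfortably close to the 2 GB of VRAM a Quadro M1000M typically has, and could explain why 256 points fits while 512 triggers the illegal memory access.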
...any status on this? It seems like the Thrust implementation has been taken over by CUDA (see https://www.reddit.com/r/cpp/comments/7erub1/anybody_still_using_thrust/ ), so I'm curious whether there is a way to call the updated routines in CUDA without having to fundamentally rewrite the kernels...
My understanding is that this is why the behavior in #89 happens, so it'd be great to use more modern GPU functionality (especially re: memory management).
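One possible middle ground, as a sketch rather than anything from the cuda_opt branch: expand ||x - w||^2 = ||x||^2 + ||w||^2 - 2 * (x . w) and hand the expensive -2 * X * W^T term to cuBLAS as a single GEMM, keeping only small custom (or Thrust) steps for the norms and the argmin. The function and variable names below are illustrative assumptions.

```cuda
#include <cublas_v2.h>

// Sketch: compute the -2 * X * W^T part of ||x_i - w_j||^2 with cuBLAS.
// X is nPoints x dim (data), W is nNodes x dim (codebook), both row-major;
// distances is nPoints x nNodes, row-major. All names are illustrative.
void gemmDistances(cublasHandle_t handle,
                   const float *dData,      // nPoints x dim
                   const float *dCodebook,  // nNodes  x dim
                   float *dDistances,       // nPoints x nNodes
                   int nPoints, int nNodes, int dim) {
    const float alpha = -2.0f, beta = 0.0f;
    // cuBLAS is column-major, so compute the transposed result instead:
    // D^T (nNodes x nPoints) = -2 * W * X^T, where the row-major buffers are
    // already W^T and X^T when read column-major.
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                nNodes, nPoints, dim,
                &alpha,
                dCodebook, dim,   // column-major dim x nNodes, transposed to W
                dData, dim,       // column-major dim x nPoints, used as X^T
                &beta,
                dDistances, nNodes);
}
```

A small follow-up kernel (or a Thrust transform) would then add ||w_j||^2 to each entry before the per-row argmin that picks the best matching unit; the ||x_i||^2 term is constant per row, so it can be skipped when only the argmin is needed.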