Multi-GPU Support
https://github.com/benfred/implicit/blob/42df436936a92fe64b065d9b7c3a9da1adfeb8a3/implicit/cuda/als.cu#L129
Hi, I find this library really useful and would like to have it support multiple GPUs, since I have a pretty large number of users. Any idea how we could extend this?
This is something I'd like to add sometime soon (especially relevant since I recently started a new job at nvidia =).
Are you interested in multi-GPU support for faster training - or is this because you're running out of GPU memory when trying to store the embeddings for each user? Also, which model are you targeting here - ALS or BPR?
What's your dataset size? (# of users / # of items / # of total interactions between them?)
Hi @benfred, I have an ALS model with more than 40 million users and about 10k items, with approximately 500 million ratings. The maximum memory I can get from one GPU is 16GB, and we have two GPUs on each node. I tried to fit the model on one GPU but wasn't successful. It would be great to split the computation across multiple GPUs.
I have no experience with CUDA, but I think splitting the implicit matrix & its computation across multiple GPUs could be done without changing the current implementation much (correct me if I'm wrong) & I would like to hear your opinion on this :)
Actually, I find implicit great when it comes to having a no-brainer but really fast ALS implementation. I currently use Spark at this scale, but it takes quite a long time to train & predict.
I think reduced-precision user/item factors might help here. If you're going with 64 factors per user, then storing the factors for all users will be around 10GB assuming fp32 - but we can get that down to 5GB with fp16. I've created an issue here to track: https://github.com/benfred/implicit/issues/392
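For a rough sense of the numbers, a quick back-of-the-envelope calculation (a sketch assuming the 40M users and 64 factors mentioned above, nothing implicit-specific):

```python
# back-of-the-envelope: memory needed for the user factor matrix alone
n_users = 40_000_000
n_factors = 64

for name, bytes_per_value in [("fp32", 4), ("fp16", 2)]:
    gb = n_users * n_factors * bytes_per_value / 1e9
    print(f"{name}: {gb:.1f} GB")
# fp32: 10.2 GB
# fp16: 5.1 GB
```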
I think there are some challenges to getting this going on multiple GPUs, but it shouldn't be too hard (a rough host-side sketch follows the list):
- We'll need to enable peer access (cudaDeviceEnablePeerAccess) between each pair of GPUs
- We'll need to partition the factors between GPUs and run training in parallel on each
- Calculating YtY will have to happen independently with cuBLAS on each GPU, and then we'll have to sum the partial results afterwards
- We'll need to slice the input interactions sparse matrix according to how the factors are partitioned. We should also plan for the case where the interactions can't fit in GPU memory (and spill to host memory in that case).
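Here is a rough host-side sketch of the partitioning and slicing described in the last few bullets, using toy numpy/scipy data. The helper name, slice layout, and sizes below are illustrative assumptions, not implicit's actual API, and the device-level pieces (peer access, cuBLAS on each GPU) aren't shown:

```python
import numpy as np
import scipy.sparse as sp

def partition_rows(n_rows, n_gpus):
    """Split row indices into n_gpus contiguous slices (illustrative helper)."""
    bounds = np.linspace(0, n_rows, n_gpus + 1, dtype=np.int64)
    return [slice(int(bounds[i]), int(bounds[i + 1])) for i in range(n_gpus)]

n_gpus = 2
n_users, n_items, n_factors = 1000, 200, 32

# toy confidence matrix (users x items) and item factor matrix
Cui = sp.random(n_users, n_items, density=0.01, format="csr", dtype=np.float32)
Y = np.random.rand(n_items, n_factors).astype(np.float32)

user_slices = partition_rows(n_users, n_gpus)
item_slices = partition_rows(n_items, n_gpus)

# slice the interactions so each GPU only sees the user rows it owns
Cui_shards = [Cui[s] for s in user_slices]
print([shard.shape for shard in Cui_shards])  # [(500, 200), (500, 200)]

# each GPU computes a partial YtY from its own item shard (cuBLAS on device
# in the real implementation); the partials are then summed across GPUs
YtY_partials = [Y[s].T @ Y[s] for s in item_slices]
YtY = sum(YtY_partials)
assert np.allclose(YtY, Y.T @ Y, rtol=1e-4)
```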
There might also need to be some thought about reducing communication between GPUs - like colocating frequently co-occurring items/users together (as Spark MLlib does for its ALS calculation), or caching popular items/users on both GPUs.
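To illustrate the second idea (caching popular items on every GPU), here's a minimal sketch of how the "hot" items might be chosen. The function name, threshold, and toy data are assumptions for illustration, not anything implicit currently does:

```python
import numpy as np
import scipy.sparse as sp

def split_hot_cold(Cui, hot_fraction=0.01):
    """Rank items by interaction count; the top hot_fraction get replicated
    on every GPU, the long tail gets partitioned (illustrative only)."""
    item_counts = Cui.getnnz(axis=0)            # interactions per item
    order = np.argsort(item_counts)[::-1]       # most popular first
    n_hot = max(1, int(len(order) * hot_fraction))
    return order[:n_hot], order[n_hot:]

Cui = sp.random(1000, 200, density=0.02, format="csr", dtype=np.float32)
hot_items, cold_items = split_hot_cold(Cui)
print(len(hot_items), "items replicated on every GPU;",
      len(cold_items), "items partitioned across GPUs")
```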
Have you thought about getting GPUs with more memory? A100s have 40GB on a single GPU =)
@benfred: yes, reducing the number of factors is what we did to fit the model onto our GPU. Getting another GPU is a bit difficult since we rely on Google Cloud Platform.
Off-topic question: could you please recommend any good online course for learning CUDA programming?
@benfred hi, I also have a strong need for multi-GPU support.
- OOM is my motivation for multi-GPU - something like index_cpu_to_all_gpus() in faiss.
- I'm using ALS right now.
- Do you support specifying the data type (e.g. np.float16) in GPU training, since that could reduce memory usage significantly?
- My data size: # of users = 159953032 / # of items = 151408354 / # of total interactions between them = 414543027.
Right now I'm stuck with my overwhelming data size, which prevents me from training on GPU. Any advice?