torchsort
Memory Issue with torchsort.soft_rank on CUDA
I encountered a CUDA memory error when using `torchsort.soft_rank` during parallel training on the GPU. The error message is as follows:

```
File "/home/xxx/anaconda3/envs/DL2/lib/python3.10/site-packages/torchsort/ops.py", line 121, in backward
    ).gather(1, inv_permutation)
RuntimeError: CUDA error: an illegal memory access was encountered
```
However, when I moved the `torchsort.soft_rank` call to the CPU, while keeping the rest of the code on the GPU, the error disappeared.
For example, when I modify the code like this:

```python
# Move the input to the CPU before calling soft_rank, then move the result back.
if pred_2d.device.type != 'cpu':
    pred_2d = pred_2d.to('cpu')
rank_2d = torchsort.soft_rank(pred_2d, regularization_strength=0.1)
rank_2d = rank_2d.to(y_true.device)
```
the error is resolved. But if I run it directly on the GPU:

```python
rank_2d = torchsort.soft_rank(pred_2d, regularization_strength=0.1)
```

the error occurs again.
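For now I have wrapped this workaround in a small helper so it can be reused wherever `soft_rank` is called. This is only a sketch: `run_on_cpu` is a name I made up, and the `FakeTensor` class below is a stand-in with the same `.device`/`.to()` interface so the snippet runs without torch installed; with real tensors you would pass `torchsort.soft_rank` (or a `functools.partial` of it) as `fn`.

```python
def run_on_cpu(fn, x):
    """Run fn on a CPU copy of x, then move the result back to x's device.

    Workaround for ops whose CUDA kernel misbehaves: anything with a
    torch.Tensor-like .device attribute and .to() method will work.
    """
    original_device = x.device
    result = fn(x.to("cpu"))
    return result.to(original_device)


class FakeTensor:
    """Minimal stand-in for torch.Tensor, used here only so the example runs."""

    def __init__(self, data, device="cuda:0"):
        self.data = data
        self.device = device

    def to(self, device):
        # Returns a "copy" on the requested device, like torch.Tensor.to.
        return FakeTensor(self.data, device)


# A toy fn that stands in for torchsort.soft_rank: it sorts the payload.
ranked = run_on_cpu(lambda t: FakeTensor(sorted(t.data), t.device),
                    FakeTensor([3, 1, 2]))
print(ranked.device, ranked.data)  # prints: cuda:0 [1, 2, 3]
```

This keeps the CPU round-trip in one place, but it obviously adds a device transfer per call, so a proper fix on the CUDA side would still be much better.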
Could you provide guidance on how to resolve this issue when using `torchsort.soft_rank` with CUDA? Thank you so much!