
Memory Issue with torchsort.soft_rank on CUDA

Open liuquant opened this issue 5 months ago • 6 comments

I encountered a CUDA memory error when using `torchsort.soft_rank` during parallel training on the GPU. The error message is as follows:

```
File "/home/xxx/anaconda3/envs/DL2/lib/python3.10/site-packages/torchsort/ops.py", line 121, in backward
    ).gather(1, inv_permutation)
RuntimeError: CUDA error: an illegal memory access was encountered
```

However, when I switched the code to run `torchsort.soft_rank` on the CPU, while keeping the rest of the code on the GPU, the error disappeared.

For example, when I modify the code like this:

```python
if pred_2d.device != 'cpu':
    pred_2d = pred_2d.to('cpu')
rank_2d = torchsort.soft_rank(pred_2d, regularization_strength=0.1)
rank_2d = rank_2d.to(y_true.device)
```

the error is resolved. But if I directly run:

```python
rank_2d = torchsort.soft_rank(pred_2d, regularization_strength=0.1)
```

the error occurs again.
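The CPU round-trip above can be wrapped in a small helper so every call site applies the same workaround. This is a minimal sketch, not part of torchsort itself; the function name `soft_rank_cpu_fallback` and the use of a pluggable `fn` argument (so it works with any ranking function, e.g. `torchsort.soft_rank`) are my own assumptions:

```python
import torch  # assumption: PyTorch is installed


def soft_rank_cpu_fallback(fn, x, **kwargs):
    """Run a ranking function on CPU and move the result back to x's device.

    Workaround sketch for the CUDA illegal-memory-access seen in the
    backward pass: `fn` would typically be `torchsort.soft_rank`, and
    computing on CPU sidesteps the faulty CUDA path at the cost of
    host<->device copies on every call.
    """
    out = fn(x.cpu(), **kwargs)       # compute on CPU
    return out.to(x.device)           # restore the original device
```

A call site would then read `rank_2d = soft_rank_cpu_fallback(torchsort.soft_rank, pred_2d, regularization_strength=0.1)`, keeping the rest of the training loop on the GPU unchanged.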

Could you provide guidance on how to resolve this issue when using torchsort.soft_rank with CUDA? Thank you so much!

liuquant avatar Aug 31 '24 00:08 liuquant