torchrec

CUDA error: an illegal memory access was encountered

Open ali-fani-sd opened this issue 7 months ago • 0 comments

We are using the DLRM model for personalization and are getting a CUDA error. After setting the CUDA_LAUNCH_BLOCKING flag and enabling CUDA core dumps, the error pointed to two places where the issue might be happening:

1. torchrec/distributed/embeddingbag.py: input_dist
2. torchrec/sparse/jagged_tensor.py: permute()
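For anyone trying to reproduce the debugging setup, here is a minimal sketch of how the two settings can be enabled from Python. `CUDA_LAUNCH_BLOCKING` and `CUDA_ENABLE_COREDUMP_ON_EXCEPTION` are standard CUDA environment variables; the helper name `enable_cuda_debug_env` is hypothetical, and the assumption is that it runs before anything initializes CUDA (in practice, before `import torch`).

```python
import os
from typing import Dict


def enable_cuda_debug_env() -> Dict[str, str]:
    """Set CUDA debugging environment variables for this process.

    Hypothetical helper: it must run before the CUDA context is created,
    otherwise the settings have no effect on the current process.
    """
    settings = {
        # Make kernel launches synchronous so the Python stack trace
        # points at the call that actually failed.
        "CUDA_LAUNCH_BLOCKING": "1",
        # Ask the CUDA driver to write a GPU core dump on a device exception.
        "CUDA_ENABLE_COREDUMP_ON_EXCEPTION": "1",
    }
    os.environ.update(settings)
    return settings


if __name__ == "__main__":
    applied = enable_cuda_debug_env()
    for name, value in applied.items():
        print(f"{name}={value}")
```

Setting the same variables in the shell before launching the training job should work equally well. Note that synchronous launches slow training down noticeably, so this is only meant for debugging runs.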

Some of our JaggedTensors use weights. When we debug jagged_tensor.py, we see a mismatch between the values (the sum of the permuted lengths per key) and the weights. Do you think that could be the root cause of the CUDA error?
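To make the mismatch described above concrete, here is a minimal, dependency-free sketch of the consistency check we have in mind. It assumes the usual key-major layout, where `lengths` holds one entry per (key, sample) pair, and that a weighted tensor should carry exactly one weight per value. The helper names `length_per_key` and `find_mismatches` are hypothetical and not part of torchrec.

```python
from typing import List, Optional, Sequence


def length_per_key(num_keys: int, lengths: Sequence[int]) -> List[int]:
    """Sum the per-sample lengths for each key (key-major layout assumed)."""
    if num_keys <= 0 or len(lengths) % num_keys != 0:
        raise ValueError("len(lengths) must be a positive multiple of num_keys")
    stride = len(lengths) // num_keys  # number of samples per key
    return [sum(lengths[k * stride:(k + 1) * stride]) for k in range(num_keys)]


def find_mismatches(
    num_keys: int,
    lengths: Sequence[int],
    num_values: int,
    num_weights: Optional[int] = None,
) -> List[str]:
    """Return human-readable descriptions of any size inconsistencies."""
    problems: List[str] = []
    total = sum(length_per_key(num_keys, lengths))
    if total != num_values:
        problems.append(f"sum(lengths)={total} but len(values)={num_values}")
    if num_weights is not None and num_weights != num_values:
        problems.append(f"len(weights)={num_weights} but len(values)={num_values}")
    return problems


if __name__ == "__main__":
    # Two keys, two samples each: key 0 has lengths [2, 0], key 1 has [1, 3].
    lengths = [2, 0, 1, 3]
    print(length_per_key(2, lengths))
    # Six values but only five weights: the kind of mismatch we are seeing.
    print(find_mismatches(2, lengths, num_values=6, num_weights=5))
```

With a real `KeyedJaggedTensor` named `kjt`, the inputs would presumably come from `len(kjt.keys())`, `kjt.lengths().tolist()`, `kjt.values().numel()`, and `kjt.weights_or_none()`. Running a check like this on the CPU before `input_dist` would show whether malformed inputs reach `permute()`, where an out-of-bounds read on the weights could plausibly surface as an illegal memory access.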

ali-fani-sd, May 07 '25 22:05