torchrec
CUDA error: an illegal memory access was encountered
We are using a DLRM model for personalization and are getting this CUDA error. After setting the CUDA_LAUNCH_BLOCKING flag and enabling CUDA core dumps, the failure pointed to two places where the issue might be happening:
1. torchrec/distributed/embeddingbag.py: input_dist
2. torchrec/sparse/jagged_tensor.py: permute()
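For anyone trying to reproduce this, the debugging setup described above can be sketched roughly as follows. This assumes the variables are set from Python before CUDA is initialized (exporting them in the shell works equally well). `CUDA_LAUNCH_BLOCKING` makes kernel launches synchronous so the error surfaces at the offending call, and `CUDA_ENABLE_COREDUMP_ON_EXCEPTION` asks the driver to write a GPU core dump when a device exception occurs.

```python
import os

# Assumption: these must be set before the process initializes CUDA
# (for example, before the first CUDA call made through torch).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"               # synchronous kernel launches
os.environ["CUDA_ENABLE_COREDUMP_ON_EXCEPTION"] = "1"  # GPU core dump on device exception

# ... import torch / torchrec and run the failing training step after this point ...
```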
Some of our jagged tensors carry weights. When we debug jagged_tensor.py, we see a mismatch between the values (the sum of the permuted lengths per key) and the weights. Could that be the root cause of the CUDA error?
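To make the suspected mismatch concrete, here is a minimal sketch of the consistency check in question. It assumes the usual jagged layout, where the lengths sum to the number of values and, when weights are present, there is exactly one weight per value. The helper name `check_jagged_consistency` is hypothetical and not part of torchrec; it takes plain Python sequences, so tensors would need to be converted first (for example with `.tolist()`).

```python
from typing import List, Optional, Sequence


def check_jagged_consistency(
    lengths: Sequence[int],
    values: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> List[str]:
    """Return a list of human-readable problems; empty if the inputs agree.

    Hypothetical helper for debugging, not a torchrec API.
    """
    problems: List[str] = []

    # Each entry in lengths is the number of values in one bag.
    if any(n < 0 for n in lengths):
        problems.append("lengths contains a negative entry")

    total = sum(lengths)
    if total != len(values):
        problems.append(
            f"sum(lengths)={total} does not match len(values)={len(values)}"
        )

    # Weighted inputs are assumed to carry one weight per value.
    if weights is not None and len(weights) != len(values):
        problems.append(
            f"len(weights)={len(weights)} does not match len(values)={len(values)}"
        )

    return problems


# Consistent input: three bags holding 2, 0 and 1 values.
ok = check_jagged_consistency([2, 0, 1], [10.0, 11.0, 12.0], [0.5, 0.5, 1.0])

# Inconsistent input: one weight is missing.
bad = check_jagged_consistency([2, 0, 1], [10.0, 11.0, 12.0], [0.5, 0.5])
```

Running a check like this on the inputs just before input_dist would show whether the weights are already inconsistent before permute() is called. If they are, an out-of-bounds read in the permute kernel seems like a plausible explanation for the illegal memory access, though we have not confirmed it.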