Fix an NPE issue in GpuRand

Open firestarman opened this issue 1 year ago • 2 comments

close https://github.com/NVIDIA/spark-rapids/issues/11646

curXORShiftRandomSeed is marked as transient, so it is null on executors when there is no retry-restore context, which leads to this NPE. This fix removes the transient marker from curXORShiftRandomSeed, seed, and previousPartition, all of which are used by the computation on executors.
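
For illustration, here is a minimal sketch of the underlying JVM behavior, not the GpuRand code itself. It assumes plain Java serialization, and the class and field names (TransientSeedDemo, TransientSeed, PlainSeed, curSeed) are hypothetical. A transient field is skipped when the object is serialized, so it comes back as null after deserialization, which is how a driver-side value can be missing on an executor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientSeedDemo {
    // Hypothetical stand-in for an expression that keeps its seed in a transient field.
    static class TransientSeed implements Serializable {
        private static final long serialVersionUID = 1L;
        transient Long curSeed; // skipped by Java serialization
        TransientSeed(long seed) { this.curSeed = seed; }
    }

    // Same shape, but the field is not transient, so it is serialized with the object.
    static class PlainSeed implements Serializable {
        private static final long serialVersionUID = 1L;
        Long curSeed;
        PlainSeed(long seed) { this.curSeed = seed; }
    }

    // Serialize and deserialize in memory, roughly what happens when an
    // object is shipped from one JVM to another.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T obj)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        TransientSeed t = roundTrip(new TransientSeed(42L));
        PlainSeed p = roundTrip(new PlainSeed(42L));
        System.out.println("transient field after round trip: " + t.curSeed); // null
        System.out.println("plain field after round trip: " + p.curSeed);     // 42
    }
}
```

Unboxing or dereferencing the null curSeed on the receiving side would throw a NullPointerException, which matches the failure mode described above.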

I verified the fix against the customer case, and it works well. The fix is simple, so I did not add any tests.

firestarman avatar Oct 23 '24 01:10 firestarman

build

firestarman avatar Oct 23 '24 02:10 firestarman

I spoke with @jlowe and I think we really want to understand this better. https://github.com/NVIDIA/spark-rapids/issues/11649

The problem is that if a retry happens outside of a checkpoint/restore, we will technically get data corruption. It is not 100% data corruption because the output is random, so we just get a slightly different random number than Spark on the CPU would produce, which is the only reason I am not blocking this from going in. But I really want to understand the situation where this happened.
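
To make the concern concrete, here is a minimal sketch under stated assumptions. It uses a hypothetical xorshift generator (RetryRngDemo and XorShift are illustrative names, not the GpuRand or Spark implementation). A retry that does not restore the generator state continues from wherever the failed attempt left off, so it produces different values than a single clean run would.

```java
import java.util.Arrays;

public class RetryRngDemo {
    // Hypothetical minimal xorshift64 generator with explicit checkpoint/restore.
    static class XorShift {
        private long state;
        XorShift(long seed) { this.state = (seed == 0) ? 1 : seed; }
        long checkpoint() { return state; }          // save the current state
        void restore(long saved) { state = saved; }  // roll back to a saved state
        long next() {
            state ^= state << 13;
            state ^= state >>> 7;
            state ^= state << 17;
            return state;
        }
    }

    // Produce one batch of n values, advancing the generator.
    static long[] batch(XorShift rng, int n) {
        long[] out = new long[n];
        for (int i = 0; i < n; i++) {
            out[i] = rng.next();
        }
        return out;
    }

    // The first attempt consumes values and is assumed to fail; the retry
    // runs without restoring the state.
    static long[] retryWithoutRestore(long seed, int n) {
        XorShift rng = new XorShift(seed);
        batch(rng, n);
        return batch(rng, n);
    }

    // Same scenario, but the state is checkpointed before the first attempt
    // and restored before the retry.
    static long[] retryWithRestore(long seed, int n) {
        XorShift rng = new XorShift(seed);
        long saved = rng.checkpoint();
        batch(rng, n);
        rng.restore(saved);
        return batch(rng, n);
    }

    public static void main(String[] args) {
        long[] expected = batch(new XorShift(7L), 4);
        System.out.println("with restore matches clean run: "
            + Arrays.equals(expected, retryWithRestore(7L, 4)));
        System.out.println("without restore matches clean run: "
            + Arrays.equals(expected, retryWithoutRestore(7L, 4)));
    }
}
```

In this sketch only the restored retry reproduces the clean run. The unrestored retry still yields valid random values, just not the ones a deterministic run from the same seed would produce.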

revans2 avatar Oct 23 '24 15:10 revans2