spark-rapids
Fix an NPE issue in GpuRand
Closes https://github.com/NVIDIA/spark-rapids/issues/11646
curXORShiftRandomSeed is marked as transient, so it is not serialized and will be null on executors that have no retry-restore context, which leads to this NPE.
This fix removes the transient marker from curXORShiftRandomSeed, seed, and previousPartition, all of which are used by the computation on executors.
I verified the fix against the customer case, and it works well. Because the fix is simple, I did not add any tests.
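The failure mode described above can be reproduced outside Spark with plain JVM serialization. The following is a minimal sketch in Java, not the GpuRand code: the class RandLike and its fields transientSeed and keptSeed are hypothetical stand-ins, and a java.io round trip is assumed to approximate shipping an expression from the driver to an executor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientSeedDemo {
    // Hypothetical stand-in for an expression object sent to an executor.
    static class RandLike implements Serializable {
        private static final long serialVersionUID = 1L;

        // Skipped by Java serialization; reads as null after deserialization.
        transient Long transientSeed;
        // Serialized normally, so the value survives the round trip.
        Long keptSeed;

        RandLike(long seed) {
            this.transientSeed = seed;
            this.keptSeed = seed;
        }
    }

    // Serialize to a byte array and read the object back, as a rough
    // approximation of a driver-to-executor hop.
    static RandLike roundTrip(RandLike in) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(in);
        }
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (RandLike) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        RandLike copy = roundTrip(new RandLike(42L));
        System.out.println("transient field after round trip: " + copy.transientSeed);
        System.out.println("regular field after round trip:   " + copy.keptSeed);
    }
}
```

In this sketch the transient field comes back as null while the regular field keeps its value, so any code that dereferences the transient field on the receiving side throws a NullPointerException unless something re-initializes it first.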
build
I spoke with @jlowe and I think we really want to understand this better. https://github.com/NVIDIA/spark-rapids/issues/11649
The problem is that if a retry happens outside of a checkpoint/restore, we will technically get data corruption. It is not outright data corruption because the value is random, so we just get a slightly different random number than Spark on the CPU would produce, which is the only reason I am not blocking this from going in. But I really want to understand the situation where this happened.
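To make the concern concrete, here is a simplified Java sketch of why a retry without a restore yields different numbers. It is an illustration under assumptions, not the plugin's retry framework: java.util.Random stands in for the XORShift generator, and the helper names nextBatch, retryWithoutRestoreMatches, and retryWithRestoreMatches are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

public class RetrySeedDemo {
    // Draw n values, advancing the generator's internal state.
    static long[] nextBatch(Random rng, int n) {
        long[] out = new long[n];
        for (int i = 0; i < n; i++) {
            out[i] = rng.nextLong();
        }
        return out;
    }

    // Retry that reuses the already-advanced generator (no restore):
    // the second attempt continues from where the first one stopped.
    static boolean retryWithoutRestoreMatches(long seed, int n) {
        Random rng = new Random(seed);
        long[] firstAttempt = nextBatch(rng, n);
        long[] retry = nextBatch(rng, n);
        return Arrays.equals(firstAttempt, retry);
    }

    // Retry that rebuilds the generator from a saved seed (checkpoint/restore):
    // the second attempt replays exactly the same sequence.
    static boolean retryWithRestoreMatches(long seed, int n) {
        long checkpoint = seed;
        long[] firstAttempt = nextBatch(new Random(checkpoint), n);
        long[] retry = nextBatch(new Random(checkpoint), n);
        return Arrays.equals(firstAttempt, retry);
    }

    public static void main(String[] args) {
        System.out.println("without restore, retry matches: "
            + retryWithoutRestoreMatches(42L, 4));
        System.out.println("with restore, retry matches:    "
            + retryWithRestoreMatches(42L, 4));
    }
}
```

The values produced without a restore are still valid random numbers, but they no longer correspond to what a single uninterrupted run with the same seed would have produced.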