RRTMGP.jl
Pass a random seed to the function that generates random numbers
Just to add some details / context to this issue:
RRTMGP's random number generation is not reproducible for unthreaded runs because of this line: https://github.com/CliMA/RRTMGP.jl/blob/89db03b89a5b18fc24d382a81fbf06a628014ae4/src/optics/CloudOptics.jl#L215. For threaded runs, different threads call rand!, mutating the global RNG state shared between threads, and since the order in which each thread processes columns is non-deterministic, so is the sequence of sampled random numbers. To make this reproducible for threaded runs, we'll need to pass in a seed per column.
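To illustrate the per-column seeding idea, here is a minimal sketch (not RRTMGP's actual API; the function name, array layout, and `base_seed` are made up). Each column gets its own RNG built from a base seed plus the column index, so the sampled stream no longer depends on which thread ends up processing which column:

```julia
using Random

# Sketch only: seed a column-local RNG from (base_seed, column index) so the
# sampled values are independent of thread scheduling and of the global RNG.
function sample_per_column!(samples::AbstractMatrix, base_seed::Integer)
    ncol = size(samples, 2)
    Threads.@threads for icol in 1:ncol
        rng = Random.Xoshiro(base_seed + icol)   # column-local state, nothing shared
        rand!(rng, view(samples, :, icol))
    end
    return samples
end
```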
More generally, it would be very useful if we could support reconstructing precisely the state of the random number generator upon restarts, so that the stream of random numbers is the same as if we hadn't restarted the simulation. If this is not possible, we won't be able to use restarts to debug broken builds.
For this, @sriharshakandala mentioned we would need to store the random numbers, which will increase the memory footprint. Would that be ok?
Storing the random numbers will increase the memory footprint by about 2 to 3 orders of magnitude. We can pass in a seed for each column if that helps! Is this preferable?
2 orders of magnitude sounds large and I would rather avoid that. What do others think?
Our goal is to be able to run two identical runs. This requires thread safety (different threads not changing each other's RNG state), so the first step would be to understand the CUDA RNG scheme.
This conversation seems to indicate that the RNG is warp-safe out of the box: https://discourse.julialang.org/t/kernel-random-numbers-generation-entropy-randomness-issues/105637
We use overlay method tables during GPU compilation to replace Random.default_rng() with a custom, GPU-friendly RNG: https://github.com/JuliaGPU/CUDA.jl/blob/2ae53761a6a254b98a6689ed0d39781176b245cf/src/device/random.jl#L97. Similarly, just calling rand() in a kernel works and uses the correct RNG.
Specifically, we use Philox2x32 ("Switch to Philox2x32 for device-side RNG", JuliaGPU/CUDA.jl#882), a counter-based PRNG. The seed is passed from the host, and the counters are maintained per-warp and initialized at the start of each kernel that uses the RNG ("rand: seed kernels from the host", JuliaGPU/CUDA.jl#2035). The implementation isn't fully generic, e.g. you can't have multiple RNG objects, but it's pretty close to how Random.jl works.
We should understand this. Maybe all we have to do is worry about warp vs thread.
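If the quoted behaviour is right, a minimal sanity check on the GPU could look like the sketch below. This assumes CUDA.jl's device-side rand() support and that CUDA.seed! also controls the seed handed to kernels (as PR #2035 suggests); none of this is RRTMGP code.

```julia
using CUDA

# Sketch only: in-kernel rand() uses CUDA.jl's per-warp, counter-based Philox2x32
# state, which is seeded from the host when the kernel starts.
function fill_random!(out)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        @inbounds out[i] = rand(Float32)   # a function of (seed, thread ID, counter)
    end
    return nothing
end

out = CUDA.zeros(Float32, 1024)
CUDA.seed!(1234)   # assumption: this also sets the seed passed to kernels
@cuda threads=256 blocks=4 fill_random!(out)
```

Running this twice with the same seed and the same launch configuration should give identical numbers if the counter-based scheme works as described.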
Second, it would be good to be able to save the RNG state and recover it so that we can support restarts. The details of this will depend on the RNG used.
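On the CPU side at least, saving and restoring the generator state looks straightforward. A rough sketch follows; the file name and the use of Serialization are placeholders, not a proposal for the checkpoint format:

```julia
using Random, Serialization

rng = Random.Xoshiro(2024)
rand(rng, 3)                              # advance the stream, as a simulation would

serialize("rng_state.jls", copy(rng))     # snapshot the full RNG state at checkpoint time
continued = rand(rng, 3)                  # what an uninterrupted run produces next

# ... on restart ...
restored = deserialize("rng_state.jls")
@assert rand(restored, 3) == continued    # the stream resumes exactly where it left off
```

For the device-side counter-based RNG, the analogue would presumably be the seed plus the counter information, which is far smaller than storing the sampled numbers themselves.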
@sriharshakandala Let's fix the reproducibility issue when running two identical runs first, which shouldn't require storing the random numbers. We can talk about restarts after the first issue is fixed.
From the conversation, it looks like passing in a single seed might work! Though the results could still differ from those of the CPU simulation.
maleadt (replying to danielwe, Nov 2023): I think it can be different per warp, but IIRC (it's been a while since I wrote that code) the idea was to use a single seed for all warps, as we offset it using a counter that's based on the global ID of the thread. That's also what happens by default: a single seed is passed from the host and applied from every thread.
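To make the counter-based idea concrete, here is a toy illustration (not CUDA.jl's Philox2x32): the n-th value drawn by a given thread is a pure function of (seed, thread ID, n), so there is no shared mutable state and identical runs reproduce identical streams regardless of scheduling.

```julia
# Toy counter-based generator, for illustration only (not Philox2x32): each value
# is a pure function of (seed, thread_id, counter), so threads share no state.
counter_rand(seed::UInt64, thread_id::Integer, counter::Integer) =
    hash((seed, thread_id, counter)) / typemax(UInt64)   # crude map to [0, 1)

draws(seed) = [counter_rand(seed, t, n) for t in 1:4, n in 1:3]
@assert draws(UInt64(42)) == draws(UInt64(42))   # same seed ⇒ same stream, any order
```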
I just discussed this with Sriharsha. We will modify the code to ensure reproducibility when running two identical runs, without worrying about the restart. After that is done we can explore whether it is feasible to support restart without increasing the memory footprint by too much. @Sbozzolo What do you think?
Yes, this is a good start, but I would like us to think about supporting restarts as well.
I don't think it makes sense for the memory footprint to increase by orders of magnitude: even if we saved one element per point on the domain, it would only be the same size as any other 3D variable. Also, the state only has to be saved when we produce a checkpoint.
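As a rough back-of-the-envelope check with made-up grid sizes (nothing here reflects an actual configuration):

```julia
# Illustrative sizes only: one saved value per point costs the same as one 3D field,
# and one small RNG state (e.g. four 64-bit words) per column costs far less.
ncol, nlay = 10_000, 60                                 # made-up grid
bytes_per_point_state = ncol * nlay * sizeof(UInt64)    # 4_800_000 bytes ≈ 4.8 MB
bytes_3d_field        = ncol * nlay * sizeof(Float64)   # 4_800_000 bytes ≈ 4.8 MB
bytes_per_column_rng  = ncol * 4 * sizeof(UInt64)       #   320_000 bytes ≈ 0.3 MB
```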