CUDA.jl
Option for filtering by CUDNN_DETERMINISTIC in `cudnnConvolutionAlgoPerfChoose`
Is your feature request related to a problem? Please describe.
Following from https://discourse.julialang.org/t/flux-reproducibility-of-gpu-experiments/62092, there is currently no way for users of e.g. NNlibCUDA to ensure that convolution operations only use deterministic algorithms.
Describe the solution you'd like
Something along the lines of https://github.com/pytorch/pytorch/blob/6c70cbedb6102da08fe91186d40a41b50991681d/aten/src/ATen/native/cudnn/Conv_v7.cpp#L219-L252. Whether this would need to be plumbed through higher-level functions, set as a global option or exposed through a context manager is left for debate.
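For reference, the gist of the linked PyTorch logic, translated into a minimal Julia sketch: filter the perf results returned by cuDNN's "find algorithm" API on their `determinism` field before picking one. The `determinism` and `status` fields exist on cuDNN's `cudnnConvolutionFwdAlgoPerf_t` struct; the `choose_deterministic` helper and the exact enum spellings under `CUDA.CUDNN` are illustrative assumptions, not existing API.

```julia
using CUDA.CUDNN

# Hypothetical helper: given the perf results from a cuDNN "find
# algorithm" call, keep only algorithms marked CUDNN_DETERMINISTIC.
function choose_deterministic(perfs)
    det = filter(perfs) do p
        p.status == CUDNN.CUDNN_STATUS_SUCCESS &&
        p.determinism == CUDNN.CUDNN_DETERMINISTIC
    end
    isempty(det) && error("no deterministic convolution algorithm available")
    # cuDNN returns perf results sorted by runtime, so take the fastest.
    return first(det)
end
```

Whether this filtering happens unconditionally, behind a flag, or via a caller-supplied predicate is exactly the open design question.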
Describe alternatives you've considered
The only solution for now seems to be pirating `cudnnConvolutionForwardAD` so that it doesn't use `cudnnConvolutionFwdAlgoPerf`?
Additional context
https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#reproducibility states:
...the following routines do not guarantee reproducibility because they use atomic operations:
- cudnnConvolutionBackwardFilter when CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 or CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3 is used
- cudnnConvolutionBackwardData when CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 is used
So I have no clue if/how often these show up in practice in `cudnnConvolution*AlgoPerf`, but I assume they must if users are seeing non-deterministic results?
I saw this has a "good first issue" tag- any hints as to how to do this?
I haven't put much thought into it; a good first step would be to have a good look at what other libraries/frameworks do (although there are a couple of references in the OP already). We also already have a `math_mode`, so maybe the deterministic mode should piggy-back on the pedantic math mode: https://github.com/JuliaGPU/CUDA.jl/blob/056a5269e41d3f447e957b4058bc424683bbbd09/lib/cudnn/CUDNN.jl#L43-L55. If that doesn't work, a separate determinism TLS entry seems like the best way to go (similar to how `math_mode` is implemented).
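A separate determinism TLS entry could look roughly like the following minimal sketch. The function names and storage key are illustrative assumptions, not CUDA.jl's actual internals (which layer task-local state differently); this only shows the shape of a task-local flag.

```julia
# Read the task-local determinism flag (defaults to false).
deterministic_algorithms() =
    get(task_local_storage(), :cudnn_deterministic, false)::Bool

# Set it for the current task.
deterministic_algorithms!(flag::Bool) =
    task_local_storage(:cudnn_deterministic, flag)

# Scoped variant: Base.task_local_storage(f, key, value) sets the key,
# runs f(), and restores the previous value afterwards.
with_deterministic(f) = task_local_storage(f, :cudnn_deterministic, true)
```

The algorithm-selection code would then consult `deterministic_algorithms()` when filtering perf results, the same way the cuDNN wrappers consult `math_mode`.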
This is still an issue that seems to be easily fixable once there is agreement on how to implement a `deterministic` option/flag. I wonder, however, if:
- the deterministic behavior should maybe be the default one? I find it super unintuitive that I use `CUDA.seed!` and still get non-reproducible results (maybe `CUDA.seed!` should implicitly set the `deterministic` flag?),
- setting the `deterministic` flag should invalidate the cache (I guess?), and
- there are situations where no deterministic algorithm is available.
`CUDA.seed!` seeds the RNG, which is unrelated to CUDNN, so I don't see why we should couple those.
Yeah, I agree. It's just that from a scientific computing language, I kinda expect reproducibility by default. At the very least, there should be a clear warning in the docs (but that's more suited for Flux.jl, I guess).
> Yeah, I agree. It's just that from a scientific computing language, I kinda expect reproducibility by default.
The missing context here is that deep learning in general has a much looser notion of "reproducibility" than most other scientific computing domains. This is due to many reasons and could be an entire discussion on its own, but suffice it to say that the slight deviations caused by allowing non-deterministic algorithms are not sufficient to get most deep learning code to stop using them. Indeed, none of PyTorch/TF/JAX filter to only deterministic algorithms by default, and Flux is no exception.
> It's just that from a scientific computing language, I kinda expect reproducibility by default.
We've had this discussion in the past already. Using Julia doesn't guarantee anything, it all depends on the specific domain. In the case of GPU computing, sacrificing some accuracy is the norm just because of how the devices work (threads executing in a non-deterministic fashion, resulting in inaccuracies due to floating-point semantics). We are not going to degrade performance if none of the other software packages in this specific domain (of GPU programming, deep learning, etc) behave like that.
Back to the main issue, the main unanswered question is how the layering works here with so many libraries in the stack. Flux probably shouldn't be involved because we'd like it to be accelerator agnostic.
So that leaves CUDA, NNlib and NNlibCUDA. I think a global flag in CUDA.jl might be awkward because, unlike MATH_MODE, algorithm filtering doesn't neatly map to a single `cu*Set*Mode` call. If `cudnnConvolutionForward!` and friends could take either an algorithm override or a callback to filter algorithms, NNlibCUDA could handle most of this logic. Then NNlib(CUDA) could choose to add global functions to toggle deterministic conv algorithms (i.e. flip a `Ref` value and check that in `conv!`) at its leisure.
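The callback approach above could be sketched as follows. The `algo_filter` keyword and the `find_conv_algorithms`/`run_convolution!` stand-ins are hypothetical names for illustration; none of them exist in CUDA.jl today.

```julia
# Hypothetical plumbing: the cuDNN wrapper accepts a predicate over
# perf results, defaulting to "accept everything".
function conv_forward!(y, x, w; algo_filter = p -> true)
    perfs = find_conv_algorithms(x, w)   # stand-in for the cuDNN "find" call
    perfs = filter(algo_filter, perfs)
    isempty(perfs) && error("no convolution algorithm passed the filter")
    run_convolution!(y, x, w, first(perfs).algo)  # stand-in for the launch
end

# NNlibCUDA could then expose a global toggle backed by a Ref, checked
# from its conv! implementation via the filter:
const DETERMINISTIC = Ref(false)
det_filter(p) = !DETERMINISTIC[] || p.determinism == CUDNN.CUDNN_DETERMINISTIC
```

This keeps CUDA.jl policy-free (it just applies whatever filter it is handed) while letting NNlib(CUDA) decide how the toggle is surfaced to users.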
I noticed that sparse matrix-vector multiplications are also not reproducible. The same matrix-vector multiplication yields a slightly different result each time it is executed:

```julia
using CUDA, SparseArrays
using Random

Random.seed!(546784)

A = cu(sprand(Float64, 1000, 1000, 0.6))  # note: `cu` converts to Float32 by default
x = cu(rand(Float64, 1000))

y1 = A * x
y2 = A * x
maximum(abs.(y2 - y1))  # nonzero: the two products differ
```