
Option for filtering by CUDNN_DETERMINISTIC in `cudnnConvolutionAlgoPerfChoose`

Open ToucheSir opened this issue 3 years ago • 9 comments

Is your feature request related to a problem? Please describe.

Following on from https://discourse.julialang.org/t/flux-reproducibility-of-gpu-experiments/62092, there is currently no way for users of e.g. NNlibCUDA to ensure that convolution operations only use deterministic algorithms.

Describe the solution you'd like

Something along the lines of https://github.com/pytorch/pytorch/blob/6c70cbedb6102da08fe91186d40a41b50991681d/aten/src/ATen/native/cudnn/Conv_v7.cpp#L219-L252. Whether this would need to be plumbed through higher-level functions, set as a global option, or exposed through a context manager is up for debate.
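
For illustration, a rough sketch of what such a filter could look like on the CUDA.jl side. choose_deterministic is a made-up helper name, and the code assumes access to the cudnnConvolutionFwdAlgoPerf_t results returned by the low-level find/get APIs (the perf struct carries a determinism field); treat it as a sketch of the idea rather than a working patch:

using CUDA
using CUDA.CUDNN: cudnnConvolutionFwdAlgoPerf_t,
                  CUDNN_STATUS_SUCCESS, CUDNN_DETERMINISTIC

# Pick the fastest successfully-benchmarked algorithm that cuDNN marks
# as deterministic, and fail loudly if there is none.
function choose_deterministic(perfs::Vector{cudnnConvolutionFwdAlgoPerf_t})
    det = [p for p in perfs if p.status == CUDNN_STATUS_SUCCESS &&
                               p.determinism == CUDNN_DETERMINISTIC]
    isempty(det) && error("no deterministic convolution algorithm available")
    return det[argmin([p.time for p in det])]
end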

Describe alternatives you've considered

The only solution right now seems to be pirating cudnnConvolutionForwardAD so that it doesn't use cudnnConvolutionFwdAlgoPerf?

Additional context

https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#reproducibility states:

...the following routines do not guarantee reproducibility because they use atomic operations:

  • cudnnConvolutionBackwardFilter when CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 or CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3 is used
  • cudnnConvolutionBackwardData when CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 is used

I have no idea if or how often these show up in practice in cudnnConvolution*AlgoPerf, but I assume they must, given that users are seeing non-deterministic results?

ToucheSir avatar May 30 '21 17:05 ToucheSir

I saw this has a "good first issue" tag; any hints as to how to do this?

ericphanson avatar May 17 '22 19:05 ericphanson

I haven't put much thought into it; a good first step would be to take a good look at what other libraries/frameworks do (although there are a couple of references in the OP already). We also already have a math_mode, so maybe the deterministic mode could piggy-back on the pedantic math mode: https://github.com/JuliaGPU/CUDA.jl/blob/056a5269e41d3f447e957b4058bc424683bbbd09/lib/cudnn/CUDNN.jl#L43-L55=. If that doesn't work, a separate determinism TLS entry seems like the best way to go (similar to how math_mode is implemented).
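
To make that concrete, here is a hypothetical sketch of a separate determinism TLS entry, mirroring how math_mode is kept per task. None of these names exist in CUDA.jl; this is just the shape of the idea:

# Per-task flag, kept in task-local storage like math_mode
deterministic_algorithms() = get(task_local_storage(), :CUDNN_DETERMINISM, false)

function deterministic_algorithms!(flag::Bool)
    task_local_storage(:CUDNN_DETERMINISM, flag)
    return flag
end

# cudnnConvolutionAlgoPerfChoose would then consult the flag before ranking:
# deterministic_algorithms() &&
#     filter!(p -> p.determinism == CUDNN_DETERMINISTIC, perfs)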

maleadt avatar May 17 '22 21:05 maleadt

This is still an issue, and it seems easily fixable once there is agreement on how to implement a deterministic option/flag. I wonder, however, whether

  • the deterministic behavior should maybe be the default? I find it super unintuitive that I can use CUDA.seed! and still get non-reproducible results (maybe CUDA.seed! should implicitly set the deterministic flag?),
  • setting the deterministic flag should invalidate the cache (I guess?), and
  • there are situations where no deterministic algorithm is available.

marcpabst avatar Sep 10 '22 09:09 marcpabst

CUDA.seed! seeds the RNG, which is unrelated to CUDNN, so I don't see why we should couple those.

maleadt avatar Sep 10 '22 10:09 maleadt

Yeah, I agree. It's just that from a scientific computing language, I kinda expect reproducibility by default. At the very least, there should be a clear warning in the docs (though that's probably better suited for Flux.jl).

marcpabst avatar Sep 10 '22 11:09 marcpabst

Yeah, I agree. It's just that from a scientific computing language, I kinda expect reproducibility by default.

The missing context here is that deep learning in general has a much looser notion of "reproducibility" than most other scientific computing domains. This is due to many reasons and could be an entire discussion on its own, but suffice it to say that the slight deviations caused by allowing non-deterministic algorithms are not enough to make most deep learning code stop using them. Indeed, none of PyTorch/TF/JAX filter to only deterministic algorithms by default, and Flux is no exception.

ToucheSir avatar Sep 11 '22 19:09 ToucheSir

It's just that from a scientific computing language, I kinda expect reproducibility by default.

We've had this discussion in the past already. Using Julia doesn't guarantee anything; it all depends on the specific domain. In the case of GPU computing, sacrificing some accuracy is the norm just because of how the devices work (threads executing in a non-deterministic fashion, resulting in inaccuracies due to floating-point semantics). We are not going to degrade performance when none of the other software packages in this specific domain (GPU programming, deep learning, etc.) do so.

maleadt avatar Sep 11 '22 20:09 maleadt

Back to the main issue: the big unanswered question is how the layering works here with so many libraries in the stack. Flux probably shouldn't be involved, because we'd like it to stay accelerator-agnostic.

So that leaves CUDA, NNlib and NNlibCUDA. I think a global flag in CUDA.jl might be awkward because, unlike MATH_MODE, algorithm filtering doesn't neatly map to a single cu*Set*Mode call. If cudnnConvolutionForward! and friends could take either an algorithm override or a callback to filter algorithms, NNlibCUDA could handle most of this logic. Then NNlib(CUDA) could choose to add global functions to toggle deterministic conv algorithms (i.e. flip a Ref value and check that in conv!) at its leisure.
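
Roughly, and with entirely made-up names (algo_filter is not an existing keyword of cudnnConvolutionForward!, and deterministic_conv! doesn't exist anywhere), the NNlibCUDA side of that split could look like:

# Module-level toggle that NNlibCUDA could expose
const DETERMINISTIC_CONV = Ref(false)

deterministic_conv!(flag::Bool) = (DETERMINISTIC_CONV[] = flag)

# conv! would then forward a predicate to the CUDA.jl wrapper, e.g.:
# cudnnConvolutionForward!(y, w, x; algo_filter = perf ->
#     !DETERMINISTIC_CONV[] || perf.determinism == CUDNN_DETERMINISTIC)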

ToucheSir avatar Sep 11 '22 20:09 ToucheSir

I noticed that sparse matrix-vector multiplications are also not reproducible: the same multiplication yields a different result every time it is executed:

using CUDA
using Random
using SparseArrays  # provides sprand

Random.seed!(546784)

# A 60%-dense random sparse matrix and a dense vector, moved to the GPU
A = cu(sprand(Float64, 1000, 1000, 0.6))
x = cu(rand(Float64, 1000))

# Two runs of the exact same multiplication ...
y1 = A * x
y2 = A * x

# ... differ, so the result is not bitwise reproducible
maximum(abs.(y2 - y1))

renatobellotti avatar Sep 15 '22 12:09 renatobellotti