Flux.jl
`Dropout` layer not working with CUDA
Whenever I try to train a model on the GPU with a `Dropout` layer, training fails with the error message pasted below.
At first I thought it was a problem related to explicitly setting a seed for random procedures (layer initialization, dataset splitting, etc.), but I have now removed all seed specifications and the problem persists. I then noticed that the problem emerges when using a `DataLoader` during the training phase: I hit the same error even with something as simple as `rand(3, 10)` and `rand(1, 10)` as the data and labels of the `DataLoader`, and a simple model like `m = Chain(Dense(3, 1), Dropout(0.2))`. Is this a general problem with `Dropout` + `DataLoader` on GPU?
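For reference, a minimal sketch of the setup described above (untested here, since it needs a CUDA-capable GPU; the `rand` arrays are the dummy data mentioned, and the batch size is an arbitrary choice):

```julia
using Flux, CUDA

# Dummy data and labels, as described above, moved to the GPU.
x = rand(Float32, 3, 10) |> gpu
y = rand(Float32, 1, 10) |> gpu
loader = Flux.DataLoader((x, y), batchsize = 5)

# Simple model with a Dropout layer, also on the GPU.
m = Chain(Dense(3, 1), Dropout(0.2)) |> gpu
loss(x, y) = Flux.mse(m(x), y)

# The error surfaces here, when Dropout's rand! kernel gets compiled
# inside the gradient computation.
Flux.train!(loss, Flux.params(m), loader, Descent(0.1))
```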
ERROR: InvalidIRError: compiling kernel rand!(CuDeviceMatrix{Float32, 1}, UInt32, UInt32) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to CUDA.Philox2x32{R}() where R in CUDA at /home/fabio/.julia/packages/CUDA/tTK8Y/src/device/random.jl:46)
Stacktrace:
[1] Philox2x32
@ ~/.julia/packages/CUDA/tTK8Y/src/device/random.jl:62
[2] #default_rng
@ ~/.julia/packages/CUDA/tTK8Y/src/device/random.jl:95
[3] kernel
@ ~/.julia/packages/CUDA/tTK8Y/src/random.jl:41
Reason: unsupported dynamic function invocation (call to rand(rng::Random.AbstractRNG, ::Type{X}) where X in Random at /usr/share/julia/stdlib/v1.7/Random/src/Random.jl:257)
Stacktrace:
[1] kernel
@ ~/.julia/packages/CUDA/tTK8Y/src/random.jl:53
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code
Stacktrace:
[1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{CUDA.var"#kernel#320", Tuple{CuDeviceMatrix{Float32, 1}, UInt32, UInt32}}}, args::LLVM.Module)
@ GPUCompiler ~/.julia/packages/GPUCompiler/iaKrd/src/validation.jl:139
[2] macro expansion
@ ~/.julia/packages/GPUCompiler/iaKrd/src/driver.jl:414 [inlined]
[3] macro expansion
@ ~/.julia/packages/TimerOutputs/jgSVI/src/TimerOutput.jl:252 [inlined]
[4] macro expansion
@ ~/.julia/packages/GPUCompiler/iaKrd/src/driver.jl:412 [inlined]
[5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
@ GPUCompiler ~/.julia/packages/GPUCompiler/iaKrd/src/utils.jl:64
[6] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
@ CUDA ~/.julia/packages/CUDA/tTK8Y/src/compiler/execution.jl:354
[7] #224
@ ~/.julia/packages/CUDA/tTK8Y/src/compiler/execution.jl:347 [inlined]
[8] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{CUDA.var"#kernel#320", Tuple{CuDeviceMatrix{Float32, 1}, UInt32, UInt32}}}})
@ GPUCompiler ~/.julia/packages/GPUCompiler/iaKrd/src/driver.jl:74
[9] cufunction_compile(job::GPUCompiler.CompilerJob)
@ CUDA ~/.julia/packages/CUDA/tTK8Y/src/compiler/execution.jl:346
[10] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/iaKrd/src/cache.jl:90
[11] cufunction(f::CUDA.var"#kernel#320", tt::Type{Tuple{CuDeviceMatrix{Float32, 1}, UInt32, UInt32}}; name::String, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA ~/.julia/packages/CUDA/tTK8Y/src/compiler/execution.jl:299
[12] macro expansion
@ ~/.julia/packages/CUDA/tTK8Y/src/compiler/execution.jl:102 [inlined]
[13] rand!(rng::CUDA.RNG, A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
@ CUDA ~/.julia/packages/CUDA/tTK8Y/src/random.jl:62
[14] _dropout_mask(rng::CUDA.RNG, x::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, p::Float64; dims::Function)
@ Flux ~/.julia/packages/Flux/KkC79/src/layers/normalise.jl:45
[15] #dropout_mask#318
@ ~/.julia/packages/Flux/KkC79/src/layers/normalise.jl:39 [inlined]
[16] chain_rrule_kw
@ ~/.julia/packages/Zygote/IoW2g/src/compiler/chainrules.jl:229 [inlined]
[17] macro expansion
@ ~/.julia/packages/Zygote/IoW2g/src/compiler/interface2.jl:0 [inlined]
[18] _pullback
@ ~/.julia/packages/Zygote/IoW2g/src/compiler/interface2.jl:9 [inlined]
[19] _pullback
@ ~/.julia/packages/Flux/KkC79/src/layers/normalise.jl:34 [inlined]
[20] _pullback(::Zygote.Context, ::Flux.var"##dropout#316", ::Colon, ::Bool, ::typeof(Flux.dropout), ::CUDA.RNG, ::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::Float64)
@ Zygote ~/.julia/packages/Zygote/IoW2g/src/compiler/interface2.jl:0
[21] _pullback
@ ~/.julia/packages/Flux/KkC79/src/layers/normalise.jl:33 [inlined]
[22] _pullback(::Zygote.Context, ::Flux.var"#dropout##kw", ::NamedTuple{(:dims, :active), Tuple{Colon, Bool}}, ::typeof(Flux.dropout), ::CUDA.RNG, ::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::Float64)
@ Zygote ~/.julia/packages/Zygote/IoW2g/src/compiler/interface2.jl:0
[23] _pullback
@ ~/.julia/packages/Flux/KkC79/src/layers/normalise.jl:111 [inlined]
[24] _pullback(ctx::Zygote.Context, f::Dropout{Float64, Colon, CUDA.RNG}, args::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
@ Zygote ~/.julia/packages/Zygote/IoW2g/src/compiler/interface2.jl:0
[25] macro expansion
@ ~/.julia/packages/Flux/KkC79/src/layers/basic.jl:53 [inlined]
[26] _pullback
@ ~/.julia/packages/Flux/KkC79/src/layers/basic.jl:53 [inlined]
[27] _pullback(::Zygote.Context, ::typeof(Flux._applychain), ::Tuple{Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, BatchNorm{typeof(relu), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dropout{Float64, Colon, CUDA.RNG}}, ::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
@ Zygote ~/.julia/packages/Zygote/IoW2g/src/compiler/interface2.jl:0
[28] _pullback
@ ~/.julia/packages/Flux/KkC79/src/layers/basic.jl:51 [inlined]
[29] macro expansion
@ ~/.julia/packages/Flux/KkC79/src/layers/basic.jl:53 [inlined]
[30] _pullback
@ ~/.julia/packages/Flux/KkC79/src/layers/basic.jl:53 [inlined]
[31] _pullback(::Zygote.Context, ::typeof(Flux._applychain), ::Tuple{Chain{Tuple{EmbeddingLayer{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, typeof(Flux.flatten), Flux.Recur{Flux.LSTMCell{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}}, Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}}, Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, Chain{Tuple{Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, BatchNorm{typeof(relu), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dropout{Float64, Colon, CUDA.RNG}}}}, ::CuArray{Int64, 2, CUDA.Mem.DeviceBuffer})
@ Zygote ~/.julia/packages/Zygote/IoW2g/src/compiler/interface2.jl:0
[32] _pullback
@ ~/.julia/packages/Flux/KkC79/src/layers/basic.jl:51 [inlined]
[33] _pullback(ctx::Zygote.Context, f::Chain{Tuple{Chain{Tuple{EmbeddingLayer{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, typeof(Flux.flatten), Flux.Recur{Flux.LSTMCell{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}}, Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}}, Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, Chain{Tuple{Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, BatchNorm{typeof(relu), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dropout{Float64, Colon, CUDA.RNG}}}}}, args::CuArray{Int64, 2, CUDA.Mem.DeviceBuffer})
@ Zygote ~/.julia/packages/Zygote/IoW2g/src/compiler/interface2.jl:0
[34] _pullback
@ ./REPL[347]:26 [inlined]
[35] _pullback(ctx::Zygote.Context, f::var"#l#45"{Chain{Tuple{Chain{Tuple{EmbeddingLayer{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, typeof(Flux.flatten), Flux.Recur{Flux.LSTMCell{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}}, Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}}, Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, Chain{Tuple{Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, BatchNorm{typeof(relu), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dropout{Float64, Colon, CUDA.RNG}}}}}}, args::NamedTuple{(:data, :label), Tuple{CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}, CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}}})
@ Zygote ~/.julia/packages/Zygote/IoW2g/src/compiler/interface2.jl:0
[36] _apply
@ ./boot.jl:814 [inlined]
[37] adjoint
@ ~/.julia/packages/Zygote/IoW2g/src/lib/lib.jl:204 [inlined]
[38] _pullback
@ ~/.julia/packages/ZygoteRules/AIbCs/src/adjoint.jl:65 [inlined]
[39] _pullback
@ ~/.julia/packages/Flux/KkC79/src/optimise/train.jl:120 [inlined]
[40] _pullback(::Zygote.Context, ::Flux.Optimise.var"#37#40"{var"#l#45"{Chain{Tuple{Chain{Tuple{EmbeddingLayer{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, typeof(Flux.flatten), Flux.Recur{Flux.LSTMCell{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}}, Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}}, Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, Chain{Tuple{Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, BatchNorm{typeof(relu), CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dropout{Float64, Colon, CUDA.RNG}}}}}}, NamedTuple{(:data, :label), Tuple{CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}, CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}}}})
@ Zygote ~/.julia/packages/Zygote/IoW2g/src/compiler/interface2.jl:0
[41] pullback(f::Function, ps::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}})
@ Zygote ~/.julia/packages/Zygote/IoW2g/src/compiler/interface.jl:352
[42] gradient(f::Function, args::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}})
@ Zygote ~/.julia/packages/Zygote/IoW2g/src/compiler/interface.jl:75
[43] macro expansion
@ ~/.julia/packages/Flux/KkC79/src/optimise/train.jl:119 [inlined]
[44] macro expansion
@ ~/.julia/packages/ProgressLogging/6KXlp/src/ProgressLogging.jl:328 [inlined]
[45] train!(loss::Function, ps::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}}, data::DataLoader{NamedTuple{(:data, :label), Tuple{CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}, CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}}}, Random.TaskLocalRNG, Val{nothing}}, opt::Adam; cb::Flux.var"#throttled#122"{Flux.var"#throttled#118#123"{Bool, Bool, var"#44#46"{CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}, CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}}, Int64}})
@ Flux.Optimise ~/.julia/packages/Flux/KkC79/src/optimise/train.jl:117
[46] train_autoencoder(train_set::CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}, test_set::CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}, max_features::Int64, vocab_size::Int64, pad_size::Int64, model_type::Int64; kws::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Main ./REPL[347]:33
[47] train_autoencoder(train_set::CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}, test_set::CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}, max_features::Int64, vocab_size::Int64, pad_size::Int64, model_type::Int64)
@ Main ./REPL[347]:2
[48] top-level scope
@ REPL[348]:1
[49] top-level scope
@ ~/.julia/packages/CUDA/tTK8Y/src/initialization.jl:52
What version of CUDA.jl are you on and what CUDA toolkit? This looks like https://github.com/FluxML/Flux.jl/issues/2018.
Can we produce a MWE using only CUDA.jl?
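Judging from the stack trace, the innermost failing call is `rand!(::CUDA.RNG, ::CuArray{Float32, 2})`, so a CUDA.jl-only sketch along these lines (untested, GPU required) might hit the same kernel-compilation failure without involving Flux at all:

```julia
using CUDA, Random

# Same shapes and RNG type as in the stack trace.
A = CUDA.zeros(Float32, 3, 10)
Random.rand!(CUDA.RNG(), A)  # should trigger the same rand! kernel compilation
```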
Sorry, I meant to include the CUDA and Flux versions but forgot. Here they are:
- CUDA v3.11.0
- Flux v0.13.4
I also saw #2018, but the stack trace looked different, so I thought I had stumbled on a different error or at least another corner case. If that's not the case, forgive me!
Ah, this looks like https://github.com/JuliaGPU/CUDA.jl/issues/1508. I believe Flux + CUDA should play well together, but if you're loading other packages, they may trigger the heuristic mentioned in that issue.