
Excessive allocations when running on multiple threads

mfalt opened this issue 3 years ago · 5 comments

I'm trying to run a neural network (using Flux) on a production server. I have no problems when running everything in one thread, but memory allocations on the GPU start increasing when I put it behind a Mux.jl server, and eventually I get a CUDA out-of-memory error (or another CUDA-related error). This happens regardless of whether I run the requests serially or in parallel, even with a lock around the GPU computations.
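For reference, the pattern on the server is roughly the following (a hypothetical sketch, not the actual production code; GPU_LOCK and handle_request are made-up names, and the Mux.jl routing is omitted):

const GPU_LOCK = ReentrantLock()

# Every request funnels its GPU work through this one lock, so the GPU code
# never runs concurrently, yet the allocations still grow.
function handle_request(imgs)
    lock(GPU_LOCK) do
        inference(imgs)  # the same inference function as in the MWE below
    end
end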

The following example runs without problems when Threads.@spawn is removed, and never allocates more than 1367 MiB on the GPU. However, when the computations are run in a separate thread, as in this example, it keeps allocating memory on every call until it eventually crashes.

MWE:

using CUDA

import CUDA.CUDNN: cudnnConvolutionForward

const W1 = cu(randn(5,5,3,6))

function inference(imgs)
    out = cudnnConvolutionForward(W1, imgs)
    return maximum(Array(out))
end

imgs = cu(randn(28,28,3,1000))

N = 100
# Run this a few times
for j in 1:10
    res = Any[nothing for i in 1:N]
    for i in 1:N
    res[i] = fetch(Threads.@spawn inference(imgs))  # spawn a fresh task for every call and wait for its result
    end
    s = sum(res)
    println(s)
end
julia> ERROR: LoadError: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:334 [inlined]
 [2] fetch(t::Task)
   @ Base ./task.jl:349
 [3] top-level scope
   @ ~/cudatest/cuda2.jl:18

    nested task error: CUDNNError: CUDNN_STATUS_INTERNAL_ERROR (code 4)
    Stacktrace:
      [1] throw_api_error(res::CUDA.CUDNN.cudnnStatus_t)
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/error.jl:22
      [2] macro expansion
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/error.jl:35 [inlined]
      [3] cudnnCreate()
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/base.jl:3
      [4] #1218
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/CUDNN.jl:72 [inlined]
      [5] (::CUDA.APIUtils.var"#8#11"{CUDA.CUDNN.var"#1218#1225", CUDA.APIUtils.HandleCache{CuContext, Ptr{Nothing}}, CuContext})()
        @ CUDA.APIUtils ~/.julia/packages/CUDA/Axzxe/lib/utils/cache.jl:24
      [6] lock(f::CUDA.APIUtils.var"#8#11"{CUDA.CUDNN.var"#1218#1225", CUDA.APIUtils.HandleCache{CuContext, Ptr{Nothing}}, CuContext}, l::ReentrantLock)
        @ Base ./lock.jl:190
      [7] (::CUDA.APIUtils.var"#check_cache#9"{CUDA.APIUtils.HandleCache{CuContext, Ptr{Nothing}}, CuContext})(f::CUDA.CUDNN.var"#1218#1225")
        @ CUDA.APIUtils ~/.julia/packages/CUDA/Axzxe/lib/utils/cache.jl:22
      [8] pop!
        @ ~/.julia/packages/CUDA/Axzxe/lib/utils/cache.jl:46 [inlined]
      [9] (::CUDA.CUDNN.var"#new_state#1224")(cuda::NamedTuple{(:device, :context, :stream, :math_mode, :math_precision), Tuple{CuDevice, CuContext, CuStream, CUDA.MathMode, Symbol}})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/CUDNN.jl:71
     [10] #1222
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/CUDNN.jl:88 [inlined]
     [11] get!(default::CUDA.CUDNN.var"#1222#1229"{CUDA.CUDNN.var"#new_state#1224", NamedTuple{(:device, :context, :stream, :math_mode, :math_precision), Tuple{CuDevice, CuContext, CuStream, CUDA.MathMode, Symbol}}}, h::Dict{CuContext, NamedTuple{(:handle, :stream), Tuple{Ptr{Nothing}, CuStream}}}, key::CuContext)
        @ Base ./dict.jl:464
     [12] handle()
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/CUDNN.jl:87
     [13] (::CUDA.CUDNN.var"#1145#1147"{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CUDNN.cudnnActivationMode_t, CUDA.CUDNN.cudnnConvolutionDescriptor, CUDA.CUDNN.cudnnFilterDescriptor, CUDA.CUDNN.cudnnTensorDescriptor, CUDA.CUDNN.cudnnTensorDescriptor, Base.RefValue{Float32}, Base.RefValue{Float32}, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CUDNN.cudnnConvolutionFwdAlgoPerfStruct})(workspace::CuArray{UInt8, 1, CUDA.Mem.DeviceBuffer})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:105
     [14] with_workspace(f::CUDA.CUDNN.var"#1145#1147"{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CUDNN.cudnnActivationMode_t, CUDA.CUDNN.cudnnConvolutionDescriptor, CUDA.CUDNN.cudnnFilterDescriptor, CUDA.CUDNN.cudnnTensorDescriptor, CUDA.CUDNN.cudnnTensorDescriptor, Base.RefValue{Float32}, Base.RefValue{Float32}, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CUDNN.cudnnConvolutionFwdAlgoPerfStruct}, eltyp::Type{UInt8}, size::CUDA.APIUtils.var"#2#3"{UInt64}, fallback::Nothing; keep::Bool)
        @ CUDA.APIUtils ~/.julia/packages/CUDA/Axzxe/lib/utils/call.jl:77
     [15] with_workspace
        @ ~/.julia/packages/CUDA/Axzxe/lib/utils/call.jl:58 [inlined]
     [16] #with_workspace#1
        @ ~/.julia/packages/CUDA/Axzxe/lib/utils/call.jl:53 [inlined]
     [17] with_workspace (repeats 2 times)
        @ ~/.julia/packages/CUDA/Axzxe/lib/utils/call.jl:53 [inlined]
     [18] cudnnConvolutionForwardAD(w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, bias::Nothing, z::Nothing; y::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, activation::CUDA.CUDNN.cudnnActivationMode_t, convDesc::CUDA.CUDNN.cudnnConvolutionDescriptor, wDesc::CUDA.CUDNN.cudnnFilterDescriptor, xDesc::CUDA.CUDNN.cudnnTensorDescriptor, yDesc::CUDA.CUDNN.cudnnTensorDescriptor, zDesc::Nothing, biasDesc::Nothing, alpha::Base.RefValue{Float32}, beta::Base.RefValue{Float32}, dw::Base.RefValue{Any}, dx::Base.RefValue{Any}, dz::Base.RefValue{Any}, dbias::Base.RefValue{Any}, dready::Base.RefValue{Bool})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:103
     [19] cudnnConvolutionForwardWithDefaults(w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}; padding::Int64, stride::Int64, dilation::Int64, mode::CUDA.CUDNN.cudnnConvolutionMode_t, mathType::CUDA.CUDNN.cudnnMathType_t, reorderType::CUDA.CUDNN.cudnnReorderType_t, group::Int64, format::CUDA.CUDNN.cudnnTensorFormat_t, convDesc::CUDA.CUDNN.cudnnConvolutionDescriptor, xDesc::CUDA.CUDNN.cudnnTensorDescriptor, wDesc::CUDA.CUDNN.cudnnFilterDescriptor, y::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, yDesc::CUDA.CUDNN.cudnnTensorDescriptor, alpha::Int64, beta::Int64, bias::Nothing, z::Nothing, biasDesc::Nothing, zDesc::Nothing, activation::CUDA.CUDNN.cudnnActivationMode_t, dw::Base.RefValue{Any}, dx::Base.RefValue{Any}, dz::Base.RefValue{Any}, dbias::Base.RefValue{Any})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:96
     [20] cudnnConvolutionForwardWithDefaults(w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:92
     [21] #cudnnConvolutionForward#1139
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:50 [inlined]
     [22] cudnnConvolutionForward
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:50 [inlined]
     [23] inference(imgs::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
        @ Main ~/cudatest/cuda2.jl:8
     [24] (::var"#11#12")()
        @ Main ./threadingconstructs.jl:178
in expression starting at /home/mattias/cudatest/cuda2.jl:15

Running in a fresh project with only CUDA v3.8.3 as a dependency.

julia> CUDA.versioninfo()
CUDA toolkit 11.6, artifact installation
NVIDIA driver 470.103.1, for CUDA 11.4
CUDA driver 11.4

Libraries: 
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+470.103.1
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: NVIDIA GeForce GTX 1080 Ti (sm_61, 8.204 GiB / 10.913 GiB available)

Before running the loop for the first time:

julia> CUDA.memory_status()
Effective GPU memory usage: 2.03% (226.625 MiB/10.913 GiB)
Memory pool usage: 8.974 MiB (32.000 MiB reserved)

After run 1:

Effective GPU memory usage: 37.47% (4.090 GiB/10.913 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)

After run 2:

Effective GPU memory usage: 62.13% (6.780 GiB/10.913 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)

After run 3:

Effective GPU memory usage: 86.83% (9.476 GiB/10.913 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)

After run 4 (and the crash):

Effective GPU memory usage: 24.95% (2.723 GiB/10.913 GiB)
Memory pool usage: 22.157 MiB (64.000 MiB reserved)

mfalt commented on Mar 03 '22

The problem definitely seems to be related to how memory is allocated on different threads. By running on the same set of threads every time, the problem seems to disappear. I am not sure how useful this workaround is for my actual use case, but it might give some insight into the problem.

using CUDA

import CUDA.CUDNN: cudnnConvolutionForward

const W1 = cu(randn(5,5,3,6))

function inference(imgs)
    out = cudnnConvolutionForward(W1, imgs)
    return maximum(Array(out))
end

# Channel to put work on
const RecieverChannel = Channel{Tuple{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer},Channel{Float32}}}()
function listen_channel()
    while true
        imgs, res_channel = take!(RecieverChannel)
        res = inference(imgs)
        put!(res_channel, res)
    end
end
# Create N tasks to do the GPU computations
tsks = [Threads.@spawn listen_channel() for i in 1:2]

# Ask one of the Thread-workers to do the job
function inference_caller(imgs)
    res_channel = Channel{Float32}()
    put!(RecieverChannel, (imgs,res_channel))
    take!(res_channel)
end

imgs = cu(randn(28,28,3,1000))

N = 100
for k in 1:20
    for j in 1:10
        res = Any[nothing for i in 1:N]
        for i in 1:N # We spawn a lot of work
            res[i] = Threads.@spawn inference_caller(imgs)
        end
        s = sum(fetch.(res))
        println(s)
    end
end

mfalt commented on Mar 03 '22

Hmm, I cannot reproduce. On Linux, with CUDA.jl#master and Julia 1.7.2 launched with 32 threads, I tried your original example.
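(main() here is just your loop wrapped in a function; a rough sketch of what I ran, reconstructed from your MWE:)

function main()
    # imgs and inference as defined in the MWE above
    N = 100
    for j in 1:10
        res = Any[nothing for i in 1:N]
        for i in 1:N
            res[i] = fetch(Threads.@spawn inference(imgs))
        end
        println(sum(res))
    end
end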

julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

julia> CUDA.memory_status()
Effective GPU memory usage: 12.74% (1.880 GiB/14.751 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)
julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

julia> CUDA.memory_status()
Effective GPU memory usage: 13.13% (1.937 GiB/14.751 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)
julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

julia> CUDA.memory_status()
Effective GPU memory usage: 13.40% (1.976 GiB/14.751 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)
julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

julia> CUDA.memory_status()
Effective GPU memory usage: 13.52% (1.995 GiB/14.751 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)
julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

Running it in the REPL, i.e. not in a main() function, doesn't seem to matter. Can you try CUDA.jl#master?
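(To switch to it, something like the following in the Pkg REPL should do; just the usual way of tracking the development branch:)

pkg> add CUDA#master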

maleadt commented on Mar 04 '22

Changing CUDA.jl versions or system drivers did not solve the problem. However, it does not seem to be reproducible on other machines; I tried it myself on AWS without any problems.

mfalt commented on Mar 08 '22

Very strange. Are you using the same Julia binaries everywhere? How many threads was Julia launched with?
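(i.e. the value passed via -t/--threads at startup and what Threads.nthreads() reports; for example:)

$ julia --threads=4
julia> Threads.nthreads()
4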

maleadt commented on Mar 08 '22

Yes, I have tried several different configurations, but I am only able to reproduce it on that machine. I am not sure whether it is specific to the GPU model or even tied to this particular piece of hardware.

mfalt commented on Mar 08 '22

It's been a while since there's been activity here, so I'm going to close this. If this still happens on latest master, don't hesitate to open a new issue with an updated MWE.

maleadt commented on Apr 26 '24