
Excessive allocations when running on multiple threads

mfalt opened this issue 3 years ago · 5 comments

I'm trying to run a neural network (using Flux) on a production server. I have no problems when running everything in one thread, but memory allocations on the GPU start increasing when I put it behind a Mux.jl server, and eventually I get a CUDA out-of-memory error (or another CUDA-related error). This happens regardless of whether I run the requests serially or in parallel, even with a lock around the GPU computations.
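For reference, the pattern on the server is roughly the following (a hypothetical sketch, not the actual production code; GPU_LOCK and handle_request are made-up names, and the Mux.jl routing is omitted):

const GPU_LOCK = ReentrantLock()

# Every request funnels its GPU work through this one lock, so the GPU code
# never runs concurrently, yet the allocations still grow.
function handle_request(imgs)
    lock(GPU_LOCK) do
        inference(imgs)  # the same inference function as in the MWE below
    end
end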

The following example runs without problems when Threads.@spawn is removed, and never allocates more than 1367 MiB on the GPU. However, when the computations are run in a separate thread, as in this example, it keeps allocating memory on every call until it eventually crashes.

MWE:

using CUDA

import CUDA.CUDNN: cudnnConvolutionForward

const W1 = cu(randn(5,5,3,6))

function inference(imgs)
    out = cudnnConvolutionForward(W1, imgs)
    return maximum(Array(out))
end

imgs = cu(randn(28,28,3,1000))

N = 100
# Run this a few times
for j in 1:10
    res = Any[nothing for i in 1:N]
    for i in 1:N
    res[i] = fetch(Threads.@spawn inference(imgs))  # spawn a fresh task for every call and wait for its result
    end
    s = sum(res)
    println(s)
end
julia> ERROR: LoadError: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:334 [inlined]
 [2] fetch(t::Task)
   @ Base ./task.jl:349
 [3] top-level scope
   @ ~/cudatest/cuda2.jl:18

    nested task error: CUDNNError: CUDNN_STATUS_INTERNAL_ERROR (code 4)
    Stacktrace:
      [1] throw_api_error(res::CUDA.CUDNN.cudnnStatus_t)
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/error.jl:22
      [2] macro expansion
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/error.jl:35 [inlined]
      [3] cudnnCreate()
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/base.jl:3
      [4] #1218
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/CUDNN.jl:72 [inlined]
      [5] (::CUDA.APIUtils.var"#8#11"{CUDA.CUDNN.var"#1218#1225", CUDA.APIUtils.HandleCache{CuContext, Ptr{Nothing}}, CuContext})()
        @ CUDA.APIUtils ~/.julia/packages/CUDA/Axzxe/lib/utils/cache.jl:24
      [6] lock(f::CUDA.APIUtils.var"#8#11"{CUDA.CUDNN.var"#1218#1225", CUDA.APIUtils.HandleCache{CuContext, Ptr{Nothing}}, CuContext}, l::ReentrantLock)
        @ Base ./lock.jl:190
      [7] (::CUDA.APIUtils.var"#check_cache#9"{CUDA.APIUtils.HandleCache{CuContext, Ptr{Nothing}}, CuContext})(f::CUDA.CUDNN.var"#1218#1225")
        @ CUDA.APIUtils ~/.julia/packages/CUDA/Axzxe/lib/utils/cache.jl:22
      [8] pop!
        @ ~/.julia/packages/CUDA/Axzxe/lib/utils/cache.jl:46 [inlined]
      [9] (::CUDA.CUDNN.var"#new_state#1224")(cuda::NamedTuple{(:device, :context, :stream, :math_mode, :math_precision), Tuple{CuDevice, CuContext, CuStream, CUDA.MathMode, Symbol}})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/CUDNN.jl:71
     [10] #1222
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/CUDNN.jl:88 [inlined]
     [11] get!(default::CUDA.CUDNN.var"#1222#1229"{CUDA.CUDNN.var"#new_state#1224", NamedTuple{(:device, :context, :stream, :math_mode, :math_precision), Tuple{CuDevice, CuContext, CuStream, CUDA.MathMode, Symbol}}}, h::Dict{CuContext, NamedTuple{(:handle, :stream), Tuple{Ptr{Nothing}, CuStream}}}, key::CuContext)
        @ Base ./dict.jl:464
     [12] handle()
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/CUDNN.jl:87
     [13] (::CUDA.CUDNN.var"#1145#1147"{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CUDNN.cudnnActivationMode_t, CUDA.CUDNN.cudnnConvolutionDescriptor, CUDA.CUDNN.cudnnFilterDescriptor, CUDA.CUDNN.cudnnTensorDescriptor, CUDA.CUDNN.cudnnTensorDescriptor, Base.RefValue{Float32}, Base.RefValue{Float32}, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CUDNN.cudnnConvolutionFwdAlgoPerfStruct})(workspace::CuArray{UInt8, 1, CUDA.Mem.DeviceBuffer})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:105
     [14] with_workspace(f::CUDA.CUDNN.var"#1145#1147"{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CUDNN.cudnnActivationMode_t, CUDA.CUDNN.cudnnConvolutionDescriptor, CUDA.CUDNN.cudnnFilterDescriptor, CUDA.CUDNN.cudnnTensorDescriptor, CUDA.CUDNN.cudnnTensorDescriptor, Base.RefValue{Float32}, Base.RefValue{Float32}, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CUDNN.cudnnConvolutionFwdAlgoPerfStruct}, eltyp::Type{UInt8}, size::CUDA.APIUtils.var"#2#3"{UInt64}, fallback::Nothing; keep::Bool)
        @ CUDA.APIUtils ~/.julia/packages/CUDA/Axzxe/lib/utils/call.jl:77
     [15] with_workspace
        @ ~/.julia/packages/CUDA/Axzxe/lib/utils/call.jl:58 [inlined]
     [16] #with_workspace#1
        @ ~/.julia/packages/CUDA/Axzxe/lib/utils/call.jl:53 [inlined]
     [17] with_workspace (repeats 2 times)
        @ ~/.julia/packages/CUDA/Axzxe/lib/utils/call.jl:53 [inlined]
     [18] cudnnConvolutionForwardAD(w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, bias::Nothing, z::Nothing; y::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, activation::CUDA.CUDNN.cudnnActivationMode_t, convDesc::CUDA.CUDNN.cudnnConvolutionDescriptor, wDesc::CUDA.CUDNN.cudnnFilterDescriptor, xDesc::CUDA.CUDNN.cudnnTensorDescriptor, yDesc::CUDA.CUDNN.cudnnTensorDescriptor, zDesc::Nothing, biasDesc::Nothing, alpha::Base.RefValue{Float32}, beta::Base.RefValue{Float32}, dw::Base.RefValue{Any}, dx::Base.RefValue{Any}, dz::Base.RefValue{Any}, dbias::Base.RefValue{Any}, dready::Base.RefValue{Bool})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:103
     [19] cudnnConvolutionForwardWithDefaults(w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}; padding::Int64, stride::Int64, dilation::Int64, mode::CUDA.CUDNN.cudnnConvolutionMode_t, mathType::CUDA.CUDNN.cudnnMathType_t, reorderType::CUDA.CUDNN.cudnnReorderType_t, group::Int64, format::CUDA.CUDNN.cudnnTensorFormat_t, convDesc::CUDA.CUDNN.cudnnConvolutionDescriptor, xDesc::CUDA.CUDNN.cudnnTensorDescriptor, wDesc::CUDA.CUDNN.cudnnFilterDescriptor, y::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, yDesc::CUDA.CUDNN.cudnnTensorDescriptor, alpha::Int64, beta::Int64, bias::Nothing, z::Nothing, biasDesc::Nothing, zDesc::Nothing, activation::CUDA.CUDNN.cudnnActivationMode_t, dw::Base.RefValue{Any}, dx::Base.RefValue{Any}, dz::Base.RefValue{Any}, dbias::Base.RefValue{Any})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:96
     [20] cudnnConvolutionForwardWithDefaults(w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
        @ CUDA.CUDNN ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:92
     [21] #cudnnConvolutionForward#1139
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:50 [inlined]
     [22] cudnnConvolutionForward
        @ ~/.julia/packages/CUDA/Axzxe/lib/cudnn/convolution.jl:50 [inlined]
     [23] inference(imgs::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
        @ Main ~/cudatest/cuda2.jl:8
     [24] (::var"#11#12")()
        @ Main ./threadingconstructs.jl:178
in expression starting at /home/mattias/cudatest/cuda2.jl:15

Running in a fresh project with only CUDA v3.8.3 as a dependency.

julia> CUDA.versioninfo()
CUDA toolkit 11.6, artifact installation
NVIDIA driver 470.103.1, for CUDA 11.4
CUDA driver 11.4

Libraries: 
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+470.103.1
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: NVIDIA GeForce GTX 1080 Ti (sm_61, 8.204 GiB / 10.913 GiB available)

Before running the loop for the first time:

julia> CUDA.memory_status()
Effective GPU memory usage: 2.03% (226.625 MiB/10.913 GiB)
Memory pool usage: 8.974 MiB (32.000 MiB reserved)

After run 1:

Effective GPU memory usage: 37.47% (4.090 GiB/10.913 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)

After run 2:

Effective GPU memory usage: 62.13% (6.780 GiB/10.913 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)

After run 3:

Effective GPU memory usage: 86.83% (9.476 GiB/10.913 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)

After run 4 (and the crash):

Effective GPU memory usage: 24.95% (2.723 GiB/10.913 GiB)
Memory pool usage: 22.157 MiB (64.000 MiB reserved)

mfalt commented on Mar 03 '22

The problem definitely seems to be related to how memory is allocated on different threads. By running on the same set of threads every time, the problem seems to disappear. I am not sure how useful this workaround is for my actual use case, but it might give some insight into the problem.

using CUDA

import CUDA.CUDNN: cudnnConvolutionForward

const W1 = cu(randn(5,5,3,6))

function inference(imgs)
    out = cudnnConvolutionForward(W1, imgs)
    return maximum(Array(out))
end

# Channel to put work on
const RecieverChannel = Channel{Tuple{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer},Channel{Float32}}}()
function listen_channel()
    while true
        imgs, res_channel = take!(RecieverChannel)
        res = inference(imgs)
        put!(res_channel, res)
    end
end
# Create N tasks to do the GPU computations
tsks = [Threads.@spawn listen_channel() for i in 1:2]

# Ask one of the Thread-workers to do the job
function inference_caller(imgs)
    res_channel = Channel{Float32}()
    put!(RecieverChannel, (imgs,res_channel))
    take!(res_channel)
end

imgs = cu(randn(28,28,3,1000))

N = 100
for k in 1:20
    for j in 1:10
        res = Any[nothing for i in 1:N]
        for i in 1:N # We spawn a lot of work
            res[i] = Threads.@spawn inference_caller(imgs)
        end
        s = sum(fetch.(res))
        println(s)
    end
end

mfalt commented on Mar 03 '22

Hmm, I cannot reproduce. On Linux, with CUDA.jl#master and Julia 1.7.2 launched with 32 threads, I tried your original example.
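(main() here is just your loop wrapped in a function; a rough sketch of what I ran, reconstructed from your MWE:)

function main()
    # imgs and inference as defined in the MWE above
    N = 100
    for j in 1:10
        res = Any[nothing for i in 1:N]
        for i in 1:N
            res[i] = fetch(Threads.@spawn inference(imgs))
        end
        println(sum(res))
    end
end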

julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

julia> CUDA.memory_status()
Effective GPU memory usage: 12.74% (1.880 GiB/14.751 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)
julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

julia> CUDA.memory_status()
Effective GPU memory usage: 13.13% (1.937 GiB/14.751 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)
julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

julia> CUDA.memory_status()
Effective GPU memory usage: 13.40% (1.976 GiB/14.751 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)
julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

julia> CUDA.memory_status()
Effective GPU memory usage: 13.52% (1.995 GiB/14.751 GiB)
Memory pool usage: 22.157 MiB (928.000 MiB reserved)
julia> main()
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488
4641.488

Running it in the REPL, i.e. not in a main() function, doesn't seem to matter. Can you try CUDA.jl#master?
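(To switch to it, something like the following in the Pkg REPL should do; just the usual way of tracking the development branch:)

pkg> add CUDA#master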

maleadt commented on Mar 04 '22

Changing CUDA.jl versions or system drivers did not solve the problem. However, it does not seem to be reproducible on other machines; I tried it myself on AWS without any problems.

mfalt commented on Mar 08 '22

Very strange. Are you using the same Julia binaries everywhere? How many threads was Julia launched with?
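(i.e. the value passed via -t/--threads at startup and what Threads.nthreads() reports; for example:)

$ julia --threads=4
julia> Threads.nthreads()
4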

maleadt commented on Mar 08 '22

Yes, I have tried several different configurations, but I am only able to reproduce it on that machine. I am not sure whether it is specific to the GPU model or even tied to this particular piece of hardware.

mfalt commented on Mar 08 '22

It's been a while since there's been activity here, so I'm going to close this. If this still happens on latest master, don't hesitate to open a new issue with an updated MWE.

maleadt commented on Apr 26 '24