LoadError: context should be active
Hello, I occasionally get this error when using CUDA.
Training a chain quantizer
-2 2.394080e+04... 0.24 secs updating C
ERROR: LoadError: context should be active
[1] error(::String) at ./error.jl:33
[2] device at /home/xhanko1/.julia/packages/CUDAdrv/JWljj/src/context.jl:165 [inlined]
[3] (::getfield(CuArrays.CUBLAS, Symbol("##3#5")))() at /home/xhanko1/.julia/packages/CuArrays/PD3UJ/src/blas/CUBLAS.jl:25
[4] get!(::getfield(CuArrays.CUBLAS, Symbol("##3#5")), ::Dict{CUDAdrv.CuContext,Ptr{Nothing}}, ::CUDAdrv.CuContext) at ./dict.jl:453
[5] handle at /home/xhanko1/.julia/packages/CuArrays/PD3UJ/src/blas/CUBLAS.jl:20 [inlined]
[6] macro expansion at /home/xhanko1/.julia/packages/CuArrays/PD3UJ/src/blas/error.jl:43 [inlined]
[7] gemm!(::Char, ::Char, ::Float32, ::CuArrays.CuArray{Float32,2}, ::CuArrays.CuArray{Float32,2}, ::Float32, ::CuArrays.CuArray{Float32,2}) at /home/xhanko1/.julia/packages/CuArrays/PD3UJ/src/blas/wrappers.jl:888
[8] gemm at /home/xhanko1/.julia/packages/CuArrays/PD3UJ/src/blas/wrappers.jl:903 [inlined]
[9] quantize_chainq_cuda!(::Array{Int16,2}, ::Array{Float32,2}, ::Array{Array{Float32,2},1}, ::Array{Array{Float32,2},1}, ::UnitRange{Int64}) at /home/xhanko1/.julia/dev/Rayuela/src/ChainQ.jl:239
[10] quantize_chainq(::Array{Float32,2}, ::Array{Array{Float32,2},1}, ::Bool, ::Bool) at /home/xhanko1/.julia/dev/Rayuela/src/ChainQ.jl:325
[11] train_chainq(::Array{Float32,2}, ::Int64, ::Int64, ::Array{Float32,2}, ::Array{Int16,2}, ::Array{Array{Float32,2},1}, ::Int64, ::Bool) at /home/xhanko1/.julia/dev/Rayuela/src/ChainQ.jl:401
[12] run_demos(::String, ::Int64, ::Int64, ::Int64, ::Int64) at /home/xhanko1/.julia/dev/Rayuela/demos/demos_train_query_base.jl:57
[13] top-level scope at /home/xhanko1/.julia/dev/Rayuela/demos/demos_train_query_base.jl:171 [inlined]
[14] top-level scope at ./none:0
[15] include at ./boot.jl:326 [inlined]
[16] include_relative(::Module, ::String) at ./loading.jl:1038
[17] include(::Module, ::String) at ./sysimg.jl:29
[18] include(::String) at ./client.jl:403
[19] top-level scope at none:0
in expression starting at /home/xhanko1/.julia/dev/Rayuela/demos/demos_train_query_base.jl:170
What is weird is that sometimes it lets me train both ChainQ and LSQ, and sometimes I get this error. Does anyone have any pointers as to what could be causing it?
Oh jeez, it seems like the CUDA context is getting garbage collected or something.
To be honest, the CUDA ecosystem in Julia was quite unstable back then, and I had to hack a bunch of things to make it work. Could you please share your OS, Julia version, the command you ran, and any other details that could help me reproduce this issue on my end?
Sure, the OS is Red Hat Enterprise Linux 8.6 and the kernel is Linux 4.18.0-372.19.1.el8_6.x86_64. I am using Julia 1.1.1.
There was some wonky behavior when building Rayuela for the first time: some of the installed library versions didn't match the versions in Manifest.toml, so I am providing the current versions as well:
Installed Requires ───────────── v0.5.2
Installed Adapt ──────────────── v0.4.2
Installed Rmath ──────────────── v0.6.0
Installed AbstractFFTs ───────── v0.4.1
Installed NaNMath ────────────── v0.3.7
Installed HDF5 ───────────────── v0.12.5
Installed QuadGK ─────────────── v2.5.0
Installed JSON ───────────────── v0.21.3
Installed StatsAPI ───────────── v1.5.0
Installed CommonSubexpressions ─ v0.3.0
Installed DataAPI ────────────── v1.10.0
Installed FFTW ───────────────── v0.3.0
Installed GPUArrays ──────────── v0.6.1
Installed CMakeWrapper ───────── v0.2.4
Installed BinDeps ────────────── v1.0.2
Installed Arpack ─────────────── v0.3.2
Installed DataStructures ─────── v0.17.20
Installed Distributions ──────── v0.21.9
Installed NearestNeighbors ───── v0.4.11
Installed NNlib ──────────────── v0.5.0
Installed Distances ──────────── v0.10.7
Installed DiffResults ────────── v1.0.3
Installed MacroTools ─────────── v0.5.9
Installed BinaryProvider ─────── v0.5.10
Installed StaticArrays ───────── v0.12.5
Installed ForwardDiff ────────── v0.10.18
Installed Missings ───────────── v0.4.5
Installed SortingAlgorithms ──── v0.3.1
Installed CMake ──────────────── v1.2.0
Installed URIParser ──────────── v0.4.1
Installed UnPack ─────────────── v1.0.2
Installed RecipesBase ────────── v1.2.1
Installed CUDAdrv ────────────── v1.0.1
Installed CUDAnative ─────────── v1.0.1
Installed PDMats ─────────────── v0.9.12
Installed FillArrays ─────────── v0.5.0
Installed Parsers ────────────── v2.4.0
Installed StatsFuns ──────────── v0.9.8
Installed Compat ─────────────── v2.2.1
Installed VersionParsing ─────── v1.3.0
Installed Clustering ─────────── v0.14.2
Installed CuArrays ───────────── v0.9.1
Installed LLVM ───────────────── v1.1.0
Installed Parameters ─────────── v0.12.3
Installed Reexport ───────────── v0.2.0
Installed CUDAapi ────────────── v0.6.3
Installed Blosc ──────────────── v0.5.1
Installed SpecialFunctions ───── v0.8.0
Installed LogExpFunctions ────── v0.2.5
Installed DocStringExtensions ── v0.8.6
Installed IterativeSolvers ───── v0.8.5
Installed DiffRules ──────────── v0.1.0
Installed Conda ──────────────── v1.5.2
Installed OrderedCollections ─── v1.4.1
Installed StatsBase ──────────── v0.32.2
I am running demos_train_query_base.jl in the Julia REPL using include(...). I commented out lines 29 through 48 and ran the program. I have applied the fix from the other issue; otherwise I get an error much sooner. On top of that, I am running @time in OPQ.jl:186.
What is annoying is that sometimes it gets through both ChainQ and LSQ training and sometimes it crashes with this error, seemingly nondeterministically. Weird.
I got one more CUDA-related error on a custom dataset (100k x 4096), which I unfortunately cannot share, so I will understand if you are unable to help here.
Running CUDA LSQ training...
Training LSQ GPU with 7 codebooks, 4 perturbations, 4 icm iterations and random order = true
Doing fast bin codebook update... done in 1.438 seconds.
-2 1.823259e+03
Creating 50000 random states... done in 0.02 seconds
ERROR: LoadError: CUDA error: invalid argument (code #1, ERROR_INVALID_VALUE)
[1] macro expansion at /home/xhanko1/.julia/packages/CUDAdrv/JWljj/src/base.jl:147 [inlined]
[2] macro expansion at /home/xhanko1/.julia/packages/CUDAdrv/JWljj/src/execution.jl:90 [inlined]
[3] macro expansion at ./gcutils.jl:87 [inlined]
[4] macro expansion at /home/xhanko1/.julia/packages/CUDAdrv/JWljj/src/execution.jl:88 [inlined]
[5] _launch at /home/xhanko1/.julia/packages/CUDAdrv/JWljj/src/execution.jl:68 [inlined]
[6] launch at /home/xhanko1/.julia/packages/CUDAdrv/JWljj/src/execution.jl:60 [inlined]
[7] macro expansion at ./gcutils.jl:87 [inlined]
[8] macro expansion at /home/xhanko1/.julia/packages/CUDAdrv/JWljj/src/execution.jl:171 [inlined]
[9] #_cudacall#24(::Int64, ::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::typeof(CUDAdrv._cudacall), ::CUDAdrv.CuFunction, ::Type{Tuple{Ptr{Float32},Ptr{Float32},Ptr{UInt8},Ptr{Float32},Int32,Int32,Int32}}, ::Tuple{CUDAdrv.Mem.Buffer,CUDAdrv.Mem.Buffer,CUDAdrv.Mem.Buffer,CUDAdrv.Mem.Buffer,Int32,Int32,Int32}) at /home/xhanko1/.julia/packages/CUDAdrv/JWljj/src/execution.jl:154
[10] (::getfield(CUDAdrv, Symbol("#kw##_cudacall")))(::NamedTuple{(:blocks, :threads, :shmem),Tuple{Int64,Tuple{Int64,Int64},Int64}}, ::typeof(CUDAdrv._cudacall), ::CUDAdrv.CuFunction, ::Type, ::Tuple{CUDAdrv.Mem.Buffer,CUDAdrv.Mem.Buffer,CUDAdrv.Mem.Buffer,CUDAdrv.Mem.Buffer,Int32,Int32,Int32}) at ./none:0
[11] #cudacall#22 at /home/xhanko1/.julia/packages/CUDAdrv/JWljj/src/execution.jl:139 [inlined]
[12] (::getfield(CUDAdrv, Symbol("#kw##cudacall")))(::NamedTuple{(:blocks, :threads, :shmem),Tuple{Int64,Tuple{Int64,Int64},Int64}}, ::typeof(CUDAdrv.cudacall), ::CUDAdrv.CuFunction, ::NTuple{7,DataType}, ::CUDAdrv.Mem.Buffer, ::CUDAdrv.Mem.Buffer, ::CUDAdrv.Mem.Buffer, ::CUDAdrv.Mem.Buffer, ::Int32, ::Int32, ::Int32) at ./none:0
[13] veccost2(::Int64, ::Tuple{Int64,Int64}, ::CUDAdrv.Mem.Buffer, ::CUDAdrv.Mem.Buffer, ::CUDAdrv.Mem.Buffer, ::CUDAdrv.Mem.Buffer, ::Int32, ::Int32, ::Int32) at /home/xhanko1/.julia/dev/Rayuela/src/CudaUtilsModule.jl:106
[14] encode_icm_cuda_single(::Array{Float32,2}, ::Array{Int16,2}, ::Array{Array{Float32,2},1}, ::Array{Int64,1}, ::Int64, ::Int64, ::Bool, ::Bool) at /home/xhanko1/.julia/dev/Rayuela/src/LSQ_GPU.jl:116
[15] encode_icm_cuda(::Array{Float32,2}, ::Array{Int16,2}, ::Array{Array{Float32,2},1}, ::Array{Int64,1}, ::Int64, ::Int64, ::Bool, ::Int64, ::Bool) at /home/xhanko1/.julia/dev/Rayuela/src/LSQ_GPU.jl:249
[16] train_lsq_cuda(::Array{Float32,2}, ::Int64, ::Int64, ::Array{Float32,2}, ::Array{Int16,2}, ::Array{Array{Float32,2},1}, ::Int64, ::Int64, ::Int64, ::Bool, ::Int64, ::Int64, ::Bool) at /home/xhanko1/.julia/dev/Rayuela/src/LSQ_GPU.jl:300
[17] experiment_lsq_cuda(::Array{Float32,2}, ::Array{Int16,2}, ::Array{Array{Float32,2},1}, ::Array{Float32,2}, ::Array{Float32,2}, ::Array{Float32,2}, ::Array{UInt32,1}, ::Int64, ::Int64, ::Int64, ::Int64, ::Int64, ::Bool, ::Int64, ::Int64, ::Int64, ::Int64, ::Bool) at /home/xhanko1/.julia/dev/Rayuela/src/LSQ_GPU.jl:345
[18] run_demos(::String, ::Int64, ::Int64, ::Int64, ::Int64) at /home/xhanko1/.julia/dev/Rayuela/demos/demo_profiset.jl:70
[19] top-level scope at /home/xhanko1/.julia/dev/Rayuela/demos/demo_profiset.jl:98 [inlined]
[20] top-level scope at ./none:0
[21] include at ./boot.jl:326 [inlined]
[22] include_relative(::Module, ::String) at ./loading.jl:1038
[23] include(::Module, ::String) at ./sysimg.jl:29
[24] include(::String) at ./client.jl:403
[25] top-level scope at none:0
in expression starting at /home/xhanko1/.julia/dev/Rayuela/demos/demo_profiset.jl:97
This never happens on SIFT1M: there, if I don't get the context error, everything runs fine. Do you have any idea what the issue could be here? I did successfully run the previous methods (PQ, OPQ, RVQ, ERVQ), although they took much longer than on SIFT1M, which makes me suspect that the LSQ implementation cannot handle data of this dimensionality. Could that be correct?
Regarding the last comment, the CUDA kernels have some hardcoded values that they expect, e.g. for data dimensionality. You kind of have to do that if you want to squeeze out the last bits of performance, so that could be the issue, yes.
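If you want to narrow it down, one thing to try is checking the launch configuration right before the failing cudacall. An oversized thread block or shared-memory request is one common way to get ERROR_INVALID_VALUE from a kernel launch, though not the only one. Below is a minimal sketch, assuming the usual limits for compute capability 3.0 and up (1024 threads per block, block dimensions up to 1024 x 1024 x 64, 48 KiB of shared memory per block by default). `launch_problems` and the constants are hypothetical names for illustration, not part of Rayuela or CUDAdrv, and the real limits should be read from your device rather than taken from here.

```julia
# Hypothetical helper, NOT part of Rayuela or CUDAdrv.
# The limits are assumed defaults for compute capability >= 3.0;
# query your device for the actual values before relying on them.
const MAX_THREADS_PER_BLOCK = 1024
const MAX_BLOCK_DIMS = (1024, 1024, 64)
const MAX_GRID_DIM_X = 2^31 - 1
const MAX_SHMEM_BYTES = 48 * 1024  # default per-block limit without opt-in

# Return a list of human-readable problems with a launch configuration.
# An empty list means the configuration is within the assumed limits.
function launch_problems(blocks::Integer,
                         threads::Tuple{Integer,Vararg{Integer}},
                         shmem::Integer)
    problems = String[]

    # 1-D grid size, as in the `blocks::Int64` keyword from the stack trace
    if !(1 <= blocks <= MAX_GRID_DIM_X)
        push!(problems, "grid size $blocks is outside 1:$(MAX_GRID_DIM_X)")
    end

    # Per-dimension block limits
    if length(threads) > 3
        push!(problems, "block has $(length(threads)) dimensions, at most 3 allowed")
    else
        for (i, t) in enumerate(threads)
            if !(1 <= t <= MAX_BLOCK_DIMS[i])
                push!(problems, "block dimension $i is $t, limit $(MAX_BLOCK_DIMS[i])")
            end
        end
    end

    # Total number of threads in one block
    total = prod(threads)
    if total > MAX_THREADS_PER_BLOCK
        push!(problems, "block has $total threads, limit $(MAX_THREADS_PER_BLOCK)")
    end

    # Dynamic shared memory requested for the launch
    if !(0 <= shmem <= MAX_SHMEM_BYTES)
        push!(problems, "shared memory request of $shmem bytes is outside 0:$(MAX_SHMEM_BYTES)")
    end

    return problems
end

# Made-up configurations, only to show the call shape:
foreach(println, launch_problems(50000, (128, 8), 0))   # within the assumed limits
foreach(println, launch_problems(50000, (4096, 1), 0))  # block too large
```

Calling something like this with the same blocks, threads and shmem values that are passed to the kernel in CudaUtilsModule.jl should show whether the 4096-dimensional data pushes the launch past what the GPU accepts. If the list comes back empty, the problem is more likely in one of the values hardcoded inside the kernel itself.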