                        Simplify memory allocator with UVA
We currently keep track of which device owns each GPU allocation, but that isn't necessary: since CUDA 4 we have unified virtual addressing (UVA) on 64-bit platforms for sm_20 and newer devices, so we should use it to get rid of the PerDevice accounting wherever possible (a sketch follows the links below).
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-virtual-address-space
- https://developer.nvidia.com/blog/unified-memory-in-cuda-6/ (section 'Unified Memory or Unified Virtual Addressing')
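For example, UVA lets us recover the owning device from the pointer itself. A minimal sketch, assuming CUDA.jl's auto-generated driver bindings (cuPointerGetAttribute and the CU_POINTER_ATTRIBUTE_* constants) are reachable under the CUDA module; owning_device is a hypothetical helper name:

```julia
using CUDA

# query the device that owns a pointer through UVA, instead of
# consulting a PerDevice data structure
function owning_device(ptr::CuPtr{Nothing})
    ordinal = Ref{Cint}(0)
    CUDA.cuPointerGetAttribute(ordinal, CUDA.CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL, ptr)
    CuDevice(ordinal[])
end
```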
 
PerDevice has other uses though, namely kicking allocations out of the pool when resetting a device. But since that's a rare operation, maybe we should handle it in a slower way, e.g. by identifying the owning device through cuPointerGetAttribute, as sketched below.
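A hedged sketch of that slow path, reusing the hypothetical owning_device helper from above and pretending the pool is a plain vector of pointers:

```julia
# on device reset, keep only the cached blocks that do not belong
# to the device being reset (the dropped ones would then be freed)
function evict_device!(pool::Vector{CuPtr{Nothing}}, dev::CuDevice)
    filter!(ptr -> owning_device(ptr) != dev, pool)
end
```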
Some more details in this webinar: https://developer.download.nvidia.com/CUDA/training/cuda_webinars_GPUDirect_uva.pdf (@vchuravy suggests, based on it, that we should enable P2P transfers whenever possible; a sketch of that follows).
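A rough sketch of what enabling P2P "whenever possible" could look like. This assumes the raw driver wrappers cuDeviceCanAccessPeer and cuCtxEnablePeerAccess accept CUDA.jl's CuDevice/CuContext objects, and it ignores errors for pairs where access was already enabled:

```julia
using CUDA

# enable peer access from every device's context to every peer
# device that reports P2P capability
function enable_p2p!()
    for dev in devices(), peer in devices()
        dev == peer && continue
        can = Ref{Cint}(0)
        CUDA.cuDeviceCanAccessPeer(can, dev, peer)
        can[] == 1 || continue
        device!(peer)
        peer_ctx = context()   # the peer's context
        device!(dev)           # make dev's context current...
        CUDA.cuCtxEnablePeerAccess(peer_ctx, 0)  # ...and let it access peer_ctx
    end
end
```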
Finally, looking at the code I noticed that we may have to switch contexts to free allocations, and this may be the cause of the `ptr not found in allocated` errors @jpsamaroo was running into. Regardless of whether we decide that this is the user's responsibility for now, we could use cuPointerGetAttribute to verify in some debug mode that the active context matches (sketched after the reproducer below). Reproducer on a multi-GPU system:
```julia
julia> device!(0)
julia> a = CuArray([1])
1-element CuArray{Int64, 1}:
 1
julia> device!(1)
julia> b = CuArray([1])
1-element CuArray{Int64, 1}:
 1
julia> CUDA.unsafe_free!(a)
WARNING: Error while freeing CuPtr{Nothing}(0x00007fa9dac00000):
Base.KeyError(key=CUDA.CuPtr{Nothing}(0x00007fa9dac00000))
Stacktrace:
  [1] getindex
    @ ./dict.jl:482 [inlined]
  [2] free
    @ ~/Julia/pkg/CUDA/src/pool.jl:369 [inlined]
  [3] unsafe_free!(xs::CuArray{Int64, 1})
    @ CUDA ~/Julia/pkg/CUDA/src/array.jl:42
  [4] top-level scope
    @ REPL[5]:1
  [5] eval(m::Module, e::Any)
    @ Core ./boot.jl:360
  [6] eval_user_input(ast::Any, backend::REPL.REPLBackend)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:139
  [7] repl_backend_loop(backend::REPL.REPLBackend)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:200
  [8] start_repl_backend(backend::REPL.REPLBackend, consumer::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:185
  [9] run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:317
 [10] run_repl(repl::REPL.AbstractREPL, consumer::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:305
 [11] (::Base.var"#865#867"{Bool, Bool, Bool})(REPL::Module)
    @ Base ./client.jl:387
 [12] #invokelatest#2
    @ ./essentials.jl:707 [inlined]
 [13] invokelatest
    @ ./essentials.jl:706 [inlined]
 [14] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)
    @ Base ./client.jl:372
 [15] exec_options(opts::Base.JLOptions)
    @ Base ./client.jl:302
 [16] _start()
    @ Base ./client.jl:485
julia> CUDA.unsafe_free!(b)
```
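The debug-mode check mentioned above could look roughly like this; check_owning_context is a hypothetical name, and the CuContext handle field access is an assumption about CUDA.jl internals:

```julia
# before freeing, verify that ptr was allocated in the context that
# is currently active; error out instead of corrupting the pool
function check_owning_context(ptr::CuPtr{Nothing})
    handle = Ref{CUDA.CUcontext}()
    CUDA.cuPointerGetAttribute(handle, CUDA.CU_POINTER_ATTRIBUTE_CONTEXT, ptr)
    if handle[] != context().handle
        error("pointer $ptr belongs to a different context; switch devices before freeing")
    end
end
```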
You're confusing UVA with UVM. I'm not talking about unified memory here, so there's no performance impact.
This would also allow us to just use cuMemcpyAsync instead of maintaining the direction-specific variants (cuMemcpyHtoDAsync, cuMemcpyDtoHAsync, cuMemcpyDtoDAsync).
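For illustration, a sketch under the same assumptions (unified_copy! is a hypothetical name; under UVA the driver infers the transfer direction from the addresses, so one call covers H2D, D2H and D2D):

```julia
using CUDA

# a single copy entry point: cuMemcpyAsync accepts any combination of
# host and device addresses once UVA is in effect
function unified_copy!(dst::CuPtr{Nothing}, src::CuPtr{Nothing},
                       nbytes::Integer, stream::CuStream)
    CUDA.cuMemcpyAsync(dst, src, nbytes, stream)
end
```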