
Simplify memory allocator with UVA

Open · maleadt opened this issue Nov 23 '20 · 3 comments

We currently keep track of which device owns each GPU allocation, but that's not necessary. Since CUDA 4 we have unified virtual addressing (UVA) for sm_20+ devices on 64-bit platforms, so we should use it to get rid of the PerDevice accounting wherever possible.

  • https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-virtual-address-space
  • https://developer.nvidia.com/blog/unified-memory-in-cuda-6/ (section 'Unified Memory or Unified Virtual Addressing')

PerDevice has other uses though, namely kicking allocations out of the pool when resetting a device. But since that's a rare operation, maybe we should handle it in a slower way (e.g. by identifying the owning device through cuPointerGetAttribute).
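To illustrate, a minimal sketch of that slower path using the CUDA driver API: under UVA every pointer is unique process-wide, so the driver itself can report which device an allocation belongs to, with no bookkeeping on our side. (Error handling elided; CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL requires CUDA 9.2+.)

```c
#include <cuda.h>

// Ask the driver which device owns a pointer, instead of consulting a
// PerDevice table. Fine for rare operations like device_reset!.
int device_of(CUdeviceptr ptr) {
    int ordinal = -1;
    cuPointerGetAttribute(&ordinal, CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL, ptr);
    return ordinal;
}
```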

Some more details in this webinar: https://developer.download.nvidia.com/CUDA/training/cuda_webinars_GPUDirect_uva.pdf (@vchuravy suggests based on it that we should enable P2P transfers whenever possible)
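The P2P suggestion could look roughly like the following driver API sketch: probe every device pair and enable peer access where supported, so UVA copies between devices go over the direct path. (The ctxs/devs arrays are hypothetical caller state; error handling elided; the flags argument to cuCtxEnablePeerAccess must be 0.)

```c
#include <cuda.h>

// Enable peer access between all capable device pairs.
void enable_p2p(CUcontext *ctxs, CUdevice *devs, int ndev) {
    for (int i = 0; i < ndev; i++) {
        cuCtxPushCurrent(ctxs[i]);
        for (int j = 0; j < ndev; j++) {
            int can = 0;
            if (i != j)
                cuDeviceCanAccessPeer(&can, devs[i], devs[j]);
            if (can)
                cuCtxEnablePeerAccess(ctxs[j], 0);  // grant i access to j's memory
        }
        cuCtxPopCurrent(NULL);
    }
}
```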

Finally, looking at the code I noticed that we may have to switch contexts to free allocations, and this may be the cause of the `ptr not found in allocated` errors @jpsamaroo was running into. Regardless of whether we decide this is the user's responsibility for now, we could use cuPointerGetAttribute to check, in a debug mode, that the context matches.
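A sketch of what a context-correct free could look like at the driver level, again leaning on UVA to recover the owning context from the pointer itself (error handling elided):

```c
#include <cuda.h>

// Free an allocation regardless of which context is currently active,
// by temporarily activating the context that owns the pointer.
void free_anywhere(CUdeviceptr ptr) {
    CUcontext owner;
    cuPointerGetAttribute(&owner, CU_POINTER_ATTRIBUTE_CONTEXT, ptr);
    cuCtxPushCurrent(owner);   // switch to the allocating context
    cuMemFree(ptr);
    cuCtxPopCurrent(NULL);     // restore the caller's context
}
```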

maleadt avatar Nov 23 '20 17:11 maleadt

Finally, looking at the code I noticed that we maybe have to switch contexts to free allocations, and this may be the cause for some `ptr not found in allocated` errors @jpsamaroo was running into.

julia> device!(0)

julia> a = CuArray([1])
1-element CuArray{Int64, 1}:
 1

julia> device!(1)

julia> b = CuArray([1])
1-element CuArray{Int64, 1}:
 1

julia> CUDA.unsafe_free!(a)
WARNING: Error while freeing CuPtr{Nothing}(0x00007fa9dac00000):
Base.KeyError(key=CUDA.CuPtr{Nothing}(0x00007fa9dac00000))

Stacktrace:
  [1] getindex
    @ ./dict.jl:482 [inlined]
  [2] free
    @ ~/Julia/pkg/CUDA/src/pool.jl:369 [inlined]
  [3] unsafe_free!(xs::CuArray{Int64, 1})
    @ CUDA ~/Julia/pkg/CUDA/src/array.jl:42
  [4] top-level scope
    @ REPL[5]:1
  [5] eval(m::Module, e::Any)
    @ Core ./boot.jl:360
  [6] eval_user_input(ast::Any, backend::REPL.REPLBackend)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:139
  [7] repl_backend_loop(backend::REPL.REPLBackend)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:200
  [8] start_repl_backend(backend::REPL.REPLBackend, consumer::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:185
  [9] run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:317
 [10] run_repl(repl::REPL.AbstractREPL, consumer::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:305
 [11] (::Base.var"#865#867"{Bool, Bool, Bool})(REPL::Module)
    @ Base ./client.jl:387
 [12] #invokelatest#2
    @ ./essentials.jl:707 [inlined]
 [13] invokelatest
    @ ./essentials.jl:706 [inlined]
 [14] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)
    @ Base ./client.jl:372
 [15] exec_options(opts::Base.JLOptions)
    @ Base ./client.jl:302
 [16] _start()
    @ Base ./client.jl:485

julia> CUDA.unsafe_free!(b)

maleadt avatar Nov 24 '20 09:11 maleadt

You're confusing UVA with UVM. I'm not talking about unified (managed) memory here, so there's no performance impact.

maleadt avatar Dec 17 '20 07:12 maleadt

This would also allow us to just use cuMemcpyAsync instead of maintaining the different versions.
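Concretely: with UVA the driver can infer the copy direction from the pointer values, so one entry point can replace the cuMemcpyHtoDAsync/cuMemcpyDtoHAsync/cuMemcpyDtoDAsync variants. A sketch (casts assume host pointers participate in the unified address space):

```c
#include <stdint.h>
#include <cuda.h>

// One copy routine for any host/device/peer combination under UVA;
// the driver works out the direction from the pointers themselves.
void copy_any(void *dst, const void *src, size_t bytes, CUstream stream) {
    cuMemcpyAsync((CUdeviceptr)(uintptr_t)dst,
                  (CUdeviceptr)(uintptr_t)src,
                  bytes, stream);
}
```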

maleadt avatar Jan 17 '22 08:01 maleadt