                        Simplify memory allocator with UVA
We currently keep track of which device owns each GPU allocation, but that isn't necessary: since CUDA 4 we have unified virtual addressing (UVA) on 64-bit platforms for sm_20 and newer devices, so we should use it to get rid of the PerDevice accounting wherever possible (a sketch follows the links below).
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-virtual-address-space
- https://developer.nvidia.com/blog/unified-memory-in-cuda-6/ (section 'Unified Memory or Unified Virtual Addressing')
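For example, UVA lets us recover the owning device from the pointer itself. A minimal sketch, assuming CUDA.jl's auto-generated driver bindings (cuPointerGetAttribute and the CU_POINTER_ATTRIBUTE_* constants) are reachable under the CUDA module; owning_device is a hypothetical helper name:

```julia
using CUDA

# query the device that owns a pointer through UVA, instead of
# consulting a PerDevice data structure
function owning_device(ptr::CuPtr{Nothing})
    ordinal = Ref{Cint}(0)
    CUDA.cuPointerGetAttribute(ordinal, CUDA.CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL, ptr)
    CuDevice(ordinal[])
end
```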
 
PerDevice has other uses though, namely kicking allocations out of the pool when resetting a device. But since that's a rare operation, maybe we should handle it in a slower way, e.g. by identifying the owning device through cuPointerGetAttribute, as sketched below.
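A hedged sketch of that slow path, reusing the hypothetical owning_device helper from above and pretending the pool is a plain vector of pointers:

```julia
# on device reset, keep only the cached blocks that do not belong
# to the device being reset (the dropped ones would then be freed)
function evict_device!(pool::Vector{CuPtr{Nothing}}, dev::CuDevice)
    filter!(ptr -> owning_device(ptr) != dev, pool)
end
```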
Some more details in this webinar: https://developer.download.nvidia.com/CUDA/training/cuda_webinars_GPUDirect_uva.pdf (@vchuravy suggests, based on it, that we should enable P2P transfers whenever possible; a sketch of that follows).
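A rough sketch of what enabling P2P "whenever possible" could look like. This assumes the raw driver wrappers cuDeviceCanAccessPeer and cuCtxEnablePeerAccess accept CUDA.jl's CuDevice/CuContext objects, and it ignores errors for pairs where access was already enabled:

```julia
using CUDA

# enable peer access from every device's context to every peer
# device that reports P2P capability
function enable_p2p!()
    for dev in devices(), peer in devices()
        dev == peer && continue
        can = Ref{Cint}(0)
        CUDA.cuDeviceCanAccessPeer(can, dev, peer)
        can[] == 1 || continue
        device!(peer)
        peer_ctx = context()   # the peer's context
        device!(dev)           # make dev's context current...
        CUDA.cuCtxEnablePeerAccess(peer_ctx, 0)  # ...and let it access peer_ctx
    end
end
```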
Finally, looking at the code I noticed that we may have to switch contexts to free allocations, and this may be the cause of the `ptr not found in allocated` errors @jpsamaroo was running into. Regardless of whether we decide that this is the user's responsibility for now, we could use cuPointerGetAttribute to verify in some debug mode that the active context matches (sketched after the reproducer below). Reproducer on a multi-GPU system:
```julia
julia> device!(0)
julia> a = CuArray([1])
1-element CuArray{Int64, 1}:
 1
julia> device!(1)
julia> b = CuArray([1])
1-element CuArray{Int64, 1}:
 1
julia> CUDA.unsafe_free!(a)
WARNING: Error while freeing CuPtr{Nothing}(0x00007fa9dac00000):
Base.KeyError(key=CUDA.CuPtr{Nothing}(0x00007fa9dac00000))
Stacktrace:
  [1] getindex
    @ ./dict.jl:482 [inlined]
  [2] free
    @ ~/Julia/pkg/CUDA/src/pool.jl:369 [inlined]
  [3] unsafe_free!(xs::CuArray{Int64, 1})
    @ CUDA ~/Julia/pkg/CUDA/src/array.jl:42
  [4] top-level scope
    @ REPL[5]:1
  [5] eval(m::Module, e::Any)
    @ Core ./boot.jl:360
  [6] eval_user_input(ast::Any, backend::REPL.REPLBackend)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:139
  [7] repl_backend_loop(backend::REPL.REPLBackend)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:200
  [8] start_repl_backend(backend::REPL.REPLBackend, consumer::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:185
  [9] run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:317
 [10] run_repl(repl::REPL.AbstractREPL, consumer::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:305
 [11] (::Base.var"#865#867"{Bool, Bool, Bool})(REPL::Module)
    @ Base ./client.jl:387
 [12] #invokelatest#2
    @ ./essentials.jl:707 [inlined]
 [13] invokelatest
    @ ./essentials.jl:706 [inlined]
 [14] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)
    @ Base ./client.jl:372
 [15] exec_options(opts::Base.JLOptions)
    @ Base ./client.jl:302
 [16] _start()
    @ Base ./client.jl:485
julia> CUDA.unsafe_free!(b)
```
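The debug-mode check mentioned above could look roughly like this; check_owning_context is a hypothetical name, and the CuContext handle field access is an assumption about CUDA.jl internals:

```julia
# before freeing, verify that ptr was allocated in the context that
# is currently active; error out instead of corrupting the pool
function check_owning_context(ptr::CuPtr{Nothing})
    handle = Ref{CUDA.CUcontext}()
    CUDA.cuPointerGetAttribute(handle, CUDA.CU_POINTER_ATTRIBUTE_CONTEXT, ptr)
    if handle[] != context().handle
        error("pointer $ptr belongs to a different context; switch devices before freeing")
    end
end
```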
You're confusing UVA with UVM. I'm not talking about unified memory here, so there's no performance impact.
This would also allow us to just use cuMemcpyAsync instead of maintaining the direction-specific variants (cuMemcpyHtoDAsync, cuMemcpyDtoHAsync, cuMemcpyDtoDAsync).
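For illustration, a sketch under the same assumptions (unified_copy! is a hypothetical name; under UVA the driver infers the transfer direction from the addresses, so one call covers H2D, D2H and D2D):

```julia
using CUDA

# a single copy entry point: cuMemcpyAsync accepts any combination of
# host and device addresses once UVA is in effect
function unified_copy!(dst::CuPtr{Nothing}, src::CuPtr{Nothing},
                       nbytes::Integer, stream::CuStream)
    CUDA.cuMemcpyAsync(dst, src, nbytes, stream)
end
```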