
Preliminary fix for Base.unsafe_wrap

Open · matinraayai opened this pull request 1 year ago • 6 comments

This PR should fix Base.unsafe_wrap, which allocated new memory instead of wrapping the passed pointer.
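
For context, a minimal sketch of the intended no-copy behaviour (illustrative only; the call signature follows the examples later in this thread):

using AMDGPU

# A host array we want to expose to the GPU without copying.
x = zeros(Float32, 4, 4)

# unsafe_wrap should reuse the memory behind pointer(x) rather than
# allocating a new buffer; writes through xd should be visible in x.
xd = unsafe_wrap(ROCArray, pointer(x), size(x))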

matinraayai avatar Jan 22 '24 18:01 matinraayai

@pxl-th can you please review? Also, are there plans to make a distinction between host-accessible buffers and managed buffers in AMDGPU.jl?

matinraayai avatar Jan 23 '24 21:01 matinraayai

which allocated new memory instead of wrapping the passed pointer

Can you show an example of when it allocated? Neither HostBuffer nor HIPBuffer allocates when passed a pointer and a byte size.

Wrapping host array
julia> x = zeros(Float32, 4, 4);

julia> xd = unsafe_wrap(ROCArray, pointer(x), size(x));
:3:hip_context.cpp          :152 : 2775812923 us: [pid:13824 tid:0x7fa8c6f40d00]  hipCtxCreate ( 0x7ffdc05f8250, 0, 0 ) 
:3:hip_context.cpp          :164 : 2775812936 us: [pid:13824 tid:0x7fa8c6f40d00] hipCtxCreate: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :652 : 2775812941 us: [pid:13824 tid:0x7fa8c6f40d00]  hipSetDevice ( 0 ) 
:3:hip_device_runtime.cpp   :656 : 2775812945 us: [pid:13824 tid:0x7fa8c6f40d00] hipSetDevice: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :637 : 2775812955 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDeviceCount ( 0x7ffdc05f84e8 ) 
:3:hip_device_runtime.cpp   :639 : 2775812959 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDeviceCount: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :622 : 2775812963 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f83f8 ) 
:3:hip_device_runtime.cpp   :630 : 2775812966 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_stream.cpp           :375 : 2775812975 us: [pid:13824 tid:0x7fa8c6f40d00]  hipStreamCreateWithPriority ( 0x7fa8bc099620, 0, 0 ) 
:3:rocdevice.cpp            :2768: 2775812984 us: [pid:13824 tid:0x7fa8c6f40d00] number of allocated hardware queues with low priority: 0, with normal priority: 0, with high priority: 0, maximum per priority is: 4
:3:rocdevice.cpp            :2846: 2775825071 us: [pid:13824 tid:0x7fa8c6f40d00] created hardware queue 0x7fa89c00a000 with size 16384 with priority 1, cooperative: 0
:3:rocdevice.cpp            :2938: 2775825081 us: [pid:13824 tid:0x7fa8c6f40d00] acquireQueue refCount: 0x7fa89c00a000 (1)
:4:rocdevice.cpp            :2099: 2775825338 us: [pid:13824 tid:0x7fa8c6f40d00] Allocate hsa host memory 0x7fa6e0600000, size 0x100000
:3:devprogram.cpp           :2686: 2775957593 us: [pid:13824 tid:0x7fa8c6f40d00] Using Code Object V5.
:3:hip_stream.cpp           :390 : 2775958610 us: [pid:13824 tid:0x7fa8c6f40d00] hipStreamCreateWithPriority: Returned hipSuccess : stream:0xb7eb70
:3:hip_device_runtime.cpp   :622 : 2775958618 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f8518 ) 
:3:hip_device_runtime.cpp   :630 : 2775958622 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :622 : 2775958632 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f8538 ) 
:3:hip_device_runtime.cpp   :630 : 2775958635 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_memory.cpp           :1233: 2775958655 us: [pid:13824 tid:0x7fa8c6f40d00]  hipHostRegister ( 0x7fa8bcd1bf40, 64, 2 ) 
:4:rocmemory.cpp            :967 : 2775958684 us: [pid:13824 tid:0x7fa8c6f40d00] Locking to pool 0x7802e0, size 0x40, HostPtr = 0x7fa8bcd1bf40, DevPtr = 0x7fa8bcd1bf40
:3:hip_memory.cpp           :1235: 2775958688 us: [pid:13824 tid:0x7fa8c6f40d00] hipHostRegister: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :622 : 2775958692 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f84d8 ) 
:3:hip_device_runtime.cpp   :630 : 2775958694 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_memory.cpp           :3350: 2775958702 us: [pid:13824 tid:0x7fa8c6f40d00]  hipHostGetDevicePointer ( 0x7fa8bc099630, 0x7fa8bcd1bf40, 0 ) 
:3:hip_memory.cpp           :3364: 2775958706 us: [pid:13824 tid:0x7fa8c6f40d00] hipHostGetDevicePointer: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :622 : 2775963126 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f8e38 ) 
:3:hip_device_runtime.cpp   :630 : 2775963133 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :607 : 2775963139 us: [pid:13824 tid:0x7fa8c6f40d00]  hipDeviceSynchronize (  ) 
:3:hip_device_runtime.cpp   :610 : 2775963143 us: [pid:13824 tid:0x7fa8c6f40d00] hipDeviceSynchronize: Returned hipSuccess : 
Wrapping device array
julia> x = ROCArray{Float32}(undef, 4, 4);
:3:hip_device_runtime.cpp   :622 : 2870790570 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f8438 ) 
:3:hip_device_runtime.cpp   :630 : 2870790582 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_mempool.cpp          :59  : 2870790591 us: [pid:13824 tid:0x7fa8c6f40d00]  hipDeviceGetMemPool ( 0x7fa8be5ed090, 0 ) 
:3:hip_mempool.cpp          :64  : 2870790594 us: [pid:13824 tid:0x7fa8c6f40d00] hipDeviceGetMemPool: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :622 : 2870790686 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f8058 ) 
:3:hip_device_runtime.cpp   :630 : 2870790690 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_mempool.cpp          :69  : 2870790699 us: [pid:13824 tid:0x7fa8c6f40d00]  hipMallocAsync ( 0x7fa8be5ed110, 64, stream:0xb7eb70 ) 
:4:rocdevice.cpp            :2227: 2870790782 us: [pid:13824 tid:0x7fa8c6f40d00] Allocate hsa device memory 0x7fa6e0200000, size 0x40
:3:rocdevice.cpp            :2266: 2870790786 us: [pid:13824 tid:0x7fa8c6f40d00] device=0xb19f90, freeMem_ = 0x5feffffc0
:3:hip_mempool_impl.cpp     :208 : 2870790791 us: [pid:13824 tid:0x7fa8c6f40d00] Pool AllocMem: 0x7fa6e0200000, 0x10b7250
:3:hip_mempool.cpp          :88  : 2870790795 us: [pid:13824 tid:0x7fa8c6f40d00] hipMallocAsync: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :622 : 2870794261 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f8e38 ) 
:3:hip_device_runtime.cpp   :630 : 2870794267 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :607 : 2870794271 us: [pid:13824 tid:0x7fa8c6f40d00]  hipDeviceSynchronize (  ) 
:3:hip_device_runtime.cpp   :610 : 2870794275 us: [pid:13824 tid:0x7fa8c6f40d00] hipDeviceSynchronize: Returned hipSuccess : 

julia> xd = unsafe_wrap(ROCArray, pointer(x), size(x); lock=false);
:3:hip_device_runtime.cpp   :622 : 2875471365 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f8e38 ) 
:3:hip_device_runtime.cpp   :630 : 2875471383 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :607 : 2875471392 us: [pid:13824 tid:0x7fa8c6f40d00]  hipDeviceSynchronize (  ) 
:3:hip_device_runtime.cpp   :610 : 2875471399 us: [pid:13824 tid:0x7fa8c6f40d00] hipDeviceSynchronize: Returned hipSuccess : 

You can see that in neither case do the logs show an allocation for the wrapped memory itself.
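
One way to check this programmatically for the device-pointer case (a rough sketch): the wrapped array should be backed by the very same pointer.

x = ROCArray{Float32}(undef, 4, 4)
xd = unsafe_wrap(ROCArray, pointer(x), size(x); lock=false)
@assert pointer(xd) == pointer(x)  # same device pointer, nothing new allocated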

Also, the tests need to be updated; they are failing right now.

Also are there plans to make a distinction between Host-accessible buffers and managed buffers in AMDGPU.jl?

Ideally yes, but I don't have much time at the moment to work on that.

pxl-th avatar Jan 24 '24 09:01 pxl-th

@pxl-th even if it doesn't allocate, it still doesn't control ownership of the buffer. The Base definition has a keyword argument own that controls this (do correct me if I'm wrong). The lock keyword is not necessary either, since the HIP runtime can tell us the attributes of the pointer. This brings AMDGPU.jl closer to what CUDA.jl (and Base) do for this function here: https://github.com/JuliaGPU/CUDA.jl/blob/275c9bc8f04b7743c34703207f6f8619317d5633/src/array.jl#L234. I will update the tests once I get confirmation that this can be merged into AMDGPU.jl.
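
For reference, Base's generic method already exposes ownership through a keyword; a minimal sketch of the plain-CPU behaviour (on the HIP side, the pointer-attribute query referred to above would be hipPointerGetAttributes in the C API):

# own = true hands the allocation over to Julia: the memory is freed when the
# wrapping Array is garbage-collected, so no explicit Libc.free is needed.
p = Libc.malloc(16 * sizeof(Float32))
a = unsafe_wrap(Array, convert(Ptr{Float32}, p), (4, 4); own = true)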

I might be able to contribute managed buffers to AMDGPU.jl if no one is doing it.

matinraayai avatar Jan 24 '24 16:01 matinraayai

I don't have any objections to these changes (as long as they work). The lock keyword is more of a remnant from the HSA era.

FYI, the current implementation also does not take ownership: when creating a HostBuffer or HIPBuffer from a pointer and its byte size, it marks the buffer as non-owning (the last argument), so it won't free the memory on its own.
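
Concretely, the non-owning behaviour means something like this (a sketch; the wrapper's finalizer leaves the underlying memory alone):

x = ROCArray{Float32}(undef, 16)
let
    xd = unsafe_wrap(ROCArray, pointer(x), size(x); lock=false)
    xd .= 1f0          # writes go straight into x's buffer
end
GC.gc()                # collecting the non-owning wrapper must not free x
@assert all(Array(x) .== 1f0)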

@luraess this might be breaking for you.

pxl-th avatar Jan 24 '24 17:01 pxl-th

@pxl-th you are correct, I edited my last comment to reflect this. In its current state, however, the unsafe_wrap function implemented by AMDGPU.jl does not conform to Base (it doesn't allow taking ownership), and it does not determine the attributes of the pointer automatically (it should not determine the pointer type via the lock argument).

I will update the tests and let CI run this.

In terms of implementing the managed buffer, is there anything on the LLVM side I should fix (e.g. address space)? Also, do you know what the exact difference is between managed buffers and host-accessible buffers in HIP/HSA?

matinraayai avatar Jan 24 '24 18:01 matinraayai

@luraess this might be breaking for you.

The one place we are currently using unsafe_wrap is https://github.com/eth-cscs/ImplicitGlobalGrid.jl/blob/08003fcf7f1f4d3db07ceff0cebc7aac9784659b/src/AMDGPUExt/shared.jl#L41, which is then used to register memory in https://github.com/eth-cscs/ImplicitGlobalGrid.jl/blob/08003fcf7f1f4d3db07ceff0cebc7aac9784659b/src/AMDGPUExt/update_halo.jl#L86-L90.
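
For context, the pattern there is roughly the following (simplified, with illustrative names; the real code is at the links above):

# Wrap an existing host send buffer so the GPU (and ROCm-aware MPI) can access
# it without an extra copy; wrapping registers/pins the host memory.
sendbuf = zeros(Float64, 1024)
sendbuf_d = unsafe_wrap(ROCArray, pointer(sendbuf), size(sendbuf))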

It would be good to sync on that.

luraess avatar Jan 25 '24 00:01 luraess