Runtime overhead on H2D and D2H transfers
This figure comes from a single offload region. https://github.com/ye-luo/miniqmc/blob/46073436a432fc0472bf793784f9f87b5f8fdfcb/src/QMCWaveFunctions/einspline_spo_omp.cpp#L407
The map to/from arrays are already mapped with "target enter data". Why are there still memory_pool_allocate, agents_allow_access, and memory_pool_free calls for these arrays?
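For context, a minimal sketch (not the miniqmc code) of the pattern in question: the arrays are made device-resident once with "target enter data", and each launch only names them again in map clauses, so no per-launch allocation or transfer is expected.

#include <cstddef>
#include <vector>

void example(std::size_t n) {
  std::vector<double> in(n, 1.0), out(n, 0.0);
  double *pin = in.data(), *pout = out.data();

  // One-time allocation + H2D transfer.
  #pragma omp target enter data map(to: pin[0:n]) map(alloc: pout[0:n])

  // Per launch: the arrays are already present, so these map clauses should
  // only adjust reference counts, not allocate or copy again.
  #pragma omp target teams distribute parallel for map(to: pin[0:n]) map(from: pout[0:n])
  for (std::size_t i = 0; i < n; ++i)
    pout[i] = 2.0 * pin[i];

  // D2H transfer and deallocation happen here, once.
  #pragma omp target exit data map(from: pout[0:n]) map(delete: pin[0:n])
}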
That's curious. I thought the per-launch allocations had all been removed, though there is probably still allocation on the first target launch. I'll look into this next week.
Edit: this is probably the allocation in atmi_memcpy. Data is (sometimes unnecessarily) staged to a fine-grain region before copying, and iirc that fine-grain region is presently allocated per memcpy. If memory_pool_allocate is expensive, which it easily could be, we should do some buffer reuse instead.
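To make the cost concrete, here is a rough sketch of what that staged host-to-device path looks like as described above (illustrative only, not the actual atmi_memcpy source): the fine-grain staging buffer is allocated and freed on every call, which is what the memory_pool_allocate / agents_allow_access / memory_pool_free entries in the profile would correspond to.

#include <cstring>
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

hsa_status_t staged_h2d(void *dst, const void *host_src, size_t bytes,
                        hsa_agent_t gpu_agent, hsa_agent_t cpu_agent,
                        hsa_amd_memory_pool_t fine_grain_pool) {
  void *staging = nullptr;
  hsa_status_t err =
      hsa_amd_memory_pool_allocate(fine_grain_pool, bytes, 0, &staging);
  if (err != HSA_STATUS_SUCCESS) return err;

  // Make the staging buffer visible to the GPU agent.
  hsa_amd_agents_allow_access(1, &gpu_agent, nullptr, staging);

  // Host-side copy into the fine-grain region.
  std::memcpy(staging, host_src, bytes);

  // Asynchronous copy into device memory, then wait for completion.
  hsa_signal_t done;
  hsa_signal_create(1, 0, nullptr, &done);
  err = hsa_amd_memory_async_copy(dst, gpu_agent, staging, cpu_agent, bytes,
                                  0, nullptr, done);
  if (err == HSA_STATUS_SUCCESS)
    hsa_signal_wait_scacquire(done, HSA_SIGNAL_CONDITION_LT, 1, UINT64_MAX,
                              HSA_WAIT_STATE_BLOCKED);
  hsa_signal_destroy(done);

  hsa_amd_memory_pool_free(staging);  // freed again on every memcpy
  return err;
}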
A partial patch landed at https://github.com/ROCm-Developer-Tools/amd-llvm-project/pull/174. The HSA API exposes 'lock', which sometimes succeeds, and a synchronous memcpy, which sometimes fails. I'm trying to work out whether that's a constraint on how the functions should be called or a bug in HSA.
Things should now be faster on the happy path. Continuing to look into it.
This will be resolved in the Nov 2 release of AOMP 11.11-0.
I'm afraid I forgot about this after the partial fix above. The CUDA plugin overlaps tasks that we run sequentially, which I think makes this the right test case for asynchronous offloading.
@dhruvachak Can you check this issue on aomp 13.0-2?
I still saw the heavy staging: https://github.com/llvm/llvm-project/blob/0d1d058544446b5fc74e70827a021a017facc56a/openmp/libomptarget/plugins/amdgpu/impl/impl.cpp#L65. I'm wondering if we can use https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/fc99cf8516ef4bfc6311471b717838604a673b73/src/inc/hsa_ext_amd.h#L1820 to query whether the host pointer has already been pinned (https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/fc99cf8516ef4bfc6311471b717838604a673b73/src/inc/hsa_ext_amd.h#L1729) and skip staging.
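A sketch of that check, assuming hsa_amd_pointer_info reports HSA_EXT_POINTER_TYPE_LOCKED for host pointers that have already been pinned:

#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

bool already_pinned(const void *host_ptr) {
  hsa_amd_pointer_info_t info;
  info.size = sizeof(info);  // the caller must set the struct size
  hsa_status_t err = hsa_amd_pointer_info(const_cast<void *>(host_ptr), &info,
                                          /*alloc=*/nullptr,
                                          /*num_agents_accessible=*/nullptr,
                                          /*accessible=*/nullptr);
  return err == HSA_STATUS_SUCCESS && info.type == HSA_EXT_POINTER_TYPE_LOCKED;
}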
What we should have now is: unconditionally try to lock the memory and, if that succeeds, make the memcpy call on the passed argument. I think that is more reliable than the previous strategy. If the lock fails we still do the allocate + memcpy fallback; this seems to be necessary for pointers into rodata and is probably required for pointers into the host stack.
That is, on trunk (and I think amd-stg-open, but I am not sure about aomp) you should no longer see the allocate in the profiler, assuming you're passing heap-allocated data into the target region.
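A rough sketch of that lock-then-fallback flow (not the plugin source; the helper names are illustrative, and staged_h2d refers to the staging sketch earlier in this thread):

#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

hsa_status_t copy_and_wait(void *dst, hsa_agent_t dst_agent, const void *src,
                           hsa_agent_t src_agent, size_t bytes) {
  hsa_signal_t done;
  hsa_signal_create(1, 0, nullptr, &done);
  hsa_status_t err = hsa_amd_memory_async_copy(dst, dst_agent, src, src_agent,
                                               bytes, 0, nullptr, done);
  if (err == HSA_STATUS_SUCCESS)
    hsa_signal_wait_scacquire(done, HSA_SIGNAL_CONDITION_LT, 1, UINT64_MAX,
                              HSA_WAIT_STATE_BLOCKED);
  hsa_signal_destroy(done);
  return err;
}

hsa_status_t h2d(void *dst, void *host_src, size_t bytes, hsa_agent_t gpu_agent,
                 hsa_agent_t cpu_agent, hsa_amd_memory_pool_t fine_grain_pool) {
  void *locked = nullptr;
  // Unconditionally try to lock (pin) the user pointer for the GPU agent.
  if (hsa_amd_memory_lock(host_src, bytes, &gpu_agent, 1, &locked) ==
      HSA_STATUS_SUCCESS) {
    hsa_status_t err = copy_and_wait(dst, gpu_agent, locked, cpu_agent, bytes);
    hsa_amd_memory_unlock(host_src);
    return err;
  }
  // Lock can fail (e.g. pointers into rodata or the host stack); fall back to
  // the allocate + memcpy staging path.
  return staged_h2d(dst, host_src, bytes, gpu_agent, cpu_agent, fine_grain_pool);
}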
On NVIDIA GPUs, registering pinned memory is quite an expensive operation, while checking whether a pointer has already been pinned is cheap. I still have a feeling that always trying to lock is sub-optimal.
That's probably worth checking. @carlobertolli?
The question is whether calling lock on already-locked memory is more expensive than some other query we could make of it. My expectation is that roct has to do roughly the same check we would, so there would be no saving.
Regardless, the really expensive thing is to allocate N bytes of host memory and copy over it, and that should now be gone for all the cases that we know it can be elided for.
HSA is not well documented and I have difficulty making it work as expected. https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/128
I doubt HSA does reference counting, and locking and unlocking already-locked memory may cause undefined behavior.
Looks refcounted to me. See fmm.c in roct, e.g. fmm_deregister_memory:
if (object->registration_count > 1) {
--object->registration_count;
pthread_mutex_unlock(&aperture->fmm_mutex);
return HSAKMT_STATUS_SUCCESS;
}
where fmm_register_memory increments the same counter iff vm_find_object found an existing one.
I don't know offhand what the relationship between hsa_amd_pointer_info and hsa_amd_memory_lock is intended to be. OpenMP isn't using hsa_amd_pointer_info at present.
It is good that HSA does reference counting. In my measurements, the tracer shows lock takes 35us and unlock takes 7us; they are not cheap. If I pin via HIP ahead of computing, lock takes 0.9us and unlock takes roughly 0.7us.
From rocprof it is a bit messier and the time doesn't really match the tracer. Without user-pinned memory:
$ cat results.hsa_stats.csv |grep lock
hsa_amd_memory_lock,803,68153169,84873,9.288249639958051
hsa_amd_memory_unlock,803,19387176,24143,2.6421798596306414
With user-pinned memory:
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
hsa_amd_memory_lock_to_pool,36,41889377,1163593,12.713881187629408
hsa_amd_memory_lock,803,41304882,51438,12.536480602637097
hsa_amd_memory_unlock,838,23449281,27982,7.1171116383357935
The time drop on hsa_amd_memory_lock is very visible. If you are not happy with the printout layout, blame rocprof: https://github.com/RadeonOpenCompute/ROCm/issues/1640
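For reference, the user pinning in the second measurement amounts to something like this on the application side (a sketch, assuming plain hipHostRegister; once the pages are registered ahead of time, the per-transfer lock inside the plugin hits already-pinned memory and returns quickly):

#include <vector>
#include <hip/hip_runtime.h>

int main() {
  std::vector<double> buf(1 << 20);

  // Pin the buffer once, ahead of the offload loop.
  hipHostRegister(buf.data(), buf.size() * sizeof(double), hipHostRegisterDefault);

  // ... offload regions that transfer buf run here; each per-transfer
  // hsa_amd_memory_lock now sees already-pinned pages ...

  hipHostUnregister(buf.data());
  return 0;
}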
More expensive than I expected, thanks for measuring.
I think we're stuck locking it though - we need the memory to remain locked during the call, and if it happened to be locked by some other code we don't know that said other code will keep it locked for the duration.
E.g. another thread might have locked (before our call) and then unlocked (during our call) the same memory (this is probably page granularity, so not even necessarily the same variable).
edit: a potentially interesting move is to hand out locked memory in omp allocate and keep track of the pointers thus handed out, but it's hard to tell whether that would be a win overall. Locked memory is a scarcer resource than address space too. Maybe the way to go is to check what rocr/roct are doing and see if that can be made cheaper.
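A hypothetical sketch of that idea (illustrative names only, not an existing API): hand out host memory that is locked up front and remember the base pointers, so a later transfer could skip the lock call for memory we handed out ourselves. A real implementation would need an interval lookup rather than exact-pointer matching, since transfers often pass interior pointers.

#include <cstdlib>
#include <mutex>
#include <unordered_set>
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

static std::mutex g_mtx;
static std::unordered_set<void *> g_locked;  // base pointers we handed out

// Hypothetical allocator: lock freshly allocated host memory for the GPU
// agent and record it.
void *locked_alloc(size_t bytes, hsa_agent_t gpu_agent) {
  void *host = std::malloc(bytes);
  void *agent_ptr = nullptr;
  if (host && hsa_amd_memory_lock(host, bytes, &gpu_agent, 1, &agent_ptr) ==
                  HSA_STATUS_SUCCESS) {
    std::lock_guard<std::mutex> g(g_mtx);
    g_locked.insert(host);
  }
  return host;
}

// Cheap check a transfer could make before deciding whether to lock.
bool handed_out_locked(void *host) {
  std::lock_guard<std::mutex> g(g_mtx);
  return g_locked.count(host) != 0;
}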
+Carlo
@ye-luo I've run a small test using ROCr directly from C++ for the three cases below. I measure the time elapsed between the curly brackets {}.
- Case 1: lock is included in elapsed time
{
lock(a, locked_a); // not locked
async_memcpy(locked_a);
}
- Case 2: memcpy only, no lock in elapsed time
lock(a, locked_a);
{
async_memcpy(locked_a);
}
- Case 3: re-lock included in elapsed time
lock(a, locked_a1);
{
lock(locked_a1, locked_a2); // already locked
async_memcpy(locked_a2);
}
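For concreteness, a sketch of the case-1 timed region, under the assumption that lock() maps to hsa_amd_memory_lock and async_memcpy() to hsa_amd_memory_async_copy (agents and the device destination buffer are set up elsewhere; cases 2 and 3 simply move the first lock outside the timed region, and whether completion is awaited inside the braces is an assumption here):

#include <chrono>
#include <cstddef>
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

double time_case1_us(double *a, std::size_t n, void *dev_dst, hsa_agent_t gpu,
                     hsa_agent_t cpu) {
  hsa_signal_t done;
  hsa_signal_create(1, 0, nullptr, &done);

  auto t0 = std::chrono::steady_clock::now();
  {
    void *locked_a = nullptr;
    hsa_amd_memory_lock(a, n * sizeof(double), &gpu, 1, &locked_a);  // not locked before
    hsa_amd_memory_async_copy(dev_dst, gpu, locked_a, cpu, n * sizeof(double),
                              0, nullptr, done);
    // Wait for the copy to complete (assumed part of the measured region).
    hsa_signal_wait_scacquire(done, HSA_SIGNAL_CONDITION_LT, 1, UINT64_MAX,
                              HSA_WAIT_STATE_BLOCKED);
  }
  auto t1 = std::chrono::steady_clock::now();

  hsa_amd_memory_unlock(a);
  hsa_signal_destroy(done);
  return std::chrono::duration<double, std::micro>(t1 - t0).count();
}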
The pointer being locked refers to a 10^8 double array. As I believe you point out:
- Case 1 is slower than cases 2 and 3 by several orders of magnitude.
- Cases 2 and 3 perform similarly (I could not detect a difference).
Conclusion: if you have already locked a pointer by the time the amdgpu omptarget plugin tries to lock it again, then you would not pay for the cost of relocking.
Does this sound reasonable to you?
If my measurements are mapped onto your code example: in case 1 the lock takes 7us; in case 3 the inner lock takes 0.9us.
I would say the cost of re-locking is relatively low, but it is not negligible. I also read the timing of unlock from the trace: it took 0.7us. So in total it is 1.6us, which is still not small. If the lock status could be detected ahead of time, that 1.6us could be saved as well. My experiment runs with 1 thread, but QMCPACK does concurrent offload; in that scenario the cost of talking to the runtime definitely goes beyond 1.6us.
The upstream plugin-nextgen keeps track of locked memory (https://reviews.llvm.org/D142514) and AOMP inherits it. Right now the HSA async copy remains very slow; its developers are aware of it and are working on a fix.