Support system allocated memory
Support system allocated memory (SAM). This is a new way to allocate device-accessible memory in two environments:
- HMM: Heterogeneous Memory Management, a software solution to support SAM on x86 PCIe-based systems.
- ATS: Address Translation Service, a hardware solution on the Grace Hopper superchip.
Mostly based on @leofang's draft change with some cleanups.
@emcastillo
cc @seberg (as this may be of interest)
@leofang @emcastillo merged current main into the draft change to preserve commit history.
/test mini
It seems I don't have the privilege to trigger CI for first-time contributors lol
Maybe Rong could do a small doc or bug fix in another PR so that he establishes his contributor cred?
Edit: So we can more easily start CI :)
@rongou could you undo commit c95585d? Undoing it is needed to ensure this is functional when using malloc'd memory on G+H. I'll probably have to find time to write a full PR description to explain some implementation details. In the meanwhile, I'll share a note with you offline momentarily.
From the CI results it looks like this breaks the Windows build. We'd need to either add a Windows implementation, or guard it against compiling on Windows.
Ah, interesting. I see two ways out:
- To support Windows, we need to change to `_aligned_malloc` and add a `free` wrapper that calls `_aligned_free` (xref: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/aligned-malloc?view=msvc-170)
- Or, we make line 131 in `install/cupy_builder/_features.py` conditional on whether the build/target platform is Windows (a sketch follows below)
Perhaps it's best to skip Windows (option 2) since managed memory is not doing great on Windows and there's no Windows support for G+H anyway.
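A minimal sketch of option 2, assuming a hypothetical source list inside `install/cupy_builder/_features.py` (the file's actual structure may differ; only the platform check itself is the point):

```python
import sys

# Hypothetical module list; 'system_memory.pyx' is an illustrative name,
# not the PR's actual file.
system_memory_sources = []
if sys.platform != 'win32':
    # Only compile the system-allocated-memory support off Windows.
    system_memory_sources.append('cupy/cuda/system_memory.pyx')
```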
Added it back with Windows support.
I will try to make any changes needed to land this PR. So far, is there any concern that hasn't been addressed?
@kmaehashi can you assign me to get permission to write to @rongou's branch? Thanks!
I think the permission is already granted 😃
Sorry for the late reply. Can I also get access to @rongou's branch, please? There are a few changes I'd like to add.
This is related to the admin permission to the main CuPy repo itself, and I'd need some assessment before doing this. Would you mind opening another PR for now? 🙇🏼
> I'll probably have to find time to write a full PR description to explain some implementation details.
In this PR, we add support for HMM/ATS memory systems so that users can opt in and use `malloc` or `cudaMallocManaged` to allocate memory that is accessible to both CPU and GPU. This is a programming-model feature allowing users of NVIDIA Grace Hopper systems to take full advantage of the coherent memory system with very little code change (only in the process start-up stage).
This opt-in mechanism is currently guarded behind a few steps, as outlined in both the doc change and the test setup (a sketch of these steps follows below):
- Set `CUPY_ENABLE_UMP=1`
- Install `numpy_allocator`
- Set an allocator for both NumPy (CPU) and CuPy (GPU)
The net effect is that we use NumPy and CuPy to represent the CPU and GPU execution spaces, respectively, but the memory space is now unified, and memory transfers between CPU and GPU become no-ops when the conditions are met. Proper stream synchronization is inserted wherever applicable to ensure the CPU and GPU do not race. In particular, D2H copies are skipped by supporting the Python buffer protocol, which is generic enough for any host access, not just NumPy.
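A minimal sketch of the opt-in flow, assuming the `CUPY_ENABLE_UMP` variable and the `numpy_allocator` package named above; the CuPy-side allocator line uses CuPy's existing managed-memory allocator as a stand-in, since the PR's system-memory entry points are not spelled out in this thread:

```python
import os
os.environ["CUPY_ENABLE_UMP"] = "1"  # step 1: set before importing cupy

import numpy as np
import numpy_allocator  # step 2: NumPy's pluggable-allocator hook (PyPI)
import cupy as cp

# Step 3: install host-/device-accessible allocators on both sides. The
# NumPy-side allocator would be wired to the C symbols this PR exports
# (names elided here); the CuPy side is shown with malloc_managed standing
# in for the PR's system-memory allocator.
cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.memory.malloc_managed).malloc)

a_gpu = cp.arange(10)
# With UMP active, this D2H transfer becomes a no-op via the buffer
# protocol; without UMP, CuPy disallows implicit conversion to NumPy.
a_cpu = np.asarray(a_gpu)
```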
Allocators for both NumPy and CuPy are needed because:
- CuPy: we need a mechanism to
  - allocate system memory (via `malloc`)
  - quickly identify where the memory came from (via `mem.identity`; see the illustration after this list)
- NumPy: we need to ensure the allocated memory has the right alignment for CuPy's kernels to consume, in order to avoid data corruption
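For context on the identification point: the generic way to learn where a pointer lives is a CUDA runtime query, which costs a driver call per lookup; caching an identity on the memory object avoids that on hot paths. A small illustration of the slow generic query (`mem.identity` itself is this PR's addition and is not shown):

```python
import cupy as cp

a = cp.arange(10)

# cudaPointerGetAttributes reports which device/memory backs a pointer,
# but each call crosses into the CUDA runtime - too slow for hot paths.
attrs = cp.cuda.runtime.pointerGetAttributes(a.data.ptr)
print(attrs.device)
```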
Due to `numpy_allocator`'s requirements, we expose the memory allocation routines' symbols via `cdef public`, but with C instead of C++ linkage.
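To see why C linkage matters: `numpy_allocator` resolves the routines by symbol name at runtime, and only unmangled C symbols can be looked up that way. A minimal illustration using libc's `malloc`/`free` via ctypes (the PR's actual exported symbol names are not shown):

```python
import ctypes

# On Linux/macOS, CDLL(None) searches symbols already loaded in the process.
libc = ctypes.CDLL(None)

# C-linkage symbols resolve by their exact, unmangled names; a C++-mangled
# symbol would not be found under the plain name "malloc".
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

ptr = libc.malloc(256)
libc.free(ptr)
```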
Rendered document: https://cupy--8442.org.readthedocs.build/en/8442/user_guide/memory.html#unified-memory-programming-ump-support-experimental
I'm not sure `cudaMallocManaged` is relevant here. Although a buffer allocated through it can be accessed from both the CPU and the GPU, we'd have to call it explicitly, which any non-CUDA code is definitely not doing. But with HMM/ATS, any memory allocated by the system `malloc` can potentially be accessed from the GPU, which is the main issue addressed in this PR. That's why I think spelling out SAM helps clarify the issue.
Calling out `cudaMallocManaged` is important because from the allocator (ex: RMM 🙂) perspective it does not really matter whether `malloc` or `cudaMallocManaged` is in use under the hood. As long as a memory resource is both host- and device-accessible, it can be used here. (Though I think there are a few pieces missing to fully support `cudaMallocManaged` as far as unified memory programming is concerned; we need to address this in a separate PR.) Another important reason is that with an early driver (from ~CUDA 12.2, IIRC) the performance of `cudaMallocManaged` was better than `malloc` on G+H in certain use cases. The gap is closing but they are still not on par, AFAIK.
But isn't this PR mainly about avoiding unnecessary copies with SAM? For example, NumPy would never allocate memory through `cudaMallocManaged`, so you do need to copy in that case.
This PR already has the necessary mechanism in place (https://github.com/cupy/cupy/pull/8442#discussion_r1774390960) to handle zero-copy managed memory. One of the missing pieces that we can add later is, as you pointed out, a different `numpy_allocator` C API that calls `cudaMallocManaged` instead of `malloc` for NumPy to use. (And the alignment treatment would be simpler there, because `cudaMallocManaged` is also 256-byte aligned, IIRC, same as `cudaMalloc`.)
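As a reference point, managed memory is already host-accessible today with CuPy's existing `malloc_managed`; a hedged sketch (assumes a Linux system with `concurrentManagedAccess`, where the CPU can dereference a managed pointer after synchronization):

```python
import ctypes
import cupy as cp

# Back CuPy's memory pool with cudaMallocManaged instead of cudaMalloc.
cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.memory.malloc_managed).malloc)

a = cp.arange(4, dtype=cp.float64)
cp.cuda.Device().synchronize()  # ensure the kernel that filled `a` finished

# Managed memory is host-accessible: read the same pointer from the CPU.
host_view = (ctypes.c_double * 4).from_address(a.data.ptr)
print(list(host_view))  # [0.0, 1.0, 2.0, 3.0]
```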
https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
Q: Are there any lightweight perf regression tests that we can run to confirm we don't add overhead due to this change when UMP is not in use (which is the majority of CuPy use cases)?
This pull request now has conflicts. Could you fix it, @rongou? 🙏
/test mini
> Q: Are there any lightweight perf regression tests that we can run to confirm we don't add overhead due to this change when UMP is not in use (which is the majority of CuPy use cases)?
I guess https://github.com/cupy/cupy-performance (work done by @emcastillo) can be used for that purpose?
Thanks!!
I will run the performance tests tomorrow to check if we have regressions! Let's wait until that's done before merging 😇
main
time_add - case 100: CPU: 5.329 us +/- 0.171 (min: 5.050 / max: 6.110) us GPU-0: 7.850 us +/- 0.405 (min: 6.816 / max: 8.416) us
time_add - case 200: CPU: 5.420 us +/- 0.175 (min: 5.131 / max: 6.154) us GPU-0: 15.181 us +/- 0.516 (min: 14.336 / max: 16.192) us
time_add - case 500: CPU: 5.327 us +/- 0.179 (min: 4.963 / max: 5.752) us GPU-0: 61.522 us +/- 0.471 (min: 60.320 / max: 63.040) us
time_add - case 1000: CPU: 5.445 us +/- 0.256 (min: 5.082 / max: 6.513) us GPU-0: 225.683 us +/- 0.517 (min: 224.992 / max: 227.008) us
time_divide - case 100: CPU: 5.305 us +/- 0.119 (min: 5.002 / max: 5.567) us GPU-0: 13.009 us +/- 0.530 (min: 12.288 / max: 13.984) us
time_divide - case 200: CPU: 5.391 us +/- 0.212 (min: 4.903 / max: 6.474) us GPU-0: 31.530 us +/- 0.524 (min: 30.720 / max: 32.768) us
time_divide - case 500: CPU: 5.363 us +/- 0.115 (min: 5.097 / max: 5.690) us GPU-0: 158.026 us +/- 0.478 (min: 157.504 / max: 159.392) us
time_divide - case 1000: CPU: 5.383 us +/- 0.145 (min: 5.006 / max: 5.830) us GPU-0: 608.937 us +/- 0.526 (min: 608.256 / max: 610.080) us
time_divmod - case 100: CPU: 6.665 us +/- 0.625 (min: 6.349 / max: 10.940) us GPU-0: 15.572 us +/- 0.865 (min: 14.464 / max: 20.000) us
time_divmod - case 200: CPU: 6.638 us +/- 0.131 (min: 6.376 / max: 7.155) us GPU-0: 39.015 us +/- 0.342 (min: 38.624 / max: 40.352) us
time_divmod - case 500: CPU: 6.636 us +/- 0.172 (min: 6.343 / max: 7.385) us GPU-0: 199.698 us +/- 0.602 (min: 198.656 / max: 200.544) us
time_divmod - case 1000: CPU: 6.546 us +/- 0.283 (min: 6.192 / max: 7.851) us GPU-0: 774.010 us +/- 0.553 (min: 772.800 / max: 775.872) us
This PR
time_add - case 100: CPU: 6.097 us +/- 2.922 (min: 5.311 / max: 26.124) us GPU-0: 8.548 us +/- 2.633 (min: 6.976 / max: 26.336) us
time_add - case 200: CPU: 5.566 us +/- 0.135 (min: 5.325 / max: 6.150) us GPU-0: 15.398 us +/- 0.619 (min: 14.336 / max: 16.384) us
time_add - case 500: CPU: 5.578 us +/- 0.123 (min: 5.311 / max: 5.881) us GPU-0: 61.663 us +/- 0.358 (min: 60.512 / max: 62.688) us
time_add - case 1000: CPU: 5.600 us +/- 0.179 (min: 5.325 / max: 6.300) us GPU-0: 225.651 us +/- 0.501 (min: 225.152 / max: 227.136) us
time_divide - case 100: CPU: 5.573 us +/- 0.590 (min: 5.285 / max: 9.615) us GPU-0: 13.924 us +/- 2.598 (min: 12.288 / max: 31.168) us
time_divide - case 200: CPU: 5.455 us +/- 0.144 (min: 5.183 / max: 6.174) us GPU-0: 31.781 us +/- 0.558 (min: 30.720 / max: 32.768) us
time_divide - case 500: CPU: 5.497 us +/- 0.164 (min: 5.180 / max: 6.370) us GPU-0: 158.040 us +/- 0.465 (min: 157.440 / max: 159.168) us
time_divide - case 1000: CPU: 5.472 us +/- 0.168 (min: 5.221 / max: 6.019) us GPU-0: 609.254 us +/- 0.630 (min: 608.256 / max: 610.304) us
time_divmod - case 100: CPU: 6.782 us +/- 0.188 (min: 6.550 / max: 7.526) us GPU-0: 15.693 us +/- 0.577 (min: 14.432 / max: 16.640) us
time_divmod - case 200: CPU: 7.320 us +/- 3.447 (min: 6.380 / max: 31.090) us GPU-0: 39.539 us +/- 3.396 (min: 38.688 / max: 63.232) us
time_divmod - case 500: CPU: 6.844 us +/- 0.167 (min: 6.587 / max: 7.422) us GPU-0: 199.846 us +/- 0.593 (min: 198.752 / max: 200.704) us
time_divmod - case 1000: CPU: 6.702 us +/- 0.256 (min: 6.363 / max: 7.934) us GPU-0: 774.319 us +/- 0.461 (min: 773.888 / max: 776.064) us
There is some impact, but it is only around ~0.2 us for ufuncs, which are the functions with the least CPU time.
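For anyone who wants to reproduce numbers in this format, CuPy's built-in helper prints the same CPU/GPU-0 readout; a small sketch (the mapping of "case N" to an N x N array is an assumption based on cupy-performance, not verified here):

```python
import cupy as cp
from cupyx.profiler import benchmark

# "case 1000" is assumed to mean a 1000 x 1000 float32 array.
x = cp.ones((1000, 1000), dtype=cp.float32)

print(benchmark(cp.add, (x, x), n_repeat=100))     # time_add
print(benchmark(cp.divide, (x, x), n_repeat=100))  # time_divide
```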
Unfortunately we cannot use that right away (though it is great work nonetheless!), because CuPy does not include NumPy as a build-time dependency at all, and that would be a big move for such a niche use case. Furthermore, this PR is ready for a quick final pass and merge, and I would hate to further delay the progress here (we get pinged every week by an internal team, and neither Emilio nor I have extra bandwidth to accommodate this suggestion). Let's table the discussion in a separate issue and move on 🙏
Yeah, we are being pinged a lot about this, so it would be great if we could merge it in its current state and work on optimizations in a follow-up PR 😁
Thanks to all involved in the PR! Let me proceed to merge as discussed above.