Support system allocated memory
Support system allocated memory (SAM). This is a new way to allocate device-accessible memory in two environments:
- HMM: Heterogeneous Memory Management, a software solution to support SAM on x86 PCIe-based systems.
- ATS: Address Translation Service, a hardware solution on the Grace Hopper superchip.
Mostly based on @leofang's draft change with some cleanups.
@emcastillo
cc @seberg (as this may be of interest)
@leofang @emcastillo merged current main into the draft change to preserve commit history.
/test mini
It seems I don't have the privilege to trigger CI for first-time contributors lol
Maybe Rong could do a small doc or bug fix in another PR so that he establishes his contributor cred?
Edit: So we can more easily start CI :)
@rongou could you undo commit c95585d? Undoing it is needed to ensure this is functional when using malloc'd memory on G+H. I'll probably have to find time to write a full PR description to explain some implementation details. In the meanwhile, I'll share a note with you offline momentarily.
From the CI results it looks like this breaks the Windows build. We'd need to either add a Windows implementation, or guard it against compiling on Windows.
Ah, interesting. I see two ways out:
- To support Windows, we need to change to `_aligned_malloc` and add a `free` wrapper that calls `_aligned_free` (xref: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/aligned-malloc?view=msvc-170)
- Or, we make line 131 in `install/cupy_builder/_features.py` conditional on whether the build/target platform is Windows (a sketch follows below)
Perhaps it's best to skip Windows (option 2) since managed memory is not doing great on Windows and there's no Windows support for G+H anyway.
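A minimal sketch of option 2, assuming a hypothetical source list inside `install/cupy_builder/_features.py` (the file's actual structure may differ; only the platform check itself is the point):

```python
import sys

# Hypothetical module list; 'system_memory.pyx' is an illustrative name,
# not the PR's actual file.
system_memory_sources = []
if sys.platform != 'win32':
    # Only compile the system-allocated-memory support off Windows.
    system_memory_sources.append('cupy/cuda/system_memory.pyx')
```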
Added it back with Windows support.
I will try to make any changes needed to land this PR. So far, is there any concern that hasn't been addressed?
@kmaehashi can you assign me to get permission to write to @rongou's branch? Thanks!
I think the permission is already granted 😃
Sorry for the late reply. Can I also get access to @rongou's branch, please? There are a few changes I'd like to add.
This is related to the admin permission to the main CuPy repo itself, and I'd need some assessment before doing this. Would you mind opening another PR for now? 🙇🏼
> I'll probably have to find time to write a full PR description to explain some implementation details.
In this PR, we add support for HMM/ATS memory systems so that users can opt in and use `malloc` or `cudaMallocManaged` to allocate memory that is accessible to both CPU and GPU. This is a programming-model feature allowing users of NVIDIA Grace Hopper systems to take full advantage of the coherent memory system with very little code change (only in the process start-up stage).
This opt-in mechanism is currently guarded behind a few steps, as outlined in both the doc change and the test setup (a sketch of these steps follows below):
- Set `CUPY_ENABLE_UMP=1`
- Install `numpy_allocator`
- Set an allocator for both NumPy (CPU) and CuPy (GPU)
The net effect is that we use NumPy and CuPy to represent the CPU and GPU execution spaces, respectively, but the memory space is now unified, and memory transfers between CPU and GPU become no-ops when the conditions are met. Proper stream synchronization is inserted wherever applicable to ensure the CPU and GPU do not race. In particular, D2H copies are skipped by supporting the Python buffer protocol, which is generic enough for any host access, not just NumPy.
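A minimal sketch of the opt-in flow, assuming the `CUPY_ENABLE_UMP` variable and the `numpy_allocator` package named above; the CuPy-side allocator line uses CuPy's existing managed-memory allocator as a stand-in, since the PR's system-memory entry points are not spelled out in this thread:

```python
import os
os.environ["CUPY_ENABLE_UMP"] = "1"  # step 1: set before importing cupy

import numpy as np
import numpy_allocator  # step 2: NumPy's pluggable-allocator hook (PyPI)
import cupy as cp

# Step 3: install host-/device-accessible allocators on both sides. The
# NumPy-side allocator would be wired to the C symbols this PR exports
# (names elided here); the CuPy side is shown with malloc_managed standing
# in for the PR's system-memory allocator.
cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.memory.malloc_managed).malloc)

a_gpu = cp.arange(10)
# With UMP active, this D2H transfer becomes a no-op via the buffer
# protocol; without UMP, CuPy disallows implicit conversion to NumPy.
a_cpu = np.asarray(a_gpu)
```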
Allocators for both NumPy and CuPy are needed because:
- CuPy: we need a mechanism to
  - allocate system memory (via `malloc`)
  - quickly identify where the memory came from (via `mem.identity`; see the illustration after this list)
- NumPy: we need to ensure the allocated memory has the right alignment for CuPy's kernels to consume, in order to avoid data corruption
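For context on the identification point: the generic way to learn where a pointer lives is a CUDA runtime query, which costs a driver call per lookup; caching an identity on the memory object avoids that on hot paths. A small illustration of the slow generic query (`mem.identity` itself is this PR's addition and is not shown):

```python
import cupy as cp

a = cp.arange(10)

# cudaPointerGetAttributes reports which device/memory backs a pointer,
# but each call crosses into the CUDA runtime - too slow for hot paths.
attrs = cp.cuda.runtime.pointerGetAttributes(a.data.ptr)
print(attrs.device)
```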
Due to `numpy_allocator`'s requirements, we expose the memory allocation routines' symbols via `cdef public`, but with C instead of C++ linkage.
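To see why C linkage matters: `numpy_allocator` resolves the routines by symbol name at runtime, and only unmangled C symbols can be looked up that way. A minimal illustration using libc's `malloc`/`free` via ctypes (the PR's actual exported symbol names are not shown):

```python
import ctypes

# On Linux/macOS, CDLL(None) searches symbols already loaded in the process.
libc = ctypes.CDLL(None)

# C-linkage symbols resolve by their exact, unmangled names; a C++-mangled
# symbol would not be found under the plain name "malloc".
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

ptr = libc.malloc(256)
libc.free(ptr)
```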
Rendered document: https://cupy--8442.org.readthedocs.build/en/8442/user_guide/memory.html#unified-memory-programming-ump-support-experimental
I'm not sure `cudaMallocManaged` is relevant here. Although a buffer allocated through it can be accessed from both the CPU and the GPU, we'd have to call it explicitly, which any non-CUDA code is definitely not doing. But with HMM/ATS, any memory allocated by the system `malloc` can potentially be accessed from the GPU, which is the main issue addressed in this PR. That's why I think spelling out SAM helps clarify the issue.
Calling out `cudaMallocManaged` is important because from the allocator (ex: RMM 🙂) perspective it does not really matter whether `malloc` or `cudaMallocManaged` is in use under the hood. As long as a memory resource is both host- and device-accessible, it can be used here. (Though I think there are a few pieces missing to fully support `cudaMallocManaged` as far as unified memory programming is concerned; we need to address this in a separate PR.) Another important reason is that with an early driver (from ~CUDA 12.2, IIRC) the performance of `cudaMallocManaged` was better than `malloc` on G+H in certain use cases. The gap is closing but they are still not on par, AFAIK.
But isn't this PR mainly about avoiding unnecessary copies with SAM? For example, NumPy would never allocate memory through `cudaMallocManaged`, so you do need to copy in that case.
This PR already has the necessary mechanism in place (https://github.com/cupy/cupy/pull/8442#discussion_r1774390960) to handle zero-copy managed memory. One of the missing pieces that we can add later is, as you pointed out, a different `numpy_allocator` C API that calls `cudaMallocManaged` instead of `malloc` for NumPy to use. (And the alignment treatment would be simpler there, because `cudaMallocManaged` is also 256-byte aligned, IIRC, same as `cudaMalloc`.)
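As a reference point, managed memory is already host-accessible today with CuPy's existing `malloc_managed`; a hedged sketch (assumes a Linux system with `concurrentManagedAccess`, where the CPU can dereference a managed pointer after synchronization):

```python
import ctypes
import cupy as cp

# Back CuPy's memory pool with cudaMallocManaged instead of cudaMalloc.
cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.memory.malloc_managed).malloc)

a = cp.arange(4, dtype=cp.float64)
cp.cuda.Device().synchronize()  # ensure the kernel that filled `a` finished

# Managed memory is host-accessible: read the same pointer from the CPU.
host_view = (ctypes.c_double * 4).from_address(a.data.ptr)
print(list(host_view))  # [0.0, 1.0, 2.0, 3.0]
```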
https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
Q: Are there any lightweight perf regression tests that we can run to confirm we don't add overhead due to this change when UMP is not in use (which is the majority of CuPy use cases)?
This pull request now has conflicts. Could you fix it, @rongou? 🙏
/test mini
> Q: Are there any lightweight perf regression tests that we can run to confirm we don't add overhead due to this change when UMP is not in use (which is the majority of CuPy use cases)?
I guess https://github.com/cupy/cupy-performance (work done by @emcastillo) can be used for that purpose?
Thanks!!
I will run the performance tests tomorrow to check if we have regressions! Let's wait until that's done before merging 😇
main
time_add - case 100: CPU: 5.329 us +/- 0.171 (min: 5.050 / max: 6.110) us GPU-0: 7.850 us +/- 0.405 (min: 6.816 / max: 8.416) us
time_add - case 200: CPU: 5.420 us +/- 0.175 (min: 5.131 / max: 6.154) us GPU-0: 15.181 us +/- 0.516 (min: 14.336 / max: 16.192) us
time_add - case 500: CPU: 5.327 us +/- 0.179 (min: 4.963 / max: 5.752) us GPU-0: 61.522 us +/- 0.471 (min: 60.320 / max: 63.040) us
time_add - case 1000: CPU: 5.445 us +/- 0.256 (min: 5.082 / max: 6.513) us GPU-0: 225.683 us +/- 0.517 (min: 224.992 / max: 227.008) us
time_divide - case 100: CPU: 5.305 us +/- 0.119 (min: 5.002 / max: 5.567) us GPU-0: 13.009 us +/- 0.530 (min: 12.288 / max: 13.984) us
time_divide - case 200: CPU: 5.391 us +/- 0.212 (min: 4.903 / max: 6.474) us GPU-0: 31.530 us +/- 0.524 (min: 30.720 / max: 32.768) us
time_divide - case 500: CPU: 5.363 us +/- 0.115 (min: 5.097 / max: 5.690) us GPU-0: 158.026 us +/- 0.478 (min: 157.504 / max: 159.392) us
time_divide - case 1000: CPU: 5.383 us +/- 0.145 (min: 5.006 / max: 5.830) us GPU-0: 608.937 us +/- 0.526 (min: 608.256 / max: 610.080) us
time_divmod - case 100: CPU: 6.665 us +/- 0.625 (min: 6.349 / max: 10.940) us GPU-0: 15.572 us +/- 0.865 (min: 14.464 / max: 20.000) us
time_divmod - case 200: CPU: 6.638 us +/- 0.131 (min: 6.376 / max: 7.155) us GPU-0: 39.015 us +/- 0.342 (min: 38.624 / max: 40.352) us
time_divmod - case 500: CPU: 6.636 us +/- 0.172 (min: 6.343 / max: 7.385) us GPU-0: 199.698 us +/- 0.602 (min: 198.656 / max: 200.544) us
time_divmod - case 1000: CPU: 6.546 us +/- 0.283 (min: 6.192 / max: 7.851) us GPU-0: 774.010 us +/- 0.553 (min: 772.800 / max: 775.872) us
This PR
time_add - case 100: CPU: 6.097 us +/- 2.922 (min: 5.311 / max: 26.124) us GPU-0: 8.548 us +/- 2.633 (min: 6.976 / max: 26.336) us
time_add - case 200: CPU: 5.566 us +/- 0.135 (min: 5.325 / max: 6.150) us GPU-0: 15.398 us +/- 0.619 (min: 14.336 / max: 16.384) us
time_add - case 500: CPU: 5.578 us +/- 0.123 (min: 5.311 / max: 5.881) us GPU-0: 61.663 us +/- 0.358 (min: 60.512 / max: 62.688) us
time_add - case 1000: CPU: 5.600 us +/- 0.179 (min: 5.325 / max: 6.300) us GPU-0: 225.651 us +/- 0.501 (min: 225.152 / max: 227.136) us
time_divide - case 100: CPU: 5.573 us +/- 0.590 (min: 5.285 / max: 9.615) us GPU-0: 13.924 us +/- 2.598 (min: 12.288 / max: 31.168) us
time_divide - case 200: CPU: 5.455 us +/- 0.144 (min: 5.183 / max: 6.174) us GPU-0: 31.781 us +/- 0.558 (min: 30.720 / max: 32.768) us
time_divide - case 500: CPU: 5.497 us +/- 0.164 (min: 5.180 / max: 6.370) us GPU-0: 158.040 us +/- 0.465 (min: 157.440 / max: 159.168) us
time_divide - case 1000: CPU: 5.472 us +/- 0.168 (min: 5.221 / max: 6.019) us GPU-0: 609.254 us +/- 0.630 (min: 608.256 / max: 610.304) us
time_divmod - case 100: CPU: 6.782 us +/- 0.188 (min: 6.550 / max: 7.526) us GPU-0: 15.693 us +/- 0.577 (min: 14.432 / max: 16.640) us
time_divmod - case 200: CPU: 7.320 us +/- 3.447 (min: 6.380 / max: 31.090) us GPU-0: 39.539 us +/- 3.396 (min: 38.688 / max: 63.232) us
time_divmod - case 500: CPU: 6.844 us +/- 0.167 (min: 6.587 / max: 7.422) us GPU-0: 199.846 us +/- 0.593 (min: 198.752 / max: 200.704) us
time_divmod - case 1000: CPU: 6.702 us +/- 0.256 (min: 6.363 / max: 7.934) us GPU-0: 774.319 us +/- 0.461 (min: 773.888 / max: 776.064) us
There is some impact, but it is only around ~0.2 us for ufuncs, which are the functions with the least CPU time.
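For anyone who wants to reproduce numbers in this format, CuPy's built-in helper prints the same CPU/GPU-0 readout; a small sketch (the mapping of "case N" to an N x N array is an assumption based on cupy-performance, not verified here):

```python
import cupy as cp
from cupyx.profiler import benchmark

# "case 1000" is assumed to mean a 1000 x 1000 float32 array.
x = cp.ones((1000, 1000), dtype=cp.float32)

print(benchmark(cp.add, (x, x), n_repeat=100))     # time_add
print(benchmark(cp.divide, (x, x), n_repeat=100))  # time_divide
```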
Unfortunately we cannot use that right away (though it is great work nonetheless!), because CuPy does not include NumPy as a build-time dependency at all, and that would be a big move for such a niche use case. Furthermore, this PR is ready for a quick final pass and merge, and I would hate to further delay the progress here (we get pinged every week by an internal team, and neither Emilio nor I have extra bandwidth to accommodate this suggestion). Let's table the discussion in a separate issue and move on 🙏
Yeah, we are being pinged a lot about this, so it would be great if we could merge it in its current state and work on optimizations in a follow-up PR 😁
Thanks to all involved in the PR! Let me proceed to merge as discussed above.