[WIP] Implement DLPack
Add first-class support for zero-copy data exchange with ROCm and SYCL GPUs via DLPack interfaces.
Specs:
- https://dmlc.github.io/dlpack/latest/python_spec.html#implementation
- https://github.com/dmlc/dlpack/blob/v1.1/include/dlpack/dlpack.h
Note: we might want to implement a slightly older DLPack version if we do not want to bump up NumPy/CuPy/PyTorch/... to very recent versions. Do we have access to the 2025 Intel Python tools release on Aurora?
Close #9
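For the version question above, here is a rough sketch of how a consumer can cope with both older and newer producers: it first asks for a versioned capsule via the `max_version` keyword and falls back to the legacy call if the producer rejects that keyword. The two producer classes and `request_capsule` are hypothetical stand-ins, not pyAMReX code, and they return the capsule *name* as a string instead of a real PyCapsule.

```python
# Illustrative stand-ins only: real producers return a PyCapsule, not a string.
class LegacyProducer:
    """Mimics an exporter that predates DLPack 1.0 (no max_version keyword)."""

    def __dlpack__(self, stream=None):
        return "dltensor"


class VersionedProducer:
    """Mimics an exporter that understands the max_version keyword."""

    def __dlpack__(self, *, stream=None, max_version=None):
        # Only hand out the versioned capsule if the consumer asked for >= 1.0.
        if max_version is not None and max_version[0] >= 1:
            return "dltensor_versioned"
        return "dltensor"


def request_capsule(producer, max_version=(1, 1)):
    """Ask for a versioned capsule first, fall back to the legacy call."""
    try:
        return producer.__dlpack__(max_version=max_version)
    except TypeError:
        # Older producers reject the unknown keyword argument.
        return producer.__dlpack__()
```

If we go with an older DLPack version on our side, we would behave like `LegacyProducer` here, and recent consumers should still be able to fall back.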
Action Items
- [x] start by vibing while preparing dinner, then manually:
  - [x] review and finish Array4
  - [ ] PODVector
  - [ ] Vector
  - [ ] ArrayOfStructs
  - [ ] BaseFab
  - [ ] SmallMatrix
- [x] SYCL: Implement `.to_dpnp`/`.to_dpctl` helper functions
- [x] Update `.to_xp` functions to use `.to_dpnp` or `.to_dpctl` for SYCL GPUs
- [x] Test on CUDA GPU
- [ ] Test on ROCm GPU
- [x] Test on SYCL GPU (help wanted)
- [ ] Search docs for needed updates.
- [ ] Fix DLPack stubs https://github.com/sizmailov/pybind11-stubgen/pull/258 or bind manually in pyAMReX
I performed some testing of the new functionality on Perlmutter. After the latest commit, the following appears to work as intended:
```python
def test_mfab_cuda_cupy(mfab_device):
    import cupy as cp

    # AMReX -> cupy
    for mfi in mfab_device:
        marr_cupy_from_dlpack = cp.from_dlpack(mfab_device.array(mfi))
        marr_cupy_from_dlpack[0, 1, 3, 2] = 5

    for mfi in mfab_device:
        marr_cupy_from_dlpack = cp.from_dlpack(mfab_device.array(mfi))
        print(marr_cupy_from_dlpack[0, 1, 3, 2])
```
It executes without failure and prints the modified value 5. Inspecting the DLDevice showed that the device was correctly identified as kDLCUDA. The reported device ID was 3, which is consistent with Perlmutter's standard rank-to-GPU mapping when running with just one MPI rank.
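For anyone repeating this check on another machine, a small sketch of how the `(device_type, device_id)` pair from `__dlpack_device__()` can be made readable. The enum below is a local stand-in that copies a subset of the `DLDeviceType` values from `dlpack.h`; it is not the pyAMReX binding, and `describe_device` is a hypothetical helper.

```python
from enum import IntEnum


class DLDeviceType(IntEnum):
    """Subset of the DLDeviceType values defined in dlpack.h."""

    kDLCPU = 1
    kDLCUDA = 2
    kDLCUDAHost = 3
    kDLROCM = 10
    kDLROCMHost = 11
    kDLCUDAManaged = 13
    kDLOneAPI = 14


def describe_device(dlpack_device):
    """Turn a (device_type, device_id) tuple into a readable label."""
    device_type, device_id = dlpack_device
    return f"{DLDeviceType(device_type).name}:{device_id}"


# The tuple reported in the Perlmutter test above
print(describe_device((2, 3)))  # kDLCUDA:3
```

On Aurora, the same helper should report `kDLOneAPI` if the SYCL path sets the device type as the spec describes.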
Awesome, then we are nearly there.
Try the dpnp logic for SYCL next?
I tested the DLPack functionality on Aurora (SYCL), and it now also produces the expected result. I also modified Array4_to_xp to take the GPU backend into account. We can now successfully access a MultiFab's Array4 from a SYCL device with:
```python
for mfi in mfab_device:
    mfab_device.array(mfi).to_dpnp()
```
I compiled WarpX on Aurora using this pyamrex branch. With it I was able to successfully run a multi-GPU simulation that uses fields.py to read MultiFab values 🎉 🚀
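To make the backend-aware choice in Array4_to_xp easier to discuss, here is a minimal sketch of the kind of dispatch involved. It is not the code in this branch: the helper names are hypothetical, the selection is keyed on the DLPack device type purely for illustration, and mapping ROCm to CuPy is an assumption.

```python
import importlib

# DLPack device type codes, as defined in dlpack.h
KDL_CPU, KDL_CUDA, KDL_ROCM, KDL_ONEAPI = 1, 2, 10, 14

# Assumed mapping: CuPy for CUDA and ROCm, dpnp for SYCL, NumPy on the host.
_MODULE_FOR_DEVICE = {
    KDL_CPU: "numpy",
    KDL_CUDA: "cupy",
    KDL_ROCM: "cupy",
    KDL_ONEAPI: "dpnp",
}


def array_module_name(device_type):
    """Return the name of the array library assumed for a device type."""
    try:
        return _MODULE_FOR_DEVICE[device_type]
    except KeyError:
        raise ValueError(f"unsupported DLPack device type: {device_type}") from None


def load_array_module(device_type):
    """Import the matching array library only when it is actually needed."""
    return importlib.import_module(array_module_name(device_type))
```

Importing lazily keeps CuPy and dpnp optional, so a host-only install does not need either package.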
We need to rebase against development, since #455 was merged. I have already added the DLDeviceType bindings here, and the other PR adds the capsule type hints.