[WIP] Implement DLPack

Open ax3l opened this issue 5 months ago • 5 comments

Add first-class support for zero-copy data exchange with ROCm and SYCL GPUs via DLPack interfaces.

Specs:

  • https://dmlc.github.io/dlpack/latest/python_spec.html#implementation
  • https://github.com/dmlc/dlpack/blob/v1.1/include/dlpack/dlpack.h

Note: we might want to implement a slightly older DLPack version if we do not want to bump up NumPy/CuPy/PyTorch/... to very recent versions. Do we have access to the 2025 Intel Python tools release on Aurora?
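For context on the version question: the Python spec linked above lets a consumer pass `max_version` to `__dlpack__`, and a producer can fall back to the legacy unversioned capsule when the consumer is older. A minimal sketch of that decision, assuming a hypothetical helper name `choose_capsule_name` (not part of pyAMReX or DLPack):

```python
from typing import Optional, Tuple


def choose_capsule_name(max_version: Optional[Tuple[int, int]] = None) -> str:
    """Hypothetical helper: pick the capsule flavor for __dlpack__(max_version=...)."""
    # Consumers that predate DLPack 1.0 do not pass max_version at all,
    # so None is treated as "legacy, unversioned capsule only".
    if max_version is None or max_version[0] < 1:
        return "dltensor"
    # Consumers that announce major version 1 or newer can take the versioned capsule.
    return "dltensor_versioned"


print(choose_capsule_name())        # older consumer
print(choose_capsule_name((1, 1)))  # consumer built against the v1.1 header
```

Supporting both branches would be one way to avoid forcing very recent NumPy/CuPy/PyTorch versions, at the cost of maintaining two export paths.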

Close #9

Action Items

  • [x] start by vibing while preparing dinner, then manually:
  • [x] review and finish Array4
  • [ ] PODVector
  • [ ] Vector
  • [ ] ArrayOfStructs
  • [ ] BaseFab
  • [ ] SmallMatrix
  • [x] SYCL: Implement .to_dpnp / .to_dpctl helper functions
  • [x] Update .to_xp functions to use .to_dpnp or .to_dpctl for SYCL GPUs
  • [x] Test on CUDA GPU
  • [ ] Test on ROCm GPU
  • [x] Test on SYCL GPU (help wanted)
  • [ ] Search docs for needed updates.
  • [ ] Fix DLPack stubs https://github.com/sizmailov/pybind11-stubgen/pull/258 or bind manually in pyAMReX

ax3l avatar Jul 23 '25 00:07 ax3l

I performed some testing of the new functionality on Perlmutter. After the latest commit, the following appears to work as intended:

def test_mfab_cuda_cupy(mfab_device):
    import cupy as cp

    # AMReX -> CuPy: wrap each box's Array4 via DLPack and write one element
    for mfi in mfab_device:
        marr_cupy_from_dlpack = cp.from_dlpack(mfab_device.array(mfi))
        marr_cupy_from_dlpack[0, 1, 3, 2] = 5

    # Wrap the same data again to check that the write is visible
    for mfi in mfab_device:
        marr_cupy_from_dlpack = cp.from_dlpack(mfab_device.array(mfi))
        print(marr_cupy_from_dlpack[0, 1, 3, 2])

It executes without failure and prints the modified value 5. Inspecting the DLDevice showed that the device was correctly identified as kDLCUDA. The reported device ID was 3, which is consistent with Perlmutter's standard rank-to-GPU mapping when running with a single MPI rank.
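For anyone repeating this check, here is a small sketch for decoding the `(device_type, device_id)` pair that `__dlpack_device__()` returns. The enum values are copied from `dlpack.h` (only a subset is listed); the class and the `describe_device` helper are illustrative and are not the pyAMReX bindings.

```python
from enum import IntEnum


class DLDeviceType(IntEnum):
    """Subset of the DLDeviceType values defined in dlpack.h."""
    kDLCPU = 1
    kDLCUDA = 2
    kDLCUDAHost = 3
    kDLROCM = 10
    kDLCUDAManaged = 13
    kDLOneAPI = 14


def describe_device(device):
    """Hypothetical helper: format a (device_type, device_id) pair for printing."""
    device_type, device_id = device
    return f"{DLDeviceType(int(device_type)).name}:{int(device_id)}"


# The pair observed in the Perlmutter test above
print(describe_device((2, 3)))  # kDLCUDA:3
```

With a real array this would presumably be called as `describe_device(obj.__dlpack_device__())`, assuming the object implements the DLPack protocol.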

roelof-groenewald avatar Jul 25 '25 06:07 roelof-groenewald

Awesome, then we are nearly there.

Try the dpnp logic for SYCL next?

ax3l avatar Jul 25 '25 18:07 ax3l

I tested the DLPack functionality on Aurora (SYCL) and it now also produces the expected result. I also modified Array4_to_xp to take the GPU backend into account. We can now successfully access a MultiFab's Array4 from a SYCL device with:

for mfi in mfab_device:
    mfab_device.array(mfi).to_dpnp()
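
To illustrate what "taking the GPU backend into account" could look like, here is a rough sketch of a backend-to-library dispatch. This is not the actual Array4_to_xp code: the function name `pick_array_module` and the backend strings "CUDA", "HIP" and "SYCL" are assumptions made for the example.

```python
def pick_array_module(gpu_backend):
    """Hypothetical helper: name the array library to use for a GPU backend.

    gpu_backend is assumed to be None for CPU-only builds, or one of the
    strings "CUDA", "HIP", "SYCL".
    """
    if gpu_backend is None:
        return "numpy"  # host memory
    modules = {
        "CUDA": "cupy",  # CuPy for NVIDIA GPUs
        "HIP": "cupy",   # CuPy also offers ROCm builds
        "SYCL": "dpnp",  # dpnp for SYCL devices
    }
    try:
        return modules[gpu_backend]
    except KeyError:
        raise ValueError(f"unknown GPU backend: {gpu_backend!r}") from None


print(pick_array_module("SYCL"))  # dpnp
```

Returning the module name rather than importing it keeps the sketch free of optional dependencies; a real helper would import the chosen library lazily.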

roelof-groenewald avatar Jul 26 '25 06:07 roelof-groenewald

I compiled WarpX on Aurora using this pyamrex branch. With it I was able to successfully run a multi-GPU simulation that uses fields.py to read MultiFab values 🎉 🚀

roelof-groenewald avatar Jul 26 '25 07:07 roelof-groenewald

We need to rebase against development now that #455 has been merged. I have already added the DLDeviceType bindings, and the other PR adds capsule type hints.

ax3l avatar Jul 28 '25 16:07 ax3l