Device support in `zarr-python` (especially for GPU)

nenb opened this issue 1 year ago • 9 comments

Problem

I would like to load zarr data directly onto non-CPU devices (especially GPU). The current approach appears to rely on using cupy to load onto cupy-supported devices, e.g. https://github.com/rapidsai/kvikio/blob/branch-25.02/notebooks/zarr.ipynb.

Unfortunately, there are a number of devices that are not supported by cupy; e.g. I don't believe that my Apple Metal GPU is supported. This means that I must load from zarr via the CPU if I would like to use these devices, e.g. zarr on disk -> numpy -> torch (which has Metal support).

This is slower, and I don't believe it is required by the zarr specification itself (?).
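For concreteness, the path I'm forced into today looks roughly like this ("weights.zarr" is a placeholder path):

import torch
import zarr

z = zarr.open("weights.zarr", mode="r")         # zarr array on disk
cpu_array = z[:]                                # decoded into a numpy array on the CPU
tensor = torch.from_numpy(cpu_array).to("mps")  # extra host-to-device copy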

Background

Multi-device support is a very important requirement in the AI/ML community. I would like to use zarr (and specifically the Python implementation) to run models such as LLMs on multiple devices. The quicker a model can be loaded onto the device (and with reduced memory usage, etc.), the better the user and developer experience.

Questions

  1. Is cupy the correct/only way to load direct to GPU with zarr-python?
  2. Is there/will there be any way of loading direct to devices such as Metal with zarr-python?
  3. (Related) What is the best way to load a PyTorch neural network on GPU with zarr-python? Is it cupy and then using something like dlpack for zero-copy exchange (see the sketch after this list)? Are there alternatives?
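For concreteness, the cupy -> dlpack handoff in question 3 would look something like this, assuming the standard DLPack support in both libraries:

import cupy as cp
import torch

gpu_array = cp.arange(10)              # data already resident on a CUDA device
tensor = torch.from_dlpack(gpu_array)  # zero-copy handoff via DLPack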

Related issues https://github.com/zarr-developers/zarr-python/pull/1967 https://github.com/zarr-developers/zarr-python/issues/2574

cc @jhamman (as suggested by @TomNicholas)

nenb avatar Jan 06 '25 19:01 nenb

CuPy/kvikio relies on NVIDIA's GPUDirect Storage (GDS) driver and goes through PCIe. Metal GPUs use unified memory, so a CPU-to-GPU transfer can in theory be almost zero-cost (passing an address). If there is a way to pass ownership of an array from CPU to GPU, nothing needs to be done in zarr unless there is a need for GPU-accelerated decompression.

In practice though, at least torch implements the to("mps") method by cloning the tensor (memcpy-ish cost), and each ML framework may do different things. Another reference point is jax, which implements (experimental) serialization to zarr using tensorstore.
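To illustrate the torch behavior (a hypothetical snippet; requires a torch build with MPS support):

import numpy as np
import torch

a = np.ones((1024, 1024), dtype=np.float32)
t = torch.from_numpy(a)  # zero-copy: shares memory with the numpy array
t_mps = t.to("mps")      # materializes a separate MPS tensor (memcpy-ish cost)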

ziw-liu avatar Jan 08 '25 17:01 ziw-liu

Thanks for the pointers, @ziw-liu - your comment sent me down a rabbit-hole and I think I’ve now got a more concrete proposal worth floating.

TL;DR

  • Goal: let zarr-python decode straight into MLX arrays so that data arriving from disk is already resident in Apple-Silicon unified memory and instantly visible to both CPU and GPU.
  • How: add a UnifiedMemoryBuffer (name TBD) that satisfies zarr-python's Buffer abstraction and is backed by MLX arrays under the hood.
  • Why it matters: MLX is basically the (zero-copy, GPU-aware) “NumPy for Apple devices.” Bridging Zarr → MLX lets Mac users stream huge model weights or datasets directly to unified memory, and would hopefully also help grow the Zarr community.

1 Background – why MLX is special

  • Unified Memory
    On Apple Silicon the CPU and GPU share the same memory. Frameworks that understand this (MLX) can treat “device transfers” as metadata-only operations. There is an issue by one of the framework authors that explains well why such a new framework was introduced.

  • PyTorch’s current limitation
    AFAIK (based on @ziw-liu's comment above and my further research), torch.Tensor.to("mps") still performs a clone because its device model was designed around discrete GPUs. That removes a lot of the benefit of unified memory.

  • Enter MLX
    MLX (docs, blog) wraps a fairly thin C++/Metal runtime in an almost-NumPy API. Because of this, MLX feels like the natural target array type for a Zarr buffer on Macs; a small taste of the API follows below.
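To make the “transfers are metadata” point concrete, a small snippet using MLX's documented stream/device API:

import mlx.core as mx

a = mx.ones((4, 4))                # allocated once in unified memory
mx.eval(mx.sum(a, stream=mx.cpu))  # run the op on the CPU...
mx.eval(mx.sum(a, stream=mx.gpu))  # ...or on the GPU; `a` itself is never copied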

2 What I’m proposing

A new Buffer implementation specifically for Macs with unified memory.
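A minimal sketch of what I have in mind, assuming the Buffer interface in zarr.core.buffer (UnifiedMemoryBuffer is a placeholder name, and the real interface has more methods than shown here):

import mlx.core as mx
import numpy as np
from zarr.core.buffer import Buffer

class UnifiedMemoryBuffer(Buffer):
    """A flat byte buffer backed by an MLX array in unified memory."""

    def __init__(self, array: mx.array) -> None:
        # A 1-D uint8 MLX array, visible to both CPU and GPU on Apple Silicon.
        self._data = array

    @classmethod
    def from_bytes(cls, data: bytes) -> "UnifiedMemoryBuffer":
        return cls(mx.array(np.frombuffer(data, dtype=np.uint8)))

    def as_numpy_array(self) -> np.ndarray:
        # MLX-to-numpy conversion; whether this can be zero-copy is an open
        # question I would want to answer in the PoC.
        return np.array(self._data)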

3 Questions

  • There are many gaps in my knowledge of zarr-python. I would appreciate comments from those more familiar with the codebase on why this might not be a good idea or might not work as I hope!

  • The potential of Unified Memory to facilitate shared CPU-GPU data access seems particularly relevant to Zarr codecs, especially with the ongoing exploration of GPU-based decompression to alleviate CPU bottlenecks in ML workflows. For those with deep knowledge of the Zarr codec pipeline, I'd greatly appreciate any considerations or challenges I should be aware of when (potentially) exploring GPU-accelerated decompression within a Unified Memory context.

4 Final Notes

There are a lot more subtleties to implementing this than I have outlined in this issue. But I wanted to start an initial discussion here to get some feedback, before proceeding to a PoC.

I know there’s a lot I don’t know about Zarr internals. Any pointers, pitfalls, or “please don’t do it this way” comments are very welcome!

nenb avatar May 01 '25 00:05 nenb

A new Buffer implementation specifically for macs with unified memory.

I think the idea behind the buffer API design was to support exactly this strategy, so it looks like the right direction to me!

d-v-b avatar May 01 '25 08:05 d-v-b

One other thing to think through is the config system (docs: https://zarr.readthedocs.io/en/stable/user-guide/gpu.html). We currently have a high-level zarr.config.enable_gpu() that updates a few config settings (the default buffer type being the big one). At least at the moment that's tied directly to CUDA / cuPy / NVIDIA GPUs. We'll need to figure out whether we want to try to have "gpu" mean "figure out stuff at runtime, based on the resources available". That sounds a bit complicated so for now I'd recommend namespacing everything under "mlx".
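For reference, the current CUDA-only flow from those docs looks roughly like this:

import cupy as cp
import zarr

zarr.config.enable_gpu()  # switches the default buffer types to the CUDA-backed ones
store = zarr.storage.MemoryStore()
z = zarr.create_array(store=store, shape=(100, 100), chunks=(10, 10), dtype="float32")
z[:] = cp.random.standard_normal((100, 100), dtype=cp.float32)
print(type(z[:10, :10]))  # cupy.ndarray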

For those with deep knowledge of the Zarr codec pipeline, I'd greatly appreciate any considerations or challenges I should be aware of when (potentially) exploring GPU-accelerated decompression within a Unified Memory context

I'm not an expert, but am starting to dig into it as part of #2904. It's pretty challenging... At least for NVIDIA GPUs, I think we might need finer-grained control over what the input and output buffers are for each stage of the pipeline. Maybe that's not an issue with the unified memory model, though.

TomAugspurger avatar May 01 '25 17:05 TomAugspurger

3. (Related) What is the best way to load a PyTorch neural network on GPU with zarr-python? Is it cupy and then using something like dlpack for zero-copy exchange? Are there alternatives?

Could we standardize on dlpack actually? Zarrs has moved in the direction of dlpack (see https://github.com/zarrs/zarrs/issues/113), and it seems like MLX supports dlpack already - https://github.com/ml-explore/mlx/issues/1080. We can make dlpack the common in-memory format for Arrays/Tensors, much like what Arrow has done for Tables/Dataframes (xref https://stackoverflow.com/questions/77347905/what-are-the-aims-for-the-apache-arrow-tensor-extension-types/77351212#77351212)

Edit: I've also got DLPack working for TIFFs at https://github.com/weiji14/cog3pio/pull/32, and have a longer-term goal of making TIFF->DLPack->CuPy work at https://github.com/weiji14/cog3pio/issues/26. I do think DLPack is what we should be using for cross-device (MLX/CUDA/ROCm) and cross-format (Zarr/TIFF/HDF) zero-copy.

weiji14 avatar May 24 '25 21:05 weiji14

Could we standardize on dlpack actually?

What would this mean in practice? What would AsyncArray.getitem return? Currently, it returns an NDArrayLike, whose concrete type depends on the buffer type being used (either a numpy ndarray for cpu or a cupy ndarray for gpu, both of which implement the dlpack interface I think).

TomAugspurger avatar Jun 25 '25 14:06 TomAugspurger

Could we standardize on dlpack actually?

What would this mean in practice? What would AsyncArray.getitem return? Currently, it returns an NDArrayLike, whose concrete type depends on the buffer type being used (either a numpy ndarray for cpu or a cupy ndarray for gpu, both of which implement the dlpack interface I think).

If I'm not mistaken, AsyncArray would need to implement __dlpack__() and __dlpack_device__() methods according to https://dmlc.github.io/dlpack/latest/python_spec.html#syntax-for-data-interchange-with-dlpack. The __dlpack__() method would return a PyCapsule object that contains a DLManagedTensor, and __dlpack_device__() would return a tuple (device_type, device_id), where device_type is 1 for CPU, 2 for CUDA, 8 for METAL, etc.
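A hypothetical producer-side sketch (self._materialized is a made-up attribute standing in for the decoded in-memory array, e.g. a numpy or cupy ndarray, which already speaks DLPack):

class AsyncArray:
    ...

    def __dlpack__(self, *, stream=None):
        # Delegate to the in-memory array, which returns a PyCapsule
        # wrapping a DLManagedTensor as required by the spec.
        return self._materialized.__dlpack__(stream=stream)

    def __dlpack_device__(self) -> tuple[int, int]:
        # (device_type, device_id): (1, 0) CPU, (2, 0) CUDA, (8, 0) Metal
        return self._materialized.__dlpack_device__()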

Note: It might be good to reference the implementation in https://github.com/zarrs/zarrs/pull/155, I'm not 100% sure if we're looking at the right level of abstraction, i.e. if AsyncArray is the correct place to implement these dlpack methods.

Once that is done, then any array library that implements from_dlpack() can consume the Zarr array in a zero-copy-ish way, meaning the code would look like:

import cupy
import numpy
import torch

array: AsyncArray = ...  # an open zarr array (elided)

ndarray = numpy.from_dlpack(array)  # zero-copy into numpy (CPU)
cparray = cupy.from_dlpack(array)   # zero-copy into cupy (CUDA)
tensor = torch.from_dlpack(array)   # zero-copy into a torch tensor

There might be a bit more code needed in zarr-python for direct reads to Metal GPUs (i.e. to MLX tensors), but I'd say it should be done with the DLPack protocol in mind.

weiji14 avatar Jun 25 '25 21:06 weiji14

AsyncArray would need to implement

OK, yeah that's what I was wondering. There's an issue somewhere (couldn't find it in a search) about Array.__getitem__ / AsyncArray.getitem returning their own type, rather than a concrete type, which sounds related here.

I'll need to think through stuff a bit more, but I'm not sure what the interaction of dlpack and I/O is. I think that dlpack is typically used to represent in-memory arrays, so we'd need some kind of array type that represents an in-memory array...

TomAugspurger avatar Jun 25 '25 21:06 TomAugspurger

OK, yeah that's what I was wondering. There's an issue somewhere (couldn't find it in a search) about Array.__getitem__ / AsyncArray.getitem returning their own type, rather than a concrete type, which sounds related here.

https://github.com/zarr-developers/zarr-python/discussions/1603

d-v-b avatar Jun 26 '25 07:06 d-v-b