
Fastest way to read storage buffer and copy to system RAM

Open kushalkolar opened this issue 5 months ago • 4 comments

We're working on the jpeg encoder and we're wondering what the fastest known way is to read a storage buffer from the GPU and bring it to system RAM. If I understand GPUQueue.read_buffer correctly, the copy step is not ideal, and a new buffer is also allocated in system RAM every time you call this function. I guess it would be faster if you always wrote to a specific system-memory location when downloading new data from the GPU?

For context, this buffer contains the run-length encoded data.
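
For reference, the baseline being discussed looks roughly like this (a minimal sketch; it assumes a default device from wgpu.utils and a storage buffer created with COPY_SRC so it can be read back):

import numpy as np
import wgpu
from wgpu.utils import get_default_device

device = get_default_device()

# storage buffer holding the encoded data; COPY_SRC allows readback
storage_buffer = device.create_buffer(
    size=1024,
    usage=wgpu.BufferUsage.STORAGE | wgpu.BufferUsage.COPY_SRC,
)

# read_buffer copies GPU -> temporary MAP_READ buffer -> a freshly
# allocated memoryview, i.e. new system RAM on every call
data = device.queue.read_buffer(storage_buffer)
arr = np.frombuffer(data, dtype=np.uint8)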

@apasarkar

kushalkolar avatar Jul 13 '25 02:07 kushalkolar

Some tips from me:

  1. Pre-allocate.
  2. Pinned memory is generally "required" for PCIe bus transfers, which is why CUDA provides pinned allocators: https://docs.cupy.dev/en/stable/user_guide/memory.html If you don't use them, the kernel has to pin that memory during the transfer, and anything the kernel has to do adds time. (See the sketch after this list.)
  3. Align the memory to 4096 bytes on the CPU side.
  4. Benchmark, but make sure your numbers aren't being flattered by warm caches when you do...
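
A minimal sketch of point 2, using CuPy's pinned allocator (assumes CuPy is installed and the size is illustrative; note that wgpu itself knows nothing about this pinning, so it mainly helps when the transfer actually goes through the CUDA driver):

import numpy as np
import cupy

size = 1024 * 1024  # transfer size in bytes, adjust as needed

# page-locked (pinned) host memory: the driver can DMA from/to it
# directly, without having to pin pages on every transfer
mem = cupy.cuda.alloc_pinned_memory(size)
pinned_arr = np.frombuffer(mem, dtype=np.uint8, count=size)

# keep pinned_arr around and reuse it as the destination (tip 1)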

hmaarrfk avatar Jul 13 '25 14:07 hmaarrfk

Thanks! Do you have an example of this? I don't see how this can be done with the wgpu API.

kushalkolar avatar Jul 13 '25 17:07 kushalkolar

https://github.com/pygfx/pygfx-benchmarks has buffer upload tests, and I believe the summary was in a different issue. I doubt any of those findings will apply to downloading a buffer, but they might offer some insight.

As read_buffer isn't official API, perhaps you can look at the implementation and see where shortcuts can be made, for example by using read_mapped directly: https://github.com/pygfx/wgpu-py/blob/a3c18d96d51d370039953db4b169cd9ca5d0ad20/wgpu/backends/wgpu_native/_api.py#L2483 Just avoiding all the checks might already be a speedup in really time-critical situations. Benchmarking and/or profiling is required to know, and it will be very hardware-specific. I would assume this is faster on mobile systems, for example, where memory is already shared.
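
To illustrate, a shortcut along those lines could look something like the following. This is an untested sketch using wgpu-py's non-standard map_sync/read_mapped; names and signatures may differ between wgpu-py versions, and the staging buffer is deliberately allocated once up front and reused:

import wgpu
from wgpu.utils import get_default_device

device = get_default_device()
size = 1024

storage_buffer = device.create_buffer(
    size=size,
    usage=wgpu.BufferUsage.STORAGE | wgpu.BufferUsage.COPY_SRC,
)

# create the staging buffer once and reuse it for every download,
# instead of letting read_buffer allocate a temporary one each call
staging = device.create_buffer(
    size=size,
    usage=wgpu.BufferUsage.COPY_DST | wgpu.BufferUsage.MAP_READ,
)

def download():
    encoder = device.create_command_encoder()
    encoder.copy_buffer_to_buffer(storage_buffer, 0, staging, 0, size)
    device.queue.submit([encoder.finish()])
    staging.map_sync(wgpu.MapMode.READ)
    try:
        # copy=False returns a view into the mapped range (no copy);
        # it is only valid until unmap, so consume it here, ideally
        # by copying straight into a pre-allocated array
        view = staging.read_mapped(copy=False)
        return bytes(view)  # the bytes() call is the single copy
    finally:
        staging.unmap()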

I am slightly interested in this topic, as I am looking into video recording/exporting right now, and the naive solutions all require multiple copy steps on the way to the CPU. It feels wasteful, and it should theoretically be avoidable, especially with hardware encoders on the GPU. (Btw, there might even be JPEG hardware encoders on GPUs, but I don't think wgpu has any access to media APIs right now.)

Vipitis avatar Jul 13 '25 19:07 Vipitis

You can use this to allocate aligned arrays

import numpy as np

alignment = 4096
size = 1024  # replace with your desired size in bytes

# over-allocate by one alignment unit, then slice off a view that
# starts at the first 4096-byte-aligned address
buf = np.zeros(size + alignment, dtype=np.uint8)
start_index = -buf.ctypes.data % alignment
aligned_array = buf[start_index:start_index + size]

assert aligned_array.ctypes.data % alignment == 0

Pinned memory requires kernel support, which is somewhat against the "kernel abstraction model", so you can use the NVIDIA API to allocate it if you need to. I have found that this is the least important part.

But aligned arrays are worth it.
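
Continuing from the snippet above, the aligned view can then serve as the fixed, pre-allocated destination for every download (hypothetical usage; device and storage_buffer stand in for whatever your actual setup provides):

# hypothetical download loop: reuse aligned_array instead of keeping
# the freshly allocated memoryview that read_buffer hands back
data = device.queue.read_buffer(storage_buffer)
aligned_array[:len(data)] = np.frombuffer(data, dtype=np.uint8)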

hmaarrfk avatar Jul 14 '25 00:07 hmaarrfk