
Support persistently mapped buffers

Open haasn opened this issue 7 years ago • 3 comments

Right now, the only way to use buffers is to do a round-trip through [] and copy the contents into the buffer one by one. For streaming large amounts of data, this can be very inefficient.

It would be beneficial if it were possible to directly map a persistent buffer binding (as a Ptr () or otherwise), so I could do fancy things like decoding my data straight into the mapped buffer, avoiding the extra round-trip, the memory copy, and the garbage collection.
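For concreteness, here is roughly what the status quo forces (a minimal sketch against GPipe's list-based writeBuffer; decodeFrame and the Word32-per-pixel format are hypothetical stand-ins, not any real decoder API):

    import Control.Monad.IO.Class (liftIO)
    import Data.Word (Word32)
    import Graphics.GPipe

    -- The round-trip this issue is about: every upload goes through a
    -- Haskell list, built element by element on the heap.
    -- 'decodeFrame :: IO [Word32]' is a hypothetical decoder call.
    streamFrame buf decodeFrame = do
      pixels <- liftIO decodeFrame   -- materialize the whole frame as [Word32]
      writeBuffer buf 0 pixels       -- then copy it into the Buffer one by one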

haasn avatar Jul 05 '17 10:07 haasn

Perhaps I'm misunderstanding your proposal, but doesn't a program need to copy from any data structure through OpenGL to the buffer in the GPU's onboard memory? Whether that data structure is [] or a Ptr, it will still be copied.

ghost avatar Jul 08 '17 20:07 ghost

@plredmond Not quite. With an OpenGL PBO (or OpenCL/Vulkan/CUDA mapped buffer), the GPU's DMA engine can directly upload the data into device memory without needing to relocate it in host memory first.

Say you implement a video player, and you have an external library like libavcodec which can decode individual frames directly into a Ptr of your choice. The standard path then looks like this:

  1. libavcodec decodes the data into a Ptr Word8; normally this is a buffer that libavcodec mallocs internally.
  2. You either memcpy this into a mapped OpenGL PBO (which will be backed by DMA-visible pinned memory), or call glTexImage2D on it and the OpenGL driver will internally copy it into a DMA buffer for you (see the sketch after this list).
  3. The GPU's DMA engine sees the data and can begin streaming the contents into device memory.
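A minimal sketch of step 2's copy, assuming the raw bindings from the gl package and hypothetical framePtr / frameBytes values handed back by the decoder (a PBO is assumed to be bound to GL_PIXEL_UNPACK_BUFFER already):

    import Foreign.Marshal.Utils (copyBytes)
    import Foreign.Ptr (Ptr)
    import Graphics.GL  -- raw bindings from the "gl" package

    -- Map the bound PBO, memcpy the decoded frame into it, and unmap
    -- so the driver can kick off the DMA transfer.
    uploadViaCopy :: Ptr () -> Int -> IO ()
    uploadViaCopy framePtr frameBytes = do
      dst <- glMapBufferRange GL_PIXEL_UNPACK_BUFFER 0
               (fromIntegral frameBytes) GL_MAP_WRITE_BIT
      copyBytes dst framePtr frameBytes  -- the extra memcpy this thread is about
      _ <- glUnmapBuffer GL_PIXEL_UNPACK_BUFFER
      pure ()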

But this requires an extra memcpy / indirection. If you instead map the PBO persistently (GL_MAP_PERSISTENT_BIT), you can do the following:

  1. You tell libavcodec to decode into your pre-mapped Ptr Word8 instead of letting it allocate its own internal buffer.
  2. The decoded data is already pinned / DMA-visible, so all you need to do is flush the affected memory range and the GPU can start streaming from it (sketched below).
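A minimal sketch of that setup, again assuming the gl package; the buffer target, size parameter and function names I introduce here are illustrative, and real code would additionally need a fence (glFenceSync / glClientWaitSync) before reusing the mapped range:

    import Data.Bits ((.|.))
    import Foreign.Marshal.Alloc (alloca)
    import Foreign.Ptr (Ptr, nullPtr)
    import Foreign.Storable (peek)
    import Graphics.GL  -- raw bindings from the "gl" package

    -- Allocate an immutable, persistently mapped PBO once; the decoder
    -- can then write every frame straight into the returned pointer.
    createPersistentPBO :: GLsizeiptr -> IO (GLuint, Ptr ())
    createPersistentPBO size = do
      pbo <- alloca $ \p -> glGenBuffers 1 p >> peek p
      glBindBuffer GL_PIXEL_UNPACK_BUFFER pbo
      let storageFlags = GL_MAP_WRITE_BIT .|. GL_MAP_PERSISTENT_BIT
      glBufferStorage GL_PIXEL_UNPACK_BUFFER size nullPtr storageFlags
      ptr <- glMapBufferRange GL_PIXEL_UNPACK_BUFFER 0 size
               (storageFlags .|. GL_MAP_FLUSH_EXPLICIT_BIT)
      pure (pbo, ptr)

    -- After libavcodec has decoded a frame into the pointer, make the
    -- written range visible to the GPU; no memcpy involved.
    flushFrame :: GLsizeiptr -> IO ()
    flushFrame size = glFlushMappedBufferRange GL_PIXEL_UNPACK_BUFFER 0 size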

Depending on the use case, the extra memcpy can be needlessly wasteful or even problematic: if the memory was swapped out, compressed, on the wrong NUMA node, or otherwise not immediately accessible, you can get nasty pipeline stalls. So it would be nice if we could figure out some way of avoiding it.

Unfortunately, all of this requires pretty much breaking the Haskell “safety” and dealing with direct pointers, buffers, flushing and fences (for synchronization), so I'm not sure if it can fit into the high-level GPipe API.

I guess what I'm missing in general is the ability to use GPipe at a high level but “bypass” it to insert raw underlying calls when I promise “I Know What I'm Doing(tm)”, such as the ability to define my own raw GLSL function calls. Maybe that's the overarching problem here?

Giving up all of GPipe's exceptional ease-of-use just for the ability to make one raw gl call is a bit daunting.

haasn avatar Jul 08 '17 22:07 haasn

For PBOs to be useful you would need asynchronous upload, which is hard to do in a safe way (see #40). But even without PBOs, Buffers in GPipe are already "persistent" in the sense that they live on the GPU. When writing to a Buffer in GPipe you are actually using the DMA engine, and subsequent calls that are not data-dependent on the buffer you just wrote will be run concurrently by your OpenGL driver. That you are using a [] instead of directly poking a Ptr doesn't change that, just as @plredmond commented.

So, if you can get libavcodec to provide a [] instead of giving it a Ptr (i.e. let GPipe pull the data instead of having libavcodec push it), you can get rid of the extra memcpy that way (one possible shape of that is sketched below). Still, the decoding would then happen in sync with other GPipe calls; to alleviate that we would need #40. Does that work for you?
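For illustration, one hypothetical way to do that pull (not part of GPipe; framePtr and frameLen stand in for whatever the decoder hands back, and the pointer must stay alive until the list has been fully consumed):

    import Data.Word (Word8)
    import Foreign.Ptr (Ptr)
    import Foreign.Storable (peekElemOff)
    import System.IO.Unsafe (unsafeInterleaveIO)

    -- Expose the decoder's output buffer as a lazily produced list, so
    -- a consumer like writeBuffer reads bytes straight from the Ptr
    -- instead of forcing a strict intermediate copy of the whole frame.
    pullFrame :: Ptr Word8 -> Int -> IO [Word8]
    pullFrame framePtr frameLen = go 0
      where
        go i | i >= frameLen = pure []
             | otherwise     = unsafeInterleaveIO $
                 (:) <$> peekElemOff framePtr i <*> go (i + 1)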

tobbebex avatar Aug 09 '17 20:08 tobbebex