GPipe-Core
Support persistently mapped buffers
Right now, the only way to use buffers is to do a round-trip through a `[]` and copy the contents into the buffer element by element. For streaming large amounts of data, this can be very inefficient.
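For reference, a minimal sketch of that round-trip, assuming GPipe ≥ 2.2's `ContextHandler`-based API (the `B Float` payload is just a placeholder):

```haskell
import Graphics.GPipe
import Control.Monad.IO.Class (MonadIO)

-- All host data currently has to pass through a [] to reach a GPU buffer.
fillBuffer :: (ContextHandler ctx, MonadIO m)
           => [Float] -> ContextT ctx os m (Buffer os (B Float))
fillBuffer xs = do
  buf <- newBuffer (length xs)  -- allocate GPU-side storage
  writeBuffer buf 0 xs          -- contents are copied in element by element
  pure buf
```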
It would be beneficial if it were possible to directly map a persistent buffer binding (as a `Ptr ()` or otherwise), so I could do fancy things like decoding my data directly into the mapped buffer, avoiding the extra round-trip, memory copy, and garbage collection.
Perhaps I'm misunderstanding your proposal, but doesn't a program need to copy from any data structure through OpenGL to the buffer in the GPU's onboard memory? Whether that data structure is a `[]` or a `Ptr ..`, it will still be copied.
@plredmond Not quite. With an OpenGL PBO (or an OpenCL/Vulkan/CUDA mapped buffer), the GPU's DMA engine can upload the data directly into device memory without needing to relocate it in host memory first.
Say you implement a video player, and you have an external library like libavcodec which can decode the individual frames directly into a `Ptr` of your choice. There are now two distinct possibilities:
- `libavcodec` decodes the data into a `Ptr Word8`; normally this will be a buffer that libavcodec internally `malloc`s.
- You either `memcpy` this into a mapped OpenGL PBO (which will be backed by DMA-visible pinned memory), or call `glTexSubImage2D` on it and the OpenGL driver will internally copy it into a DMA buffer for you (see the sketch after this list).
- The GPU's DMA engine sees the data and can begin streaming the contents into device memory.
But this requires an extra `memcpy` / indirection. If you map the PBO persistently (`GL_MAP_PERSISTENT_BIT`), then you can do the following:
- You tell `libavcodec` to decode into your pre-mapped `Ptr Word8` instead of allocating its own internal buffer.
- The decoded data is already pinned / DMA-visible, and all you need to do is flush the affected memory range before the GPU can start streaming from it (sketched below).
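A minimal sketch of that persistent path, under the same assumptions as above (raw `gl` bindings, hypothetical `decodeFrameInto` callback); real code would additionally need a fence (`glFenceSync` / `glClientWaitSync`) before reusing the mapped range:

```haskell
import Graphics.GL
import Data.Bits ((.|.))
import Data.Word (Word8)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Ptr (castPtr, nullPtr, Ptr)
import Foreign.Storable (peek)

frameBytes :: Int
frameBytes = 1920 * 1080 * 4  -- one RGBA8 frame, size chosen arbitrarily

-- Hypothetical: the decoder writes one frame into the pointer we give it.
decodeFrameInto :: Ptr Word8 -> IO ()
decodeFrameInto _ = error "stand-in for a libavcodec FFI call"

-- Create a PBO with immutable storage (GL 4.4 / ARB_buffer_storage) and
-- keep it mapped for its whole lifetime.
setupPersistentPBO :: IO (GLuint, Ptr Word8)
setupPersistentPBO = do
  pbo <- alloca $ \p -> glGenBuffers 1 p >> peek p
  glBindBuffer GL_PIXEL_UNPACK_BUFFER pbo
  let storageFlags = GL_MAP_WRITE_BIT .|. GL_MAP_PERSISTENT_BIT
  glBufferStorage GL_PIXEL_UNPACK_BUFFER (fromIntegral frameBytes)
    nullPtr storageFlags
  ptr <- glMapBufferRange GL_PIXEL_UNPACK_BUFFER 0 (fromIntegral frameBytes)
           (storageFlags .|. GL_MAP_FLUSH_EXPLICIT_BIT)
  pure (pbo, castPtr ptr)

-- The persistent path: the decoder writes straight into pinned memory.
-- Assumes the PBO is still bound to GL_PIXEL_UNPACK_BUFFER.
uploadDirect :: Ptr Word8 -> GLuint -> IO ()
uploadDirect mapped tex = do
  decodeFrameInto mapped  -- no staging copy
  glFlushMappedBufferRange GL_PIXEL_UNPACK_BUFFER 0 (fromIntegral frameBytes)
  glBindTexture GL_TEXTURE_2D tex
  glTexSubImage2D GL_TEXTURE_2D 0 0 0 1920 1080 GL_RGBA GL_UNSIGNED_BYTE nullPtr
```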
Depending on the use case, the extra `memcpy` can be needlessly wasteful or even problematic (if the memory was swapped out, compressed, on the wrong NUMA node, or otherwise not immediately accessible, you can get nasty pipeline stalls because of this), so it would be nice if we could figure out some way of avoiding it.
Unfortunately, all of this requires pretty much breaking Haskell's “safety” and dealing with raw pointers, buffers, flushing and fences (for synchronization), so I'm not sure it can fit into the high-level GPipe API.
I guess what I'm missing in general is the ability to use GPipe at a high level but “bypass” it to insert raw underlying calls when I promise I Know What I'm Doing™, such as the ability to define my own raw GLSL function calls. Maybe that's the overarching problem here?
Giving up all of GPipe's exceptional ease of use just for the ability to make one raw `gl` call is a bit daunting.
For PBOs to be useful you would need asynchronous upload, which is hard to do in a safe way (see #40). But even without PBOs, Buffers in GPipe are already "persistent" in the sense that they live on the GPU. When writing to a Buffer in GPipe you are actually using the DMA engine, and subsequent calls that are not data-dependent on the buffer you just wrote will be run concurrently by your OpenGL driver. That you are using a `[]` instead of directly poking a `Ptr` doesn't change that, just as @plredmond commented.
So, if you can get libavcodec to provide a `[]` instead of giving it a `Ptr` (i.e. let GPipe pull the data instead of having libavcodec push it), you can get rid of the extra memcpy that way. The decoding would still happen in sync with other GPipe calls, but to alleviate that we would need #40. Does that work for you?
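For what it's worth, a minimal sketch of that pull-based approach (the helper `lazyPeekList` and the `framePtr`/`frameLen` names are hypothetical): the list is built with `unsafeInterleaveIO`, so elements are peeked from the decoder's buffer only as `writeBuffer` consumes them.

```haskell
import Foreign.Ptr (Ptr)
import Foreign.Storable (Storable, peekElemOff)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Lazily view a decoder-owned buffer as a Haskell list: each element is
-- peeked only when the consumer (e.g. writeBuffer) demands it.
-- Caveat: the Ptr must stay valid until the list is fully consumed.
lazyPeekList :: Storable a => Ptr a -> Int -> IO [a]
lazyPeekList p n = go 0
  where
    go i
      | i >= n    = pure []
      | otherwise = unsafeInterleaveIO $ do
          x  <- peekElemOff p i
          xs <- go (i + 1)
          pure (x : xs)

-- Usage inside ContextT, assuming a matching HostFormat element type:
--   xs <- liftIO (lazyPeekList framePtr frameLen)
--   writeBuffer buf 0 xs
```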