
Advanced buffer/texture update mechanism

Open IAmNotHanni opened this issue 6 months ago • 5 comments

Is your feature request related to a problem?

Currently, we update buffers and textures with simple update code in which neither the pipeline barriers nor the cache flushes are batched. We can reorder the update code and also account for the frames in flight to achieve much better performance.

Description

In a simple picture, we currently update buffers like this:

for each render module
   for each buffer in the current render module
      if(update_is_required)
         destroy_buffer()
         create_buffer(create as MAPPED) // with the new size, maybe even the type changes(?)
         if(memory is HOST_VISIBLE)
            // The allocation ended up in mappable memory and is already mapped!
            // Update mapped memory simply by using memcpy
            memcpy()
            if(memory is not HOST_COHERENT)
               flush_cache(buffer) // only required if caches are not flushed automatically (=HOST_COHERENT)
               // NOTE: If we would support readback from gpu to cpu, we need invalidate_mapped_memory_ranges here!
            pipeline_barrier(buffer) // Wait for copy operation to be finished
         else // not mappable memory
            destroy_staging_buffer(); // Every buffer has a staging buffer associated with it
            create_staging_buffer(create as MAPPED);
            // Copy the data into the staging buffer
            memcpy()
            if(staging buffer memory is not HOST_COHERENT)
               flush_cache(staging_buffer); // only required if caches are not flushed automatically (=HOST_COHERENT)
            pipeline_barrier(staging_buffer) // Wait for copy operation into the staging buffer to be finished
            // we already have a command buffer in recording state here btw
            vkCmdCopyBuffer(staging_buffer, buffer) // Copy from staging buffer into the actual buffer
            pipeline_barrier(buffer) // Wait for copy operation from staging buffer into the actual buffer to finish
            // The staging buffer must stay valid until the command buffer has been submitted, it will be destroyed in next iteration automatically

A similar update mechanism is used for textures, but the main difference is that they always require a staging buffer and a vkCmdCopyBufferToImage command, together with additional barriers for the image layout transitions.

How to improve this?

General strategy: We should aim for batching calls to vkCmdPipelineBarrier as much as possible, and we should also batch calls to vkFlushMappedMemoryRanges. Note that we only need to flush mapped memory ranges if we write from cpu to gpu and the memory is not HOST_COHERENT. Furthermore, if we were to implement readback from gpu to cpu, we would also need a call to vkInvalidateMappedMemoryRanges! We don't support readback from gpu to cpu currently.

Note that the first step of any buffer or texture update involves a memcpy(): either the (buffer) memory is HOST_VISIBLE and can be updated through memcpy() directly, or we need to create and fill a staging buffer for the buffer or texture update. This means we can loop through all buffers and textures, create them, perform the memcpy() for each one, and store the data required for a pipeline barrier after the memcpy() along with the data required for vkFlushMappedMemoryRanges (only needed when the memory is not HOST_COHERENT).

After this loop (for both buffers and textures), we can place one batched call to vkCmdPipelineBarrier and one batched call to vkFlushMappedMemoryRanges.

For buffers which are HOST_VISIBLE, the update is already finished at that stage. We now need to focus on the buffers and textures which require a copy command. The buffers need one pipeline barrier after the vkCmdCopyBuffer, and the textures need two for the image layout transitions before and after calling vkCmdCopyBufferToImage.

From what I understand, we can batch the buffer memory barrier for the buffers after the vkCmdCopyBuffer with the image layout transition barrier before vkCmdCopyBufferToImage, but we can't really batch all 3 barriers into one call of vkCmdPipelineBarrier I guess (I might be wrong about this!).

This means we need to place one or two calls to vkCmdPipelineBarrier towards the end here. In total, we have batched all calls to vkCmdPipelineBarrier into 3 (or maybe only 2?) calls, and we have batched all calls to vkFlushMappedMemoryRanges into a single call! This should significantly improve performance.

The final code should look something like this:

vector<PipelineBarrier> batch1
vector<MappedMemoryRange> ranges1
for every rendermodule
   for every buffer in the current rendermodule
      if(update_is_required)
         destroy_buffer()
         create_buffer()
         if(memory is HOST_VISIBLE)
            memcpy()
            // NOTE: If we would support readback from gpu to cpu, we need invalidate_mapped_memory_ranges here!
            ranges1.add(buffer_range)
            batch1.add(buffer_barrier)
         else
            destroy_staging_buffer()
            create_staging_buffer()
            memcpy()
            ranges1.add(staging_buffer_range)
            batch1.add(staging_buffer_barrier)

   for every texture in the current rendermodule
      if(update_is_required)
         destroy_staging_buffer()
         create_staging_buffer()
         memcpy()
         ranges1.add(staging_buffer_range)
         batch1.add(staging_buffer_barrier) // This is really the image layout transition barrier to TRANSFER_DST

// Both are batched for all buffers and textures in all rendermodules, should be performant!
vkFlushMappedMemoryRanges(ranges1)
vkCmdPipelineBarrier(batch1)

vector<PipelineBarrier> batch2
for every rendermodule
   for every buffer in the current rendermodule
      if(update_is_required)
         if(not HOST_VISIBLE) // TODO: We should store the indices of buffers which require update this way earlier already...
            // This is where we left off, we created the staging buffer
            vkCmdCopyBuffer(staging_buffer, buffer)
            batch2.add(buffer_memory_barrier)
   
   for every texture in the current rendermodule
      if(update_is_required) // TODO: Remember earlier which textures need an update this way, store indices?
         vkCmdCopyBufferToImage(staging_buffer, image)
         batch2.add(image_memory_barrier) // Image layout transition to shader read optimal

// Another batched call, should be very performant
vkCmdPipelineBarrier(batch2)

How does this connect to the frames in flight?

  • The create_buffer method of the buffer wrapper (and similar code in the texture wrapper) should get the current frame in flight index in order to access the correct buffer in the array for the current frame in flight. This all happens automatically; neither rendergraph nor external code should need to worry about it.

Alternatives

If we keep the update mechanism as it is, we place a lot more barriers than needed.

Affected Code

The rendergraph code for buffer and texture management

Operating System

All

Additional Context

None

IAmNotHanni avatar May 21 '25 13:05 IAmNotHanni

EDIT: I guess the image layout transition barrier for the image before vkCmdCopyBufferToImage is really the buffer memory barrier after the memcpy() of the texture data into the staging buffer, but I need to read the details about this.

IAmNotHanni avatar May 21 '25 13:05 IAmNotHanni

I just realized there is vmaFlushAllocations, so we should use that when batching the calls.

IAmNotHanni avatar May 21 '25 16:05 IAmNotHanni

For coherent memory, vmaFlushAllocations (which calls vkFlushMappedMemoryRanges) isn't needed.

yeetari avatar May 23 '25 12:05 yeetari

For coherent memory, vmaFlushAllocations (which calls vkFlushMappedMemoryRanges) isn't needed.

Yes. I would make it so that we check whether the memory is coherent, and only if it's not, the memory range is added to the vector. (The same would apply to vkInvalidateMappedMemoryRanges if we supported memory readback from gpu to cpu.)

Also, vmaFlushAllocations checks internally whether vkFlushMappedMemoryRanges or vkInvalidateMappedMemoryRanges is required, and if nothing is required, it skips the call entirely.

IAmNotHanni avatar May 23 '25 12:05 IAmNotHanni

On desktop, memory is always coherent and the spec guarantees that some coherent memory is available, even on mobile.

yeetari avatar May 23 '25 15:05 yeetari

Another idea just randomly came to my mind: The rendergraph will automatically double or triple buffer all resources internally for the frames in flight, meaning that we will rotate through a set of buffers which are all associated with a frame index, right? Does this mean we could basically allocate one memory pool in VMA for each frame index and then place one big memory barrier which spans the entire pool? I am not sure if what I'm describing makes sense, let alone whether it would really improve performance, but this would be the next logical step after batching all memory barriers for the updates of one frame into one call to vkCmdPipelineBarrier.

The idea behind this is that no rendering can even start until the data for that frame has finished uploading. The details are more complex, of course. It could be that placing one big barrier actually destroys granularity, because some operations could already start earlier, once the updates they depend on have finished. Maybe batching all buffer updates is not a good idea in these scenarios? One would have to evaluate which gpu operations can start based on which of the data updates they require have finished.

IAmNotHanni avatar Nov 04 '25 23:11 IAmNotHanni