
[WIP] Descriptor lifetime tracking & Command Pool revamp

Open achalpandeyy opened this issue 2 years ago • 7 comments

Description

Testing

TODO list:

  • [x] OpenGL command buffer execution (COpenGLCommandBuffer::executeAll).
  • [x] Merge master and get things working.
  • [x] The calc_size method does not need to be defined explicitly for "fixed size commands", i.e. commands whose size is known at compile-time. For example: CDrawIndexedIndirectCountCmd and CBeginRenderPassCmd (among many others) are fixed size commands, whereas CPipelineBarrierCmd is not, because the number of resources the command operates on is not known at compile-time.
  • [x] Fix the inversion of control happening between IGPUCommandPool and IGPUCommandBuffer. Related comment.
  • [x] Have two separate segmented lists for non-Vulkan-like backends, i.e. those rendering backends (such as OpenGL) which need to hold on to some data other than the smart_refctd_ptr of descriptors to execute their commands. The first segmented list will hold the smart_refctd_ptr of descriptors (common to both Vulkan-like and non-Vulkan-like backends) and the second one will hold the other data. More details in this comment.
  • [ ] Command Pool Resetting.
    • [x] Turn all ICommandBuffer methods into pure virtuals.
    • [x] Hoist all state transition and validation to IGPUCommandBuffer. Backend specific _impl functions would return void.
    • [x] Detect if the command pool of a command buffer has been already reset with the help of counters in IGPUCommandPool and IGPUCommandBuffer as mentioned in #373 and as already being done in #408 (works like a versioning system), and do NOT release resources in this case.
    • [ ] Check for command pool reset before every command buffer method.
    • [x] Hoist releaseResourcesBackToPool to IGPUCommandBuffer. Context: #408.
    • [ ] IGPUCommandBuffer's last-reset-stamp/counter should be set in begin and not in resetCommon as currently being done in #408.
    • [ ] Figure out a way to release the command segment list memory on pool reset without keeping track of child command buffer heads.
  • [x] OpenGL Command Buffer Baking.
    • [x] When executing certain pre-recorded GL commands it is required to have access to the FBO, VAO and Pipeline caches. If we store these caches in COpenGL_Queue::CThreadHandler::ThreadInternalStateType then they can be just passed along to COpenGLCommandBuffer::executeAll called in CThreadHandler::process_request.
      • [x] FBO cache.
      • [x] VAO cache.
      • [x] Pipeline cache.
  • [x] Test and then remove the old command pool system.
    • [x] Make Example 1 work.
    • [x] Make Example 2 work.
    • [x] Make Example 6 work.
    • [x] Make Example 7 work.
    • [x] Add more commands.
      • [x] ~CPipelineBarrierCmd has a crash bug for some reason.
      • [x] Implement COpenGLCommandBuffer::bindComputePipeline. (SOpenGLState::pipeline::compute can no longer have a usedShader member.)
      • [x] This (private Discord link) is possibly related.
      • [x] Add the commands which were not already added as the by product of making the aforementioned examples work.
    • [x] Remove the old command pool system.
  • [ ] Descriptor Lifetime Tracking

achalpandeyy avatar Apr 14 '22 15:04 achalpandeyy

Review of the Interfaces

Bug

Using std::forward more than once: https://github.com/Devsh-Graphics-Programming/Nabla/pull/345#discussion_r863819861

Command segment won't really be 128k: https://github.com/Devsh-Graphics-Programming/Nabla/pull/345#discussion_r863768889

Design Failure

There's an inversion of control going on: IGPUCommandBuffer is calling methods that IGPUCommandPool should keep protected or even private.

Therefore IGPUCommandBuffer::emplace and possibly even the deleteCommandSegmentList (as a function taking a head segment parameter) should move to IGPUCommandPool.

Quality of Life improvements

Common CRTP base to provide calc_size

To avoid having to type out the same definition of calc_size for every fixed size command, you could have

template<class CRTP>
class NBL_FORCE_EBO IGPUCommandPool::IFixedSizeCommand
{
    public:
        template<typename... Args>
        static uint32_t calc_size(const Args&... args)
        {
            return sizeof(CRTP);
        }
};

Then for every command that doesn't do weird variable length shenanigans, like CDrawIndexedIndirectCountCmd, you could just do

class IGPUCommandPool::CDrawIndexedIndirectCountCmd : public IGPUCommandPool::CDrawIndirectCommonBase, public IGPUCommandPool::IFixedSizeCommand<IGPUCommandPool::CDrawIndexedIndirectCountCmd>
{
public:
    CDrawIndexedIndirectCountCmd(const core::smart_refctd_ptr<const video::IGPUBuffer>& buffer, const core::smart_refctd_ptr<const video::IGPUBuffer>& countBuffer)
        : CDrawIndirectCommonBase(buffer, calc_size(buffer, countBuffer)), m_countBuffer(countBuffer)
    {}
    
protected:
    core::smart_refctd_ptr<const IGPUBuffer> m_countBuffer;
};

Abandon inheritance from Interface Commands, have a separate tape for Backend Specific commands

Ok, this one is my own "Design Failure": basically I thought to have the linear array of commands recorded for every backend, but have the backend derive (inherit) from the interface command in order to add extra data.

So while a Vulkan tape would look like this:

{cmdSize,bindPipelineGeneralParams},
{cmdSize,bindDescriptorSetsGeneralParams},
{cmdSize/*,pushConstantsGeneralEmptyStruct*/},
{cmdSize/*,dispatchGeneralEmptyStruct*/},

An OpenGL tape would look like this (via the virtues of inheritance):

{cmdSize,bindPipelineGeneralParams/*,nothingExtraNeeded*/},
{cmdSize,bindDescriptorSetsGeneralParams/*,nothingExtraNeeded*/},
{cmdSize/*,pushConstantsGeneralEmptyStruct*/,pushConstantData},
{cmdSize/*,dispatchGeneralEmptyStruct*/,dispatchParams},

However things start to get hairy when you think about the ability to generate "pruned" command lists (such as thanks to OpenGL validation, or some things just being NOOPs on Vulkan).

If you wanted to not record certain commands because they don't make sense on a given backend (e.g. Vulkan doesn't need to replay or re-encode them), you'd end up having special checks for them.

If you start "skipping" commands that have no effect on the state, you may miss holding onto some smart_refctd_ptr references and mess up the whole lifetime tracking of the framework (basically if you do 5 descriptor set binds in quick succession, whether your descriptor sets would be kept alive would depend on the backend you're using which would be insane).

Furthermore, it's just bad for speed and cache, since you'd be skipping over a lot of data you don't need on a particular backend when iterating.

So my new proposal is to make all the interface commands final, and for non-Vulkan backends have two tapes (still allocated from the same pool, just not sharing segments): one in IGPUCommandBuffer, which is the "general tape" for keeping refcounted pointers alive, and another in the derived backend class just for storing backend-specific commands (to iterate over when executing). The backend-specific commands would inherit directly from ICommand as usual (and IFixedSizeCommand whenever applicable).

So then our General tape looks like this (only cares about lifetime tracking for Vulkan-like API):

{cmdSize,bindPipelineGeneralParams},
{cmdSize,bindDescriptorSetsGeneralParams}

and it doesn't need to care about NOOPs such as draw, dispatch, etc.

And our OpenGL tape would just care about storing the necessary info for when executeAll is run, so it wouldn't need to hold onto any smart_refctd_ptr at all.

{cmdSize,bindTexturesCmd},
{cmdSize,bindSSBOsCmd},
{cmdSize,pushConstantData},
{cmdSize,dispatchParams}

Note that the commands in a GL tape wouldn't need to correspond 1:1 with the commands in the general tape.
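To make the split concrete, here is a minimal standalone sketch (all names hypothetical, std::shared_ptr standing in for smart_refctd_ptr) of the two-tape idea: the general tape only keeps resources alive, the backend tape only stores execute-time data, and the two need not line up 1:1.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Stand-ins: the general tape holds only refcounted pointers, the backend
// tape holds only execute-time parameters.
struct GeneralCmd { std::shared_ptr<int> keepAlive; };
struct BackendCmd { uint32_t glParams; };

struct TwoTapeBuffer
{
    std::vector<GeneralCmd> generalTape; // lifetime tracking only
    std::vector<BackendCmd> backendTape; // iterated by executeAll

    void bindDescriptorSets(std::shared_ptr<int> ds)
    {
        generalTape.push_back({std::move(ds)}); // keep the set alive
        backendTape.push_back({1u}); // e.g. a bindTexturesCmd
        backendTape.push_back({2u}); // e.g. a bindSSBOsCmd
    }

    void dispatch(const uint32_t params)
    {
        // a NOOP on the general tape: a dispatch keeps nothing alive
        backendTape.push_back({params});
    }
};
```

Whether a descriptor set stays alive now only depends on the general tape, identically on every backend.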

Therefore IGPUCommandBuffer::emplace and possibly even the deleteCommandSegmentList (as a function taking a head segment parameter) should move to IGPUCommandPool.

For these:

m_cmdpool->m_commandSegmentPool.allocate(IGPUCommandPool::COMMAND_SEGMENT_SIZE, alignof(IGPUCommandPool::CommandSegment));

and

m_cmdpool->m_commandSegmentPool.deallocate(currSegment, IGPUCommandPool::COMMAND_SEGMENT_SIZE);

can I make wrappers:

CommandSegment* IGPUCommandPool::allocateCommandSegment()
{
    return m_commandSegmentPool.allocate(IGPUCommandPool::COMMAND_SEGMENT_SIZE, alignof(IGPUCommandPool::CommandSegment));
}

and

void IGPUCommandPool::deallocateCommandSegment(CommandSegment* segment)
{
    m_commandSegmentPool.deallocate(segment, IGPUCommandPool::COMMAND_SEGMENT_SIZE);
}

and not move both of the functions IGPUCommandBuffer::emplace and IGPUCommandBuffer::deleteCommandSegmentList to IGPUCommandPool? If we do this, we wouldn't have to pass the redundant parameters (like m_segmentListHeadItr and m_segmentListTailItr) every time we emplace a new command in C{Vulkan/OpenGL}CommandBuffer functions, like:

bool bindIndexBuffer(..)
{
    ..
    m_cmdpool->emplace<IGPUCommandPool::CBindIndexBufferCmd>(m_segmentListHeadItr, m_segmentListTail, /*other required params*/);
    ..
}

and we can do:

bool bindIndexBuffer(..)
{
    ..
    emplace<IGPUCommandPool::CBindIndexBufferCmd>(/*other required params*/);
    ..
}

achalpandeyy avatar Jul 25 '22 06:07 achalpandeyy

OpenGL Command Buffer Baking

What is happening now

  • SOpenGLContextLocalCache is cached in the command buffer's COpenGL_Queue's thread internal state (ThreadInternalStateType).
  • The cached SOpenGLContextLocalCache is modified (eventually to reflect new GL context state on "flush") at command-buffer-execution-time.
  • "Flushing" (by functions such as, flushStateGraphics and flushStateCompute etc) calls GL functions immediately.
  • SCmd<> structures are analogues to vkCmd*.
  • Non-shareable GL objects such as VAO, FBO and Shader Programs (because of cached uniform state) are cached inside SOpenGLContextLocalCache inside queue's thread's internal state.

What we want

  • SOpenGLContextLocalCache is cached in the command buffer itself.
  • The cached SOpenGLContextLocalCache is modified (eventually to reflect new GL context state on "flush") at command-buffer-record-time. This comment highlights the difference.
  • "Flushing" (by functions such as, flushStateGraphics and flushStateCompute etc) records GL functions in the command buffer to execute them later when the command buffer executes. This is important to not let the GL context get out of sync with what the currently executing command expects.
  • In light of the point above, SCmd<> structures are analogues to GL functions, such as glBindVertexArray (with slight renaming in accordance with this) such that they can be recorded in the command buffer when a "flush" function is called.
  • According to the very first point, since the stuff in SOpenGLContextLocalCache is now cached in the command buffer, non-shareable GL objects should stay in the queue's cache and thus separate from SOpenGLContextLocalCache.
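The record-time vs execute-time distinction above can be illustrated with a tiny standalone sketch (names hypothetical, the GL call stubbed out): a "flush" during recording only appends a deferred command to the tape, and the GL context state only changes when executeAll replays it.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Stub for the real GL call so the sketch is self-contained.
static uint32_t g_boundVAO = 0u;
static void glBindVertexArray_stub(const uint32_t vao) { g_boundVAO = vao; }

struct BakedCmdBuf
{
    std::vector<std::function<void()>> tape;

    // analogous to a "flush" function: record the GL call, don't make it
    void flushVAOState(const uint32_t vaoName)
    {
        tape.push_back([vaoName](){ glBindVertexArray_stub(vaoName); });
    }

    // the queue thread replays the recorded calls later
    void executeAll()
    {
        for (auto& cmd : tape)
            cmd();
    }
};
```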

achalpandeyy avatar Jul 30 '22 03:07 achalpandeyy

Therefore IGPUCommandBuffer::emplace and possibly even the deleteCommandSegmentList (as a function taking a head segment parameter) should move to IGPUCommandPool.

For these:

m_cmdpool->m_commandSegmentPool.allocate(IGPUCommandPool::COMMAND_SEGMENT_SIZE, alignof(IGPUCommandPool::CommandSegment));

and

m_cmdpool->m_commandSegmentPool.deallocate(currSegment, IGPUCommandPool::COMMAND_SEGMENT_SIZE);

can I make wrappers:

CommandSegment* IGPUCommandPool::allocateCommandSegment()
{
    return m_commandSegmentPool.allocate(IGPUCommandPool::COMMAND_SEGMENT_SIZE, alignof(IGPUCommandPool::CommandSegment));
}

and

void IGPUCommandPool::deallocateCommandSegment(CommandSegment* segment)
{
    m_commandSegmentPool.deallocate(segment, IGPUCommandPool::COMMAND_SEGMENT_SIZE);
}

and not move both of the functions IGPUCommandBuffer::emplace and IGPUCommandBuffer::deleteCommandSegmentList to IGPUCommandPool? If we do this, we wouldn't have to pass the redundant parameters (like m_segmentListHeadItr and m_segmentListTailItr) every time we emplace a new command in C{Vulkan/OpenGL}CommandBuffer functions, like:

bool bindIndexBuffer(..)
{
    ..
    m_cmdpool->emplace<IGPUCommandPool::CBindIndexBufferCmd>(m_segmentListHeadItr, m_segmentListTail, /*other required params*/);
    ..
}

and we can do:

bool bindIndexBuffer(..)
{
    ..
    emplace<IGPUCommandPool::CBindIndexBufferCmd>(/*other required params*/);
    ..
}

all suggestions are fine I guess

OpenGL Command Buffer Baking

What is happening now

  • SOpenGLContextLocalCache is cached in the command buffer's COpenGL_Queue's thread internal state (ThreadInternalStateType).
  • The cached SOpenGLContextLocalCache is modified (eventually to reflect new GL context state on "flush") at command-buffer-execution-time.
  • "Flushing" (by functions such as, flushStateGraphics and flushStateCompute etc) calls GL functions immediately.
  • SCmd<> structures are analogues to vkCmd*.

well not really analogues, but they keep all arguments necessary to carry out the command's equivalent in a deferred way on another thread.

  • Non-shareable GL objects such as VAO, FBO and Shader Programs (because of cached uniform state) are cached inside SOpenGLContextLocalCache inside queue's thread's internal state.

What we want

  • SOpenGLContextLocalCache is cached in the command buffer itself.

Also only used/relevant for methods called between a begin and end.

  • The cached SOpenGLContextLocalCache is modified (eventually to reflect new GL context state on "flush") at command-buffer-record-time. This comment highlights the difference.
  • "Flushing" (by functions such as, flushStateGraphics and flushStateCompute etc) records GL functions in the command buffer to execute them later when the command buffer executes. This is important to not let the GL context get out of sync with what the currently executing command expects.
  • In light of the point above, SCmd<> structures are analogues to GL functions, such as glBindVertexArray (with slight renaming in accordance with this) such that they can be recorded in the command buffer when a "flush" function is called.

Wherever you have a glCall in the flush function, you should have an emplace onto the second tape and a unique struct type.

  • According to the very first point, since the stuff in SOpenGLContextLocalCache is now cached in the command buffer, non-shareable GL objects should stay in the queue's cache and thus separate from SOpenGLContextLocalCache.

Yes, so in the structs used for the above, you need to keep some data (pointer and/or hash value) needed to find or create the non-shareable GL objects.

Then for every command that doesn't do weird variable length shenanigans, like CDrawIndexedIndirectCountCmd, you could just do

Wouldn't every command be just fixed size? For example, even for CPipelineBarrierCmd (which I consider could be the prime example of a "variable" size command) we can just have a core::smart_refctd_dynamic_array of images and buffers like this, and then derive from IGPUCommandPool::IFixedSizeCommand<CRTP>.

Is this not the ideal way to go about this? Now that I think about it, it might not be the ideal way because the dynamically allocated memory via core::smart_refctd_dynamic_array would not be aligned to cache line size and may cause false sharing --which would defeat the entire purpose of making the commands align to cache line size. So we can try something like this (demonstrated for CBindDescriptorSetsCmd). It does seem a bit strange though.

Commands affected by this: CBindDescriptorSetsCmd, CPipelineBarrierCmd, CNamedBufferSubDataCmd, IGPUCommandPool::CExecuteCommands, COpenGLCommandPool::CExecuteCommands, CWaitEventsCmd, CWriteAccelerationStructurePropertiesCmd, CBuildAccelerationStructuresCmd.

achalpandeyy avatar Aug 21 '22 05:08 achalpandeyy

Descriptor Lifetime Tracking

GPU resources, referenced by descriptor sets, can go out of scope and get destroyed before we get the chance to bind the descriptor set and use them. To solve this, we make the descriptor set hold on to a smart_refctd_ptr to each of these resources.

When we create a descriptor pool we declare how many descriptors of each type it can hold, at maximum, via the poolSizeCount and poolSizes params.

core::smart_refctd_ptr<IDescriptorPool> createDescriptorPool(
        IDescriptorPool::E_CREATE_FLAGS flags, uint32_t maxSets, uint32_t poolSizeCount,
        const IDescriptorPool::SDescriptorPoolSize* poolSizes) override

Note: Don't confuse it with the maxSets member, which tells you how many descriptor sets can be allocated from a pool; for all intents and purposes, maxSets can be thought of as referring to the VkDescriptorSet handles themselves, which occupy a constant amount of space regardless of what or how many descriptors the set references.

Using these params we can create a core::smart_refctd_dynamic_array for every descriptor type and store it in the IDescriptorPool instance. IGPUDescriptorSet would then store pointers (or offsets) into these dynamic arrays.

Note: Immutable samplers are kept alive by IGPUDescriptorSetLayout via the SBinding::samplers member. IGPUDescriptorSetLayout in turn is kept alive by the IGPUDescriptorSet, so nothing needs to be done for immutable samplers. However, for mutable samplers lifetime tracking is required, and they can be treated as just another descriptor type with their own separate memory block or refctd_dynamic_array (see Open Question 1 below). Sidenote: This should be a raw pointer because SImageInfo is a throw-away structure mainly used to update the descriptor sets and AFAICT we don't care about lifetime there.

Since we know the descriptor set layout when we create a IGPUDescriptorSet,

core::smart_refctd_ptr<IGPUDescriptorSet> createDescriptorSet_impl(IDescriptorPool* pool,
    core::smart_refctd_ptr<const IGPUDescriptorSetLayout>&& layout) override;

we can figure out the exact number of descriptors, of each type, that the set can ever use and that need to be allocated from the dynamic arrays stored in its parent pool.

Descriptor sets allocated from a descriptor pool must not be freed (via vkFreeDescriptorSets) unless the pool is created with the VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT bit set.

Descriptor set allocation failure due to fragmentation should not happen if:

  • Descriptor pool is not created with the VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT bit set.
  • Descriptor pool is created with the VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT bit set BUT it has not had any descriptor sets freed since it was created or most recently reset.
  • If all sets allocated from the pool since it was created or most recently reset use the same number of descriptors (of each type) and the requested allocation also uses that same number of descriptors (of each type).

General Idea/Algorithm

When creating an IGPUDescriptorSet, iterate through the given IGPUDescriptorSetLayout's bindings and, depending on the descriptor type, reserve space in the parent descriptor pool's descriptor storage (see Open Question 1 for more details). Store the offsets/raw pointers into the pool's descriptor storage in IGPUDescriptorSet. When updating the descriptor set, use these offsets/raw pointers to put the actual resource's smart_refctd_ptrs into the pool's storage.
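As a standalone illustration of the idea above (all names hypothetical, std::shared_ptr standing in for smart_refctd_ptr), a set reserves a contiguous range in its pool's per-type storage at creation and writes the actual descriptors through the recorded offset at update time:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

using Descriptor = std::shared_ptr<int>; // stand-in for smart_refctd_ptr<IDescriptor>

// One storage array per descriptor type in the real design; one suffices here.
struct DemoDescriptorPool
{
    std::vector<Descriptor> storage;
    uint32_t cursor = 0u;

    explicit DemoDescriptorPool(const uint32_t maxDescriptors) : storage(maxDescriptors) {}

    // linear reservation, mirroring the no-free path
    uint32_t reserve(const uint32_t count)
    {
        const uint32_t offset = cursor;
        cursor += count;
        return offset;
    }
};

struct DemoDescriptorSet
{
    DemoDescriptorPool* pool;
    uint32_t offset; // recorded at creation, from the layout's binding counts

    DemoDescriptorSet(DemoDescriptorPool* p, const uint32_t descriptorCount)
        : pool(p), offset(p->reserve(descriptorCount)) {}

    // the update writes the actual refcounted pointer into the pool's storage
    void update(const uint32_t storageIndex, Descriptor d)
    {
        pool->storage[offset+storageIndex] = std::move(d);
    }
};
```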

Required (and Foreseeable) Modifications to the OpenGL Backend

  1. Currently, the storage of smart_refctd_ptr<IDescriptor> comes from COpenGLDescriptorSet in the form of refctd_dynamic_arrays for each descriptor type. This will be removed as we're adding this storage in IDescriptorPool so that it remains common for all backends. All of this should move to the common base IGPUDescriptorSet.
  2. Storage for these OpenGL specific arrays should come from a (currently non-existing) COpenGLDescriptorPool? That means we pool the allocation for these arrays, for all COpenGLDescriptorSets that will ever be allocated from this pool. Again, this can be done because we know, beforehand, the exact number of descriptors of each type that will be allocated from this pool. Allocation from the pool's dynamic arrays/memory block should follow the same policies regarding allocation failure due to fragmentation as mentioned above. The SMultibindParams will be initialized in the COpenGLDescriptorSet constructor as it is happening currently. COpenGLDescriptorSet could also store offsets/raw pointers into these OpenGL specific arrays if required.
  3. TODO: Is there anything special that needs to be done to handle dynamic offsets?
  4. TODO: Things from IEmulatedDescriptorSet can be hoisted to IDescriptorSetLayout?

Special Flags (TODO)

  • VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT
  • VK_DESCRIPTOR_BINDING_UPDATE_UNUSED_WHILE_PENDING_BIT
  • VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT

Open Questions:

  1. Until we update a descriptor set we don't have the actual smart_refctd_ptrs to the resources that need to go into the pool's dynamic arrays, which means we need a way to mark the space reserved for a descriptor set until it's updated, so that other descriptor sets don't try to claim that space in the meantime. If we use nullptr to indicate empty space, what is a good default value for reserved space?

I have some suggestions for this: A. Make refctd_dynamic_array of

struct
{
    bool isReserved;
    core::smart_refctd_ptr<const IDescriptor> descriptor;
};

instead of just core::smart_refctd_ptr<const IDescriptor>.

B. Ditch refctd_dynamic_array entirely and have a fixed size memory block for each descriptor type instead. Then either use a GeneralPurposeAddressAllocator or a LinearAddressAllocator depending upon whether freeing of descriptor sets is allowed or not. This will most likely require us to do this (private Discord link). If we do end up going this route I will remove the private link and update the comment with the exact thing that needs to happen.

achalpandeyy avatar Sep 19 '22 12:09 achalpandeyy

Using these params we can create core::smart_refctd_dynamic_array for every descriptor type and store it in the IDescriptorPool instance. IGPUDescriptorSet would then, store pointers (or offsets) into these dynamic arrays.

What's the template parameter in the array?

In my opinion it would be sufficient to create a "plain" array aligned to the Backend's Final Class Descriptor Set's "binding record" and large enough to store that many multiples of it.

for example

  protected:
    struct SCombinedSampler
    {
       core::smart_refctd_ptr<IGPUImageView> view;
       core::smart_refctd_ptr<IGPUSampler> sampler;
    };
    
    // a struct that won't trigger any ctors/dtors etc., also facilitates pointer arithmetic
    template<typename T>
    struct alignas(T) StorageTrivializer
    {
      uint8_t storage[sizeof(T)];
    };
    
    std::unique_ptr<StorageTrivializer<SCombinedSampler>[]> combinedSamplerStorage;
    std::unique_ptr<StorageTrivializer<core::smart_refctd_ptr<IGPUImageView>>[]> imageStorage; // storage, and input attachment
    std::unique_ptr<StorageTrivializer<core::smart_refctd_ptr<IGPUBufferView>>[]> UTB_STBstorage;
    std::unique_ptr<StorageTrivializer<core::smart_refctd_ptr<IGPUBuffer>>[]> UBO_SSBOStorage; // includes dynamics

The arrays for storing auxiliary backend stuff like SMultibindParams you can allocate and manage separately in the derived class.

You can however generalize and precompute offsets from binding to storage index (you can query via binary search/lower bound) for each descriptor set layout.

Then to map from DS + binding + type + index in binding array -> storage index in pool, all you need to do is binary search for that offset within the DS and add the DS's storage offset within the pool.

Hint, maybe hoist these? I don't remember what they do but seems similar https://github.com/Devsh-Graphics-Programming/Nabla/blob/master/src/nbl/video/COpenGLDescriptorSet.h#L356

P.S. In the asset:E_DESCRIPTOR_TYPE enum, we should probably comment out ACCELERATION_STRUCTURE_NV, INLINE_UNIFORM_BLOCK_EXT and MUTABLE_VALVE as we don't support these. Also rename ACCELERATION_STRUCTURE_KHR to just ACCELERATION_STRUCTURE.

Note: Immutable samplers are kept alive by IGPUDescriptorSetLayout via the SBinding::samplers member.

Correct.

IGPUDescriptorSetLayout in turn is kept alive by the IGPUDescriptorSet, so nothing needs to be done for immutable samplers.

Correct.

However, for mutable samplers lifetime tracking is required, and they can be treated as just another descriptor type with their own separate memory block or refctd_dynamic_array (see Open Question 1 below).

~~Just use a struct that's essentially a pair of smart pointers of IGPUImage and IGPUSampler, see the code snippet above.~~

SoA best

Sidenote: This should be a raw pointer because SImageInfo is a throw-away structure mainly used to update the descriptor sets and AFAICT we don't care about lifetime there.

No because desc is a smart pointer too, c.f. https://github.com/Devsh-Graphics-Programming/Nabla/blob/c009091c572c4beaa587f4dff7b343a7ad67a843/include/nbl/asset/IDescriptorSet.h#L59

We use smart pointers for arguments because you don't need to keep the smart pointer references alive somewhere else while you build the SWriteDescriptorSet or SCopyDescriptorSet argument arrays.

Pure ergonomics: right now we can create the image/buffer/sampler and assign it right away to an SDescriptorInfo struct field and not worry about lifetimes; if we made the change we'd need a whole load of temporary variables.

we can figure out the exact number of descriptors, of each type, that the set can ever use and that need to be allocated from the dynamic arrays stored in its parent pool.

Yes, this means that you can create a mapping from every binding to an offset in some array that provides backing storage for some metadata structs (OpenGL does this but the storage is SoA not AoS because of how you pass parameters to OpenGL multibind, also it groups descriptors by type into a single array differently).

e.g.

struct BindingToStorage
{
  uint32_t binding;
  uint32_t storageOffset;
  // if you need to store some other offsets, then inherit from the struct and use that
  
  inline bool operator<(const BindingToStorage& other) const
  {
    return binding<other.binding;
  }
};

Then you can binary search for the binding you wish to modify/get, add the offset (for an array binding) and voila you have your storage slot.
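A minimal sketch of that lookup (assuming the per-layout BindingToStorage records are sorted by binding; getStorageIndex is a hypothetical helper):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct BindingToStorage
{
    uint32_t binding;
    uint32_t storageOffset;

    inline bool operator<(const BindingToStorage& other) const
    {
        return binding < other.binding;
    }
};

// binary search the layout's sorted records, then shift by the set's
// storage offset within the pool (assumes the binding exists in the layout)
uint32_t getStorageIndex(const std::vector<BindingToStorage>& layoutOffsets,
    const uint32_t binding, const uint32_t indexInBindingArray,
    const uint32_t setStorageOffset)
{
    const auto found = std::lower_bound(layoutOffsets.begin(), layoutOffsets.end(),
        BindingToStorage{binding, 0u});
    return setStorageOffset + found->storageOffset + indexInBindingArray;
}
```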

Descriptor set allocation failure due to fragmentation should not happen if:

* Descriptor pool is not created with the `VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT` bit set.

This is trivial to satisfy, as you can use a LinearAddressAllocator (or just a pointer/cursor that you bump).

If nothing is ever freed, it's impossible to fragment and fail an allocation of any storage block, as long as the total number of descriptors of a given type stays below the up-front declared maximum.

  • Descriptor pool is created with the VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT bit set BUT it has not had any descriptor sets freed since it was created or most recently reset.

This is basically a situation where you have a brand new "clean slate" allocator and the constraint is that it must behave like a linear allocator.

Well I have good news for you, the GeneralPurposeAddressAllocator will give you exact same allocations that LinearAddressAllocator would do as long as:

  • the minimum allocation or block size is 1 (or there was no allocation less than the minimum)
  • the alignments of subsequent allocs have been the same
  • there have been no frees since the last reset

  • If all sets allocated from the pool since it was created or most recently reset use the same number of descriptors (of each type) and the requested allocation also uses that same number of descriptors (of each type).

GeneralPurposeAddressAllocator will also satisfy this requirement.

Conclusion

To allocate offsets into storage, use GeneralPurposeAddressAllocator<uint32_t> with a minimum allocation and block size of 1 if pool has the VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT flag, otherwise use LinearAddressAllocator<uint32_t>.

Important points for both allocators:

  • always use an alignment of 1
  • allocate the offset for all descriptors of a given type in a pool at once (only in this way can you satisfy the fragmentation constraints of Vulkan!)

When creating an IGPUDescriptorSet, iterate through the given IGPUDescriptorSetLayout's bindings and, depending on the descriptor type, reserve space in the parent descriptor pool's descriptor storage (see Open Question 1 for more details). Store the offsets/raw pointers into the pool's descriptor storage in IGPUDescriptorSet. When updating the descriptor set, use these offsets/raw pointers to put the actual resource's smart_refctd_ptrs into the pool's storage.

As a corollary of my previous 4 comments, we can see that an IGPUDescriptorSet must hold the following:

core::smart_refctd_ptr<IGPUDescriptorSetLayout> m_layout; // both to keepalive and to query offsets into storage
core::smart_refctd_ptr<IGPUDescriptorPool> m_pool; // both to keepalive and to obtain memory for bindings
// offsets of the set's allocations within the pool to apply as a shift to all layout precomputed offsets
uint32_t m_combinedSamplerStorageOffset;
uint32_t m_imageViewStorageOffset;
uint32_t m_UTB_STBStorageOffset;
uint32_t m_UBO_SSBOStorageOffset;
// if you have some other arrays per-backend with extra data, then store more offsets in the derived class

P.S. It's best if you precompute the offsets per-binding and store them in the DS layout; you only need to do it once.

Required (and Foreseeable) Modifications to the OpenGL Backend

1. Currently, the storage of `smart_refctd_ptr<IDescriptor>` comes from `COpenGLDescriptorSet` in the form of `refctd_dynamic_array`s for each descriptor type. This will be removed as we're adding this storage in `IDescriptorPool` so that it remains common for all backends. All of [this](https://github.com/Devsh-Graphics-Programming/Nabla/blob/c009091c572c4beaa587f4dff7b343a7ad67a843/src/nbl/video/COpenGLDescriptorSet.h#L53-L90) should move to the common base `IGPUDescriptorSet`.

You need similar code in IDescriptorSetLayout, but this OpenGL code computes slightly different offsets actually. The reason is because OpenGL groups some descriptor types together (UTB and Combined Samplers share slot address space).

Actually the main class providing the slots and mapping to core::smart_refctd_ptr is IEmulatedDescriptorSet. So we've always had a "working" version of DLT in OpenGL, just not in VK.

Also that code should move from COpenGLDescriptorSet to COpenGLDescriptorSetLayout.

2. Storage for [these](https://github.com/Devsh-Graphics-Programming/Nabla/blob/c009091c572c4beaa587f4dff7b343a7ad67a843/src/nbl/video/COpenGLDescriptorSet.h#L92-L128) OpenGL specific arrays should come from a (currently non-existing) `COpenGLDescriptorPool`? 

Yes.

That means we pool the allocation for these arrays, for all COpenGLDescriptorSets that will ever be allocated from this pool. Again, this can be done because we know, beforehand, the exact number of descriptors of each type that will be allocated from this pool. Allocation from the pool's dynamic arrays/memory block should follow the same policies regarding allocation failure due to fragmentation as mentioned above.

Good conclusion, as you can see that my last 5 comments concur with your line of reasoning.

   The `SMultibindParams` will be initialized in the `COpenGLDescriptorSet` constructor as it is happening currently.
   `COpenGLDescriptorSet` could also store offsets/raw pointers into these OpenGL specific arrays if required.

Yes, here's the huge pitfall we need to avoid (smart pointer circular references).

The Descriptor Pool only allocates and hands out raw bytes.

It's the Descriptor Set that must call placement new and the destructor to initialize/deinitialize the arrays of descriptor binding data.

In the Constructor:

```cpp
const uint32_t bufferBindingCount = m_layout->getBufferBindingCount();
const uint32_t offset = m_pool->allocateBufferBindings(bufferBindingCount);
SBufferBinding* bufferBindings = new (m_pool->getBufferBindings()+offset) SBufferBinding[bufferBindingCount];
```

In the Destructor:

```cpp
for (uint32_t i = 0u; i < m_layout->getBufferBindingCount(); i++)
    bufferBindings[i].~SBufferBinding();
m_pool->freeBufferBindings(bufferBindingCount);
```

Basically it's as if the Set is the `std::vector` and the Pool is the `std::allocator`.

3. _TODO: Is there anything special that needs to be done to handle dynamic offsets?_

AFAIK no, it only takes a per-backend class on non-Vulkan backends to emulate/track them.

See OpenGL Descriptor Set, it stores a little bit of extra data per each descriptor to track what arguments need to be passed to the multibind functions.

4. _TODO: Things from `IEmulatedDescriptorSet` can be hoisted to `IDescriptorSetLayout`_?

YES, please do

It's a great idea, as IEmulatedDescriptorSet already provides similar functionality to DescriptorLifetimeTracking for ICPUDescriptorSet.

But of course in the nbl::asset namespace we don't do any pooling (the backing memory for binding records lives in the actual ICPUDescriptorSet, not in any Pool) because we want to be more agile with our resources.

Special Flags (TODO)

My idea is to have a revision atomic counter which is incremented each time a descriptor binding with neither the AFTER_BIND nor the WHILE_PENDING flag is updated.

First of all, to correctly invalidate a command buffer after a "forbidden" descriptor set update, we need to know the version of the descriptor set at the time of first bind, so we need to store an `unordered_map<IDescriptorSet,IDescriptorSet::revision_counters_v>` as a member in IGPUCommandBuffer so we can do some loose validation at submit time (or whenever we check for implicit invalidation).

We can also optimize and not put a DS on the validation map above at all, if and only if all of its bindings have flags that skip the revision counter update.

* `VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT`

This is trivial to support, we don't even validate that all bindings are bound (updated at least once) before we bind or use the descriptor set. If we start validating all bindings are populated, then we should take the flag into account.

* `VK_DESCRIPTOR_BINDING_UPDATE_UNUSED_WHILE_PENDING_BIT`

We would simply not increment revision counter when updating any binding with this flag set.

Because this flag only allows us to update bindings which are not used by the command buffer (depending on the presence of PARTIALLY_BOUND the definition of "used" changes from static to dynamic usage, but we don't check that so we don't care), we don't need to worry about preventing a previously bound descriptor's reference count dropping to zero while a submit might need it to stay alive, because it's not supposed to be used anyway.

* `VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT`

The support for this flag is a must, too much code and too many examples rely on this behaviour, which on OpenGL works out of the box.

To facilitate Update after bind, we don't increment that revision counter when updating any binding with that flag set.

Because IGPUCommandBuffer will hold a reference to any bound IGPUDescriptorSet, and thanks to the fact that updating a descriptor set currently in use by a pending command buffer is not valid usage (an explicit error), we don't need to worry about "backing up" the old bindings somewhere else.

OpenGL already has such a system and it uses it to detect when it needs to do a rebind (call multibind again) even if the bound DS has not changed. We can leave OpenGL with this default "Update after Bind always on" because the hoisted validation will throw errors when we attempt to use this feature without declaring all the flags (this prevents unintended divergence in behaviour between backends while giving no errors w.r.t. valid Nabla usage).

Two things to watch out for:

Multiple descriptors with this flag set can be updated concurrently in different threads, though the same descriptor must not be updated concurrently by two threads.

Yet another reason to precompute every offset in the Descriptor Set Layout and make offset getting a const method which performs a binary search.

Also this constrains us to allocating all storage for a pool up front and not reallocating it during its lifetime.

Descriptors with this flag set can be updated concurrently with the set being bound to a command buffer in another thread

Can't think of any caveats right now.

Until we update a descriptor set we don't have the actual `smart_refctd_ptr` to the resources that need to go into the pool's dynamic arrays; that means we need a way to mark the space reserved for a descriptor set until it's updated, so that other descriptor sets don't try to claim that space in the meanwhile. If we use nullptr to indicate empty space, what is a good default value for reserved space?

Well yes, after allocating the space for structs that may include a smart_refctd_ptr you need to run a placement new on the storage (right after allocation, in the descriptor set constructor) to call the constructors, which initialize any smart_refctd_ptr member to nullptr.

Otherwise you end up with random jank in the smart_refctd_ptr, and when you try to assign to it, it will try to drop the object pointed to by its internal raw pointer, whose value would be jank.

B. Ditch refctd_dynamic_array entirely and have a fixed size memory block for each descriptor type instead. Then either use a GeneralpurposeAddressAllocator or LinearAddressAllocator depending upon whether freeing of descriptor sets is allowed or not. This will most likely require us to do this (private Discord link). If we do end up going this route I will remove the private link and update the comment with the exact thing that needs to happen.

Do B, and forget about A.

However the discord link doesn't quite work for me.

Lets use Structure of Arrays for the Combined Image Sampler

Since we're using SoA for OpenGL descriptor binding metadata, we can use it everywhere consistently.

Make your Address Space count in bindings not bytes

This way we have a single index for accessing all SoA data for a binding (even in derived classes like COpenGLDescriptorSet).

Abandon SCombinedImageSampler

This means using a slightly different SBindingInfo for CombinedImageSampler bindings (two separate offsets to keep: one for the image views and one for the samplers), since you want to skip tracking immutable samplers, which are already tracked by the layout.

Lets keep a count of how many descriptors of a given type there are in a Descriptor Set Layout

This makes certain operations easier, such as allocation and iteration.

Use Dependency Injection to get the backing memory in IDescriptorSet

In software engineering, dependency injection is a design pattern in which an object or function receives other objects or functions that it depends on.

Make the IDescriptorSet declare some virtuals which it does not care about the implementation of.

```cpp
protected:
    virtual core::smart_refctd_ptr<IDescriptor>* getDescriptorRefcountingStorage(const E_DESCRIPTOR_TYPE type) = 0;
    virtual core::smart_refctd_ptr<ISampler>* getSamplerRefcountingStorage() = 0;
```

then use these throughout the common hoisted code to deal with lifetime tracking, i.e.

```cpp
protected:
    // gets called by the derived class constructor as soon as the setup necessary
    // for `get***RefcountingStorage` is complete
    inline void createDescriptors()
    {
        allocateDescriptors();
        std::uninitialized_default_construct_n(getSamplerRefcountingStorage(), m_layout->getMutableSamplerCount());
        for (auto type = 0u; type < asset::EDT_COUNT; type++)
            std::uninitialized_default_construct_n(getDescriptorRefcountingStorage(type), m_layout->getTotalDescriptorCount(type));
    }
    virtual ~IDescriptorSet()
    {
        std::destroy_n(getSamplerRefcountingStorage(), m_layout->getMutableSamplerCount());
        for (auto type = 0u; type < asset::EDT_COUNT; type++)
            std::destroy_n(getDescriptorRefcountingStorage(type), m_layout->getTotalDescriptorCount(type));
    }
```

The IGPUDescriptorSet can allocate its offsets into IDescriptorPool's storage in its static creation-helper method, pass them to its own constructor, and implement the get**RefcountingStorage methods as adding these offsets to a m_pool->m_descriptorTracking[type] arrays.

While ICPUDescriptorSet can allocate some unique_ptr<core::StorageTrivializer<core::smart_refctd_ptr<IDescriptor>>[]> arrays in its static create method and r-value move these to its constructor.

IDescriptorPool needs to declare IGPUDescriptorSet a friend

In order for the above to work, the GPU Descriptor Set needs to be able to access internal arrays of IDescriptorPool; it's bad to declare public methods that give access to the bindings to just anyone, so friendship is preferred.

And while we're at it we can make the allocate and free methods private.

All descriptors inherit from IDescriptor which inherits from IReferenceCounted

So all your DLT functions can deal with core::smart_refctd_ptr<IDescriptor> (and core::smart_refctd_ptr<ISampler> for samplers), instead of reinterpret_casting to and from uint8_t*.

But of course please keep separate arrays (and start counting from 0) for separate descriptor types, so that SoA metadata can be attached by derived classes.

Then for every command that doesn't do weird variable length shenanigans, like CDrawIndexedIndirectCountCmd, you could just do

Wouldn't every command be just fixed size? For example, even CPipelineBarrierCmd (which I consider the prime example of a "variable" size command) could just have a core::smart_refctd_dynamic_array of images and buffers like this, and then derive from IGPUCommandPool::IFixedSizeCommand<CRTP>.

Is this not the ideal way to go about this? Now that I think about it, it might not be ideal because the dynamically allocated memory via core::smart_refctd_dynamic_array would not be aligned to cache line size and may cause false sharing, which would defeat the entire purpose of making the commands align to cache line size. So we can try something like this (demonstrated for CBindDescriptorSetsCmd). It does seem a bit strange though.

This is the correct way https://github.com/Devsh-Graphics-Programming/Nabla/blob/915dbc4f3374389ba27b1fe40631f206b91d574a/include/nbl/video/IGPUCommandPool.h#L386

but obviously a bad example because engine max descriptor set count is hardcoded to 4.

Commands affected by this: CBindDescriptorSetsCmd, CPipelineBarrierCmd, CNamedBufferSubDataCmd, IGPUCommandPool::CExecuteCommands, COpenGLCommandPool::CExecuteCommands, CWaitEventsCmd, CWriteAccelerationStructurePropertiesCmd, CBuildAccelerationStructuresCmd.

Why would CNamedBufferSubDataCmd be affected?

but obviously a bad example because engine max descriptor set count is hardcoded to 4.

Okay I will change this to a flat array.

Commands affected by this: CBindDescriptorSetsCmd, CPipelineBarrierCmd, CNamedBufferSubDataCmd, IGPUCommandPool::CExecuteCommands, COpenGLCommandPool::CExecuteCommands, CWaitEventsCmd, CWriteAccelerationStructurePropertiesCmd, CBuildAccelerationStructuresCmd.

Why would CNamedBufferSubDataCmd be affected?

It is because of this member https://github.com/Devsh-Graphics-Programming/Nabla/blob/046481ee80c8906170deb001cdb9a8b7932553b7/src/nbl/video/COpenGLCommandPool.h#L652

achalpandeyy avatar Sep 30 '22 11:09 achalpandeyy


Ah ok you meant COpenGLCommandPool::CNamedBufferSubDataCmd not a generic tape command.


There isn't a generic CNamedBufferSubDataCmd. I don't think it's required, because whichever IGPUCommandBuffer method eventually emplaces this command on the backend-specific tape would have already saved a reference to the ref-counted resource in the generic tape.

achalpandeyy avatar Oct 03 '22 10:10 achalpandeyy

Stuff Carried over from #408

https://github.com/Devsh-Graphics-Programming/Nabla/pull/408#discussion_r940636518 https://github.com/Devsh-Graphics-Programming/Nabla/pull/408#discussion_r1040830105 https://github.com/Devsh-Graphics-Programming/Nabla/pull/408#discussion_r940693463

Some ideas to reduce complexity of Segment List of Commands

Now that we have explored the problem space more thoroughly, allow me to suggest a better design, which I believe makes things a lot cleaner without adding significant overhead (if any):

1. Get rid of CCommandSegment::Iterator entirely, and add ~CCommandSegment

Make CCommandSegment responsible for freeing all of its commands by adding a destructor, like:

```cpp
~CCommandSegment()
{
    auto* cmd = reinterpret_cast<ICommand*>(m_data);
    while ((cmd != segment_end()) && (cmd->getSize() != 0))
    {
        auto* nextCmd = reinterpret_cast<ICommand*>(reinterpret_cast<uint8_t*>(cmd) + cmd->getSize());
        cmd->~ICommand();
        cmd = nextCmd;
    }
}
```

Note: this can be extended easily. For example, the GL backend needs to execute commands, so we could have added a CCommandSegment::execute which just executes all of the commands in the segment.

This will make the deleteCommandSegmentList routine as simple as:

```cpp
void deleteCommandSegmentList(CCommandSegment*& head, CCommandSegment*& tail)
{
    if (!head)
        return;

    auto freeSegment = [this](CCommandSegment*& segment)
    {
        if (segment)
        {
            segment->~CCommandSegment();
            m_pool.deallocate(segment, COMMAND_SEGMENT_SIZE);
            segment = nullptr;
        }
    };

    for (auto& segment = head; segment;)
    {
        auto nextSegment = segment->getNext();
        freeSegment(segment);
        segment = nextSegment;
    }

    // No need to nullify head.
    if (tail)
        tail = nullptr;
}
```

2. The List-of-List data structure to facilitate Command Pool reset

As mentioned here https://github.com/Devsh-Graphics-Programming/Nabla/pull/345#discussion_r1041104593, we make a doubly linked list of segment list heads to facilitate resetting of the command pool. Including this comment as well https://github.com/Devsh-Graphics-Programming/Nabla/pull/345#discussion_r1040939380, I think we can make a reusable data structure as follows:

```cpp
class CCommandListPool
{
public:
    CCommandListPool() : m_pool(COMMAND_SEGMENTS_PER_BLOCK*COMMAND_SEGMENT_SIZE, 0u, MAX_COMMAND_SEGMENT_BLOCK_COUNT, MIN_POOL_ALLOC_SIZE) {}

    bool appendToList(CCommandSegment*& head, CCommandSegment*& tail)
    {
        auto segment = m_pool.emplace<CCommandSegment>();
        if (!segment)
        {
            assert(false);
            return false;
        }

        if (tail)
        {
            tail->setNext(segment);
            tail = tail->getNext();
        }
        else
        {
            assert(!head && "List should've been empty.");

            tail = segment;
            head = segment;

            // Add the list to the pool of lists.
            {
                if (!m_head && !m_tail)
                {
                    m_head = segment;
                    m_tail = segment;
                }
                else
                {
                    assert(m_head && m_tail);

                    m_tail->setNextHead(segment);
                    segment->setPrevHead(m_tail);
                    m_tail = m_tail->getNextHead();
                }
            }
        }

        return bool(segment);
    }

    void deleteList(CCommandSegment*& head, CCommandSegment*& tail)
    {
        if (!head)
            return;

        // Remove the list from the pool of lists.
        {
            assert(m_head && m_tail && "Cannot have a freely dangling command list.");

            auto oneBefore = head->getPrevHead();
            auto oneAfter = head->getNextHead();

            if (oneAfter)
                oneAfter->setPrevHead(oneBefore);

            if (oneBefore)
                oneBefore->setNextHead(oneAfter);

            if (!oneBefore && !oneAfter)
            {
                m_head = nullptr;
                m_tail = nullptr;
            }
        }

        auto freeSegment = [this](CCommandSegment*& segment)
        {
            if (segment)
            {
                segment->~CCommandSegment();
                m_pool.deallocate(segment, COMMAND_SEGMENT_SIZE);
                segment = nullptr;
            }
        };

        for (auto& segment = head; segment;)
        {
            auto nextSegment = segment->getNext();
            freeSegment(segment);
            segment = nextSegment;
        }

        // No need to nullify head.
        if (tail)
            tail = nullptr;
    }

    void clear()
    {
        CCommandSegment* dummy = nullptr;
        for (auto* currHead = m_head; currHead;)
        {
            auto* nextHead = currHead->getNextHead();
            // We don't (and also can't) nullify the tail here because when the command buffer
            // detects that its parent pool has been reset it nullifies both head and tail itself.
            deleteList(currHead, dummy);
            currHead = nextHead;
        }

        m_head = nullptr;
        m_tail = nullptr;
    }

private:
    CCommandSegment* m_head = nullptr;
    CCommandSegment* m_tail = nullptr;
    core::CMemoryPool<core::PoolAddressAllocator<uint32_t>, core::default_aligned_allocator, false, uint32_t> m_pool;
};
```

One CCommandListPool per IGPUCommandPool which will allocate CCommandLists (essentially a pair of CCommandSegment*), per IGPUCommandBuffer.

After the changes are made, the IGPUCommandPool::emplace and IGPUCommandPool::deleteCommandSegmentList methods become trivial:

```cpp
template <typename Cmd, typename... Args>
Cmd* emplace(CCommandSegment*& segmentListHead, CCommandSegment*& segmentListTail, Args&&... args)
{
    if (!segmentListTail && !m_commandListPool.appendToList(segmentListHead, segmentListTail))
        return nullptr;

    auto newCmd = [&]() -> Cmd*
    {
        void* cmdMem = segmentListTail->allocate<Cmd>(args...);
        if (!cmdMem)
            return nullptr;

        return new (cmdMem) Cmd(std::forward<Args>(args)...);
    };

    auto cmd = newCmd();
    if (!cmd)
    {
        if (!m_commandListPool.appendToList(segmentListHead, segmentListTail))
            return nullptr;

        cmd = newCmd();
        if (!cmd)
        {
            assert(false);
            return nullptr;
        }
    }

    return cmd;
}

void deleteCommandSegmentList(CCommandSegment*& segmentListHead, CCommandSegment*& segmentListTail)
{
    m_commandListPool.deleteList(segmentListHead, segmentListTail);
}
```

So we might as well move them to IGPUCommandBuffer, and make it a friend of IGPUCommandPool so it can access the CCommandListPool data structure.

achalpandeyy avatar Dec 12 '22 06:12 achalpandeyy

The function to which most of ~IGPUDescriptorSet needs to move, IDescriptorPool::deleteSetStorage, can be called from any thread.

A similar concern exists for IGPUCommandBuffer::releaseResourcesBackToPool() which may be called by ~IGPUCommandBuffer.

This should be easy to solve with a mutex in the interface class, but only after both functions are written so we know what to protect (probably only vk calls using the pool as an argument + allocators).

We can still run all constructors, initializers and destructors of the allocated storage without a mutex.

Note that this is only limited thread-safeness, not a full-on "provide all external synchronization guarantees for every method as required by Vulkan spec".

P.S. Also why doesn't your CVulkanCommandBuffer call vkFreeCommandBuffers !?