[WIP] Descriptor lifetime tracking & Command Pool revamp
Description
Testing
TODO list:
- [x] OpenGL command buffer execution (`COpenGLCommandBuffer::executeAll`).
- [x] Merge `master` and get things working.
- [x] The `calc_size` method does not need to be defined explicitly for "fixed size commands". By "fixed size commands", we mean commands whose size is known at compile-time. For example, `CDrawIndexedIndirectCountCmd` and `CBeginRenderPassCmd` (among many others) are fixed size commands, whereas `CPipelineBarrierCmd` is not, because the number of resources the command operates on is not known at compile-time.
- [x] Fix the inversion of control happening between `IGPUCommandPool` and `IGPUCommandBuffer`. Related comment.
- [x] Have two separate segmented lists for non-Vulkan-like backends, i.e. those rendering backends which need to hold on to some data other than the `smart_refctd_ptr` of descriptors to execute their commands, such as the OpenGL backend. The first segmented list will hold the `smart_refctd_ptr` of descriptors (common to both Vulkan-like and non-Vulkan-like backends) and the second one will hold the other data. More details in this comment.
- [ ] Command Pool Resetting.
  - [x] Turn all `ICommandBuffer` methods into pure `virtual`s.
  - [x] Hoist all state transition and validation to `IGPUCommandBuffer`. Backend-specific `_impl` functions would return `void`.
  - [x] Detect if the command pool of a command buffer has already been reset with the help of counters in `IGPUCommandPool` and `IGPUCommandBuffer`, as mentioned in #373 and as already being done in #408 (works like a versioning system), and do NOT release resources in this case.
  - [ ] Check for command pool reset before every command buffer method.
  - [x] Hoist `releaseResourcesBackToPool` to `IGPUCommandBuffer`. Context: #408.
  - [ ] `IGPUCommandBuffer`'s last-reset-stamp/counter should be set in `begin` and not in `resetCommon` as currently being done in #408.
  - [ ] Figure out a way to release the command segment list memory on pool reset without keeping track of child command buffer heads.
- [x] OpenGL Command Buffer Baking.
  - [x] When executing certain pre-recorded GL commands it is required to have access to the FBO, VAO and Pipeline caches. If we store these caches in `COpenGL_Queue::CThreadHandler::ThreadInternalStateType` then they can just be passed along to `COpenGLCommandBuffer::executeAll` called in `CThreadHandler::process_request`.
    - [x] FBO cache.
    - [x] VAO cache.
    - [x] Pipeline cache.
  - [x] Test and then remove the old command pool system.
    - [x] Make Example 1 work.
    - [x] Make Example 2 work.
    - [x] Make Example 6 work.
    - [x] Make Example 7 work.
  - [x] Add more commands.
    - [x] ~~`CPipelineBarrierCmd` has a crash bug for some reason.~~
    - [x] Implement `COpenGLCommandBuffer::bindComputePipeline`. (`SOpenGLState::pipeline::compute` can no longer have a `usedShader` member.)
    - [x] This (private Discord link) is possibly related.
    - [x] Add the commands which were not already added as a by-product of making the aforementioned examples work.
  - [x] Remove the old command pool system.
- [ ] Descriptor Lifetime Tracking
Review of the Interfaces
Bug
Using `std::forward` more than once: https://github.com/Devsh-Graphics-Programming/Nabla/pull/345#discussion_r863819861
Command segment won't really be 128k: https://github.com/Devsh-Graphics-Programming/Nabla/pull/345#discussion_r863768889
Design Failure
There's an inversion of control going on: `IGPUCommandBuffer` is calling methods that `IGPUCommandPool` should keep protected or even private.
Therefore `IGPUCommandBuffer::emplace` and possibly even `deleteCommandSegmentList` (as a function taking a head segment parameter) should move to `IGPUCommandPool`.
Quality of Life improvements
Common CRTP base to provide calc_size
To avoid having to type out the same definition of `calc_size` for every fixed-size command, you could have:
template<class CRTP>
class NBL_FORCE_EBO IGPUCommandPool::IFixedSizeCommand
{
public:
template<typename... Args>
static uint32_t calc_size(const Args&... args)
{
return sizeof(CRTP);
}
};
Then for every command that doesn't do weird variable-length shenanigans, like `CDrawIndexedIndirectCountCmd`, you could just do:
class IGPUCommandPool::CDrawIndexedIndirectCountCmd : public IGPUCommandPool::CDrawIndirectCommonBase, public IGPUCommandPool::IFixedSizeCommand<IGPUCommandPool::CDrawIndexedIndirectCountCmd>
{
public:
CDrawIndexedIndirectCountCmd(const core::smart_refctd_ptr<const video::IGPUBuffer>& buffer, const core::smart_refctd_ptr<const video::IGPUBuffer>& countBuffer)
: CDrawIndirectCommonBase(buffer, calc_size(buffer, countBuffer)), m_countBuffer(countBuffer)
{}
protected:
core::smart_refctd_ptr<const IGPUBuffer> m_countBuffer;
};
Abandon inheritance from Interface Commands, have a separate tape for Backend Specific commands
Ok, this one is my own "Design Failure": basically I thought to have the linear array of commands recorded for every backend, but have the backend derive (inherit) from the interface command in order to add extra data.
So while a Vulkan tape would look like this:
{cmdSize,bindPipelineGeneralParams},
{cmdSize,bindDescriptorSetsGeneralParams},
{cmdSize/*,pushConstantsGeneralEmptyStruct*/},
{cmdSize/*,dispatchGeneralEmptyStruct*/},
An OpenGL tape would look like this (via the virtues of inheritance):
{cmdSize,bindPipelineGeneralParams/*,nothingExtraNeeded*/},
{cmdSize,bindDescriptorSetsGeneralParams/*,nothingExtraNeeded*/},
{cmdSize/*,pushConstantsGeneralEmptyStruct*/,pushConstantData},
{cmdSize/*,dispatchGeneralEmptyStruct*/,dispatchParams},
However, things start to get hairy when you think about the ability to generate "pruned" command lists (e.g. thanks to OpenGL validation, or some things just being NOOPs on Vulkan).
If you wanted to not record certain commands because it doesn't make sense to (because Vulkan doesn't need to replay or re-encode them), you end up having special checks for them.
If you start "skipping" commands that have no effect on the state, you may miss holding onto some `smart_refctd_ptr` references and mess up the whole lifetime tracking of the framework (basically, if you do 5 descriptor set binds in quick succession, whether your descriptor sets are kept alive would depend on the backend you're using, which would be insane).
Furthermore, it's just bad for speed and cache, since you'd be skipping over a lot of data you don't need on a particular backend when iterating.
So my new proposal is to make all the interface commands `final` and, for non-Vulkan backends, have two tapes (but still allocated from the same pool, just not sharing segments): one in `IGPUCommandBuffer`, which is the "general tape" for keeping refcounted pointers alive, and another in the derived backend class just for storing backend-specific commands (to iterate over when `execute`-ing). The backend-specific commands would inherit directly from `ICommand` as usual (and `IFixedSizeCommand` whenever applicable).
So then our General tape looks like this (only cares about lifetime tracking for Vulkan-like API):
{cmdSize,bindPipelineGeneralParams},
{cmdSize,bindDescriptorSetsGeneralParams}
and it doesn't need to care about NOOPs such as draw, dispatch, etc.
And our OpenGL tape would just care about storing the necessary info for when `executeAll` is run, so it wouldn't need to hold onto any `smart_refctd_ptr` at all.
{cmdSize,bindTexturesCmd},
{cmdSize,bindSSBOsCmd},
{cmdSize,pushConstantData},
{cmdSize,dispatchParams}
So the commands in a GL tape wouldn't need to correspond 1:1 with the commands in the general tape.
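For illustration, here is a hedged sketch of what recording a single command could look like under the two-tape proposal. The helpers `emplaceGeneral`/`emplaceBackend`, the command types and the `getGLProgramName` accessor are hypothetical names invented for this sketch, not existing Nabla API:

```cpp
// Hypothetical sketch: one general-tape entry purely for lifetime tracking,
// one backend-tape entry with raw replay data for executeAll().
bool COpenGLCommandBuffer::bindComputePipeline_impl(const IGPUComputePipeline* pipeline)
{
    // General tape: exists only to keep the pipeline refcounted until pool reset.
    if (!emplaceGeneral<IGPUCommandPool::CBindComputePipelineCmd>(
            core::smart_refctd_ptr<const IGPUComputePipeline>(pipeline)))
        return false;

    // Backend tape: plain data needed at execution time, no smart pointers.
    const GLuint program = getGLProgramName(pipeline); // hypothetical helper
    return emplaceBackend<COpenGLCommandPool::CUseProgramCmd>(program);
}
```

Note how skipping the backend-tape entry (e.g. because the state didn't change) would not affect lifetime tracking, since the general-tape entry is always recorded.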
Therefore IGPUCommandBuffer::emplace and possibly even the deleteCommandSegmentList (as a function taking a head segment parameter) should move to IGPUCommandPool.
For these:
m_cmdpool->m_commandSegmentPool.allocate(IGPUCommandPool::COMMAND_SEGMENT_SIZE, alignof(IGPUCommandPool::CommandSegment));
and
m_cmdpool->m_commandSegmentPool.deallocate(currSegment, IGPUCommandPool::COMMAND_SEGMENT_SIZE);
can I make wrappers:
CommandSegment* IGPUCommandPool::allocateCommandSegment()
{
return m_commandSegmentPool.allocate(IGPUCommandPool::COMMAND_SEGMENT_SIZE, alignof(IGPUCommandPool::CommandSegment));
}
and
void IGPUCommandPool::deallocateCommandSegment(CommandSegment* segment)
{
m_commandSegmentPool.deallocate(segment, IGPUCommandPool::COMMAND_SEGMENT_SIZE);
}
and not move both of the functions `IGPUCommandBuffer::emplace` and `IGPUCommandBuffer::deleteCommandSegmentList` to `IGPUCommandPool`? If we do this, we wouldn't have to pass the redundant parameters (like `m_segmentListHeadItr` and `m_segmentListTailItr`) every time we `emplace` a new command in `C{Vulkan/OpenGL}CommandBuffer` functions, like:
bool bindIndexBuffer(..)
{
..
m_cmdpool->emplace<IGPUCommandPool::CBindIndexBufferCmd>(m_segmentListHeadItr, m_segmentListTail, /*other required params*/);
..
}
and we can do:
bool bindIndexBuffer(..)
{
..
emplace<IGPUCommandPool::CBindIndexBufferCmd>(/*other required params*/);
..
}
OpenGL Command Buffer Baking
What is happening now
- `SOpenGLContextLocalCache` is cached in the command buffer's `COpenGL_Queue`'s thread internal state (`ThreadInternalStateType`).
- The cached `SOpenGLContextLocalCache` is modified (eventually to reflect new GL context state on "flush") at command-buffer-execution-time.
- "Flushing" (by functions such as `flushStateGraphics` and `flushStateCompute`, etc.) calls GL functions immediately.
- `SCmd<>` structures are analogues to `vkCmd*`.
- Non-shareable GL objects, such as VAOs, FBOs and Shader Programs (because of cached uniform state), are cached inside `SOpenGLContextLocalCache` inside the queue's thread's internal state.
What we want
- `SOpenGLContextLocalCache` is cached in the command buffer itself.
- The cached `SOpenGLContextLocalCache` is modified (eventually to reflect new GL context state on "flush") at command-buffer-record-time. This comment highlights the difference.
- "Flushing" (by functions such as `flushStateGraphics` and `flushStateCompute`, etc.) records GL functions in the command buffer to execute them later when the command buffer executes. This is important so that the GL context does not get out of sync with what the currently executing command expects.
- In light of the point above, `SCmd<>` structures are analogues to GL functions, such as `glBindVertexArray` (with slight renaming in accordance with this), such that they can be recorded in the command buffer when a "flush" function is called.
- According to the very first point, since the stuff in `SOpenGLContextLocalCache` is now cached in the command buffer, non-shareable GL objects should stay in the queue's cache and thus separate from `SOpenGLContextLocalCache`.
Regarding the question above about adding `allocateCommandSegment`/`deallocateCommandSegment` wrappers instead of moving both `IGPUCommandBuffer::emplace` and `IGPUCommandBuffer::deleteCommandSegmentList` to `IGPUCommandPool`:
all suggestions are fine I guess
OpenGL Command Buffer Baking
Replies to the points above:

`SCmd<>` structures are analogues to `vkCmd*`.
Well, not really analogues, but they keep all the arguments necessary to carry out the command's equivalent in a deferred way on another thread.

`SOpenGLContextLocalCache` is cached in the command buffer itself.
Also only used/relevant for methods called between a `begin` and an `end`.

"Flushing" records GL functions in the command buffer to execute them later when the command buffer executes.
Wherever you have a `glCall` in the flush function, you should have an `emplace` onto the second tape and a unique struct type.

Non-shareable GL objects should stay in the queue's cache and thus separate from `SOpenGLContextLocalCache`.
Yes, so in the structs used for the above, you need to keep some data (a pointer and/or hash value) needed to find or create the non-shareable GL objects.
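To make the "emplace onto the second tape" point concrete, here is a hedged sketch of what a single GL call inside a flush function could turn into. The command type, the segment-list member names and the VAO cache key are made up for illustration only:

```cpp
// Before (immediate mode, at execution time):
//     gl->glBindVertexArray(vao);
// After (recorded at command-buffer-record-time, replayed in executeAll()):
void COpenGLCommandBuffer::flushStateGraphics(/* ... */)
{
    // ... state diffing against the command buffer's own SOpenGLContextLocalCache
    // produces `vaoCacheKey`, a hash/key used later to find or create the VAO ...
    m_cmdpool->emplace<COpenGLCommandPool::CBindVertexArrayCmd>(
        m_GLSegmentListHeadItr, m_GLSegmentListTail, // the backend-specific (second) tape
        vaoCacheKey);
}
```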
Then for every command that doesn't do weird variable length shenanigans, like CDrawIndexedIndirectCountCmd, you could just do
Wouldn't every command be just fixed size? For example, even for `CPipelineBarrierCmd` (which I consider could be the prime example of a "variable" size command) we can just have a `core::smart_refctd_dynamic_array` of images and buffers like this, and then derive from `IGPUCommandPool::IFixedSizeCommand<CRTP>`.
Is this not the ideal way to go about this?
Now that I think about it, it might not be the ideal way, because the dynamically allocated memory via `core::smart_refctd_dynamic_array` would not be aligned to cache line size and may cause false sharing, which would defeat the entire purpose of making the commands align to cache line size. So we can try something like this (demonstrated for `CBindDescriptorSetsCmd`; see also the sketch after the list below). It does seem a bit strange though.
Commands affected by this: `CBindDescriptorSetsCmd`, `CPipelineBarrierCmd`, `CNamedBufferSubDataCmd`, `IGPUCommandPool::CExecuteCommands`, `COpenGLCommandPool::CExecuteCommands`, `CWaitEventsCmd`, `CWriteAccelerationStructurePropertiesCmd`, `CBuildAccelerationStructuresCmd`.
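For the commands listed above, a hedged sketch of the trailing-storage alternative (the class name is invented, and it assumes `ICommand`'s constructor takes the total command size, as the fixed-size examples earlier suggest; this is not the engine's actual implementation):

```cpp
// Sketch of a variable-size command: the refcounted pointers live in trailing
// storage right after the command object itself, so everything stays in one
// pool allocation and calc_size() depends on the resource count.
class CCustomPipelineBarrierCmd final : public IGPUCommandPool::ICommand
{
    public:
        static uint32_t calc_size(const uint32_t resourceCount)
        {
            return sizeof(CCustomPipelineBarrierCmd) + resourceCount*sizeof(core::smart_refctd_ptr<const core::IReferenceCounted>);
        }

        CCustomPipelineBarrierCmd(const uint32_t resourceCount, const core::smart_refctd_ptr<const core::IReferenceCounted>* resources)
            : ICommand(calc_size(resourceCount)), m_resourceCount(resourceCount)
        {
            // placement-construct the smart pointers in the trailing storage
            std::uninitialized_copy_n(resources, resourceCount, getResources());
        }

        ~CCustomPipelineBarrierCmd()
        {
            std::destroy_n(getResources(), m_resourceCount);
        }

    private:
        core::smart_refctd_ptr<const core::IReferenceCounted>* getResources()
        {
            return reinterpret_cast<core::smart_refctd_ptr<const core::IReferenceCounted>*>(this+1);
        }

        const uint32_t m_resourceCount;
};
```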
Descriptor Lifetime Tracking
GPU resources referenced by descriptor sets go out of scope and get destroyed before we get the chance to bind the descriptor set and use them. To solve this, we make the descriptor set hold `smart_refctd_ptr`s to these resources.
When we create a descriptor pool we declare how many descriptors of each type it can hold, at maximum, via the `poolSizeCount` and `poolSizes` params:
core::smart_refctd_ptr<IDescriptorPool> createDescriptorPool(
IDescriptorPool::E_CREATE_FLAGS flags, uint32_t maxSets, uint32_t poolSizeCount,
const IDescriptorPool::SDescriptorPoolSize* poolSizes) override
Note: Don't confuse this with the `maxSets` member, which tells you how many descriptor sets can be allocated from the pool; for all intents and purposes, it can be thought of as referring to the `VkDescriptorSet` handles themselves, which would occupy a constant amount of space regardless of what or how many descriptors the set references.
Using these params we can create a `core::smart_refctd_dynamic_array` for every descriptor type and store it in the `IDescriptorPool` instance. `IGPUDescriptorSet` would then store pointers (or offsets) into these dynamic arrays.
Note: Immutable samplers are kept alive by `IGPUDescriptorSetLayout` via the `SBinding::samplers` member. `IGPUDescriptorSetLayout` in turn is kept alive by the `IGPUDescriptorSet`, so nothing needs to be done for immutable samplers. However, for mutable samplers lifetime tracking is required, and they can be treated as just another descriptor type with their own separate memory block or `refctd_dynamic_array` (see Open Question 1 below).
Sidenote: This should be a raw pointer because `SImageInfo` is a throw-away structure mainly used to update the descriptor sets and AFAICT we don't care about lifetime there.
Since we know the descriptor set layout when we create an `IGPUDescriptorSet`,
core::smart_refctd_ptr<IGPUDescriptorSet> createDescriptorSet_impl(IDescriptorPool* pool,
core::smart_refctd_ptr<const IGPUDescriptorSetLayout>&& layout) override;
we can figure out the exact number of descriptors, of each type, that the set can ever use and that need to be allocated from the dynamic arrays stored in its parent pool.
Descriptor sets allocated from a descriptor pool must not be freed (via `vkFreeDescriptorSets`) unless the pool was created with the `VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT` bit set.
Descriptor set allocation failure due to fragmentation should not happen if:
- The descriptor pool is not created with the `VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT` bit set.
- The descriptor pool is created with the `VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT` bit set BUT it has not had any descriptor sets freed since it was created or most recently reset.
- All sets allocated from the pool since it was created or most recently reset use the same number of descriptors (of each type) and the requested allocation also uses that same number of descriptors (of each type).
General Idea/Algorithm
When creating an `IGPUDescriptorSet`, iterate through the given `IGPUDescriptorSetLayout`'s bindings and, depending on the descriptor type, reserve space in the parent descriptor pool's descriptor storage (see Open Question 1 for more details). Store the offsets/raw pointers into the pool's descriptor storage in `IGPUDescriptorSet`. When updating the descriptor set, use these offsets/raw pointers to put the actual resource's `smart_refctd_ptr`s into the pool's storage. A rough sketch of this flow follows.
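A minimal sketch of that create/update flow, assuming hypothetical pool/layout helpers (`reserveDescriptors`, `getDescriptorStorage`, `getTotalDescriptorCount`) and a per-type offset array on the set; it is meant to pin down the idea, not the final API:

```cpp
// Creation: reserve one contiguous range per descriptor type in the pool's storage.
IGPUDescriptorSet::IGPUDescriptorSet(core::smart_refctd_ptr<IDescriptorPool>&& pool, core::smart_refctd_ptr<const IGPUDescriptorSetLayout>&& layout)
    : m_pool(std::move(pool)), m_layout(std::move(layout))
{
    for (uint32_t t = 0u; t < asset::EDT_COUNT; ++t)
    {
        const auto type = static_cast<asset::E_DESCRIPTOR_TYPE>(t);
        m_descriptorStorageOffsets[t] = m_pool->reserveDescriptors(type, m_layout->getTotalDescriptorCount(type));
    }
}

// Update: the write lands directly in the pool's storage, keeping the resource alive.
void IGPUDescriptorSet::writeDescriptor(const asset::E_DESCRIPTOR_TYPE type, const uint32_t offsetInSet, core::smart_refctd_ptr<const asset::IDescriptor>&& descriptor)
{
    m_pool->getDescriptorStorage(type)[m_descriptorStorageOffsets[type]+offsetInSet] = std::move(descriptor);
}
```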
Required (and Foreseeable) Modifications to the OpenGL Backend
1. Currently, the storage of `smart_refctd_ptr<IDescriptor>` comes from `COpenGLDescriptorSet` in the form of `refctd_dynamic_array`s for each descriptor type. This will be removed as we're adding this storage in `IDescriptorPool` so that it remains common for all backends. All of this should move to the common base `IGPUDescriptorSet`.
2. Storage for these OpenGL-specific arrays should come from a (currently non-existing) `COpenGLDescriptorPool`? That means we pool the allocation for these arrays, for all `COpenGLDescriptorSet`s that will ever be allocated from this pool. Again, this can be done because we know, beforehand, the exact number of descriptors of each type that will be allocated from this pool. Allocation from the pool's dynamic arrays/memory block should follow the same policies regarding allocation failure due to fragmentation as mentioned above. The `SMultibindParams` will be initialized in the `COpenGLDescriptorSet` constructor as is happening currently. `COpenGLDescriptorSet` could also store offsets/raw pointers into these OpenGL-specific arrays if required.
3. TODO: Is there anything special that needs to be done to handle dynamic offsets?
4. TODO: Can things from `IEmulatedDescriptorSet` be hoisted to `IDescriptorSetLayout`?
Special Flags (TODO)
- `VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT`
- `VK_DESCRIPTOR_BINDING_UPDATE_UNUSED_WHILE_PENDING_BIT`
- `VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT`
Open Questions:
- Until we update a descriptor set we don't have the actual `smart_refctd_ptr`s to the resources that need to go into the pool's dynamic arrays; that means we need a way to mark the space reserved for a descriptor set until it's updated, so that other descriptor sets don't try to claim that space in the meanwhile. If we use `nullptr` to indicate empty space, what is a good default value for reserved space?
I have some suggestions for this:
A. Make a `refctd_dynamic_array` of
struct
{
bool isReserved;
core::smart_refctd_ptr<const IDescriptor> descriptor;
};
instead of just `core::smart_refctd_ptr<const IDescriptor>`.
B. Ditch `refctd_dynamic_array` entirely and have a fixed-size memory block for each descriptor type instead. Then either use a `GeneralpurposeAddressAllocator` or `LinearAddressAllocator` depending upon whether freeing of descriptor sets is allowed or not. This will most likely require us to do this (private Discord link). If we do end up going this route, I will remove the private link and update the comment with the exact thing that needs to happen.
Using these params we can create core::smart_refctd_dynamic_array for every descriptor type and store it in the IDescriptorPool instance. IGPUDescriptorSet would then, store pointers (or offsets) into these dynamic arrays.
What's the template parameter in the array?
In my opinion it would be sufficient to create a "plain" array aligned to the Backend's Final Class Descriptor Set's "binding record" and large enough to store that many multiples of it.
For example:
protected:
struct SCombinedSampler
{
core::smart_refctd_ptr<IGPUImageView> view;
core::smart_refctd_ptr<IGPUSampler> sampler;
};
// a struct that wont trigger any ctors/dtors etc. also facilitates pointer arithmetic
template<typename T>
struct alignas(T) StorageTrivializer
{
uint8_t storage[sizeof(T)];
};
std::unique_ptr<StorageTrivializer<SCombinedSampler>[]> combinedSamplerStorage;
std::unique_ptr<StorageTrivializer<core::smart_refctd_ptr<IGPUImageView>>[]> imageStorage; // storage, and input attachment
std::unique_ptr<StorageTrivializer<core::smart_refctd_ptr<IGPUBufferView>>[]> UTB_STBstorage;
std::unique_ptr<StorageTrivializer<core::smart_refctd_ptr<IGPUBuffer>>[]> UBO_SSBOStorage; // includes dynamics
The arrays for storing auxiliary backend stuff like `SMultibindParams` you can allocate and manage separately in the derived class.
You can however generalize and precompute offsets from binding to storage index (which you can query via binary search/lower bound) for each descriptor set layout.
Then to map from DS + binding + type + index in binding array -> storage index in pool, all you need to do is binary search that offset within a DS and add the DS's storage offset within the pool.
Hint, maybe hoist these? I don't remember what they do but seems similar https://github.com/Devsh-Graphics-Programming/Nabla/blob/master/src/nbl/video/COpenGLDescriptorSet.h#L356
P.S. In the `asset::E_DESCRIPTOR_TYPE` enum, we should probably comment out `ACCELERATION_STRUCTURE_NV`, `INLINE_UNIFORM_BLOCK_EXT` and `MUTABLE_VALVE` as we don't support these. Also rename `ACCELERATION_STRUCTURE_KHR` to just `ACCELERATION_STRUCTURE`.
Note: Immutable samplers are kept alive by IGPUDescriptorSetLayout via the SBinding::samplers member.
Correct.
IGPUDescriptorSetLayout in turn is kept alive by the IGPUDescriptorSet, so nothing needs to be done for immutable samplers.
Correct.
However, for mutable samplers lifetime tracking is required, and they can be treated as just another descriptor type with their own separate memory block or refctd_dynamic_array (see Open Question 1 below).
~~Just use a struct that's essentially a pair of smart pointers to IGPUImage and IGPUSampler, see the code snippet above.~~
SoA is best.
Sidenote: This should be a raw pointer because SImageInfo is a throw-away structure mainly used to update the descriptor sets and AFAICT we don't care about lifetime there.
No, because `desc` is a smart pointer too, cf. https://github.com/Devsh-Graphics-Programming/Nabla/blob/c009091c572c4beaa587f4dff7b343a7ad67a843/include/nbl/asset/IDescriptorSet.h#L59
We use smart pointers for arguments because you don't need to keep the smart pointer references alive somewhere else while you build the `SWriteDescriptorSet` or `SCopyDescriptorSet` argument arrays.
Pure ergonomics: right now we can create the image/buffer/sampler and assign it right away to an `SDescriptorInfo` struct field and not worry about lifetimes; if we made the change, we'd need a whole load of temporary variables.
we can figure out the exact number of descriptors, of each type, that the set can ever use and that need to be allocated from the dynamic arrays stored in its parent pool.
Yes, this means that you can create a mapping from every binding to an offset in some array that provides backing storage for some metadata structs (OpenGL does this but the storage is SoA not AoS because of how you pass parameters to OpenGL multibind, also it groups descriptors by type into a single array differently).
e.g.
struct BindingToStorage
{
uint32_t binding;
uint32_t storageOffset;
// if you need to store some other offsets, then inherit from the struct and use that
inline bool operator<(const BindingToStorage& other) const
{
return binding<other.binding;
}
};
Then you can binary search for the binding you wish to modify/get, add the offset (for an array binding) and voila you have your storage slot.
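For example, the lookup could look like this (a sketch assuming the layout keeps such records sorted by binding in a plain array; the function name is illustrative):

```cpp
// Map (binding, index-in-binding-array) -> index into the pool-backed storage,
// using the sorted per-layout BindingToStorage records described above.
uint32_t getStorageIndex(const BindingToStorage* recordsBegin, const BindingToStorage* recordsEnd, const uint32_t binding, const uint32_t indexInBindingArray)
{
    const BindingToStorage key = { binding, 0u };
    const auto found = std::lower_bound(recordsBegin, recordsEnd, key);
    assert(found != recordsEnd && found->binding == binding);
    return found->storageOffset + indexInBindingArray;
}
```

The caller would then add the set's per-type storage offset within the pool (cf. the member list further down) to get the final slot.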
Descriptor set allocation failure due to fragmentation should not happen if:
* Descriptor pool is not created with the `VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT` bit set.
This is trivial to satisfy, as you can use a LinearAddressAllocator (or just a pointer/cursor that you bump).
If nothing is ever freed, it's impossible to fragment and fail an allocation of any storage block, as long as the total number of descriptors of a given type stays below the up-front declared maximum.
* Descriptor pool is created with the `VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT` bit set BUT it has not had any descriptor sets freed since it was created or most recently reset.
This is basically a situation where you have a brand new "clean slate" allocator and the constraint is that it must behave like a linear allocator.
Well, I have good news for you: the `GeneralPurposeAddressAllocator` will give you the exact same allocations that `LinearAddressAllocator` would, as long as:
- the minimum allocation or block size is `1` (or there was no allocation less than the minimum)
- the alignments of subsequent `alloc`s have been the same
- there have been no `free`s since the last `reset`
- If all sets allocated from the pool since it was created or most recently reset use the same number of descriptors (of each type) and the requested allocation also uses that same number of descriptors (of each type).
GeneralPurposeAddressAllocator will also satisfy this requirement.
Conclusion
To allocate offsets into storage, use `GeneralPurposeAddressAllocator<uint32_t>` with a minimum allocation and block size of 1 if the pool has the `VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT` flag, otherwise use `LinearAddressAllocator<uint32_t>`.
Important points for both allocators:
- always use an alignment of `1`
- allocate the offset for all descriptors of a given type in a pool at once (only in this way can you satisfy the fragmentation constraints of Vulkan!); a sketch of this policy follows
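A sketch of that policy, with illustrative member and flag names, and assuming the address allocators' usual `alloc_addr(size, alignment)` interface:

```cpp
// One offset allocator per descriptor type; which kind was chosen at pool
// creation depends on whether sets can be freed individually.
uint32_t IDescriptorPool::allocateDescriptorOffsets(const asset::E_DESCRIPTOR_TYPE type, const uint32_t count)
{
    // alignment is always 1, and all descriptors of one type for a set are
    // allocated in a single call, which keeps the allocator behaving like a
    // linear one under the Vulkan fragmentation rules quoted above
    if (m_flags & ECF_FREE_DESCRIPTOR_SET_BIT)
        return m_generalAllocators[type].alloc_addr(count, 1u); // GeneralPurposeAddressAllocator<uint32_t>
    return m_linearAllocators[type].alloc_addr(count, 1u);      // LinearAddressAllocator<uint32_t>
}
```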
When creating a IGPUDescriptorSet iterate through the given IGPUDescriptorSetLayout's bindings and depending on the descriptor type reserve space in the parent descriptor pool's descriptor storage (see Open Question 1 for more details). Store the offsets/raw pointers into the pool's descriptor storage, in IGPUDescriptorSet. When updating the descriptor set, use these offsets/raw pointers to put the actual resource's ref_ctd_ptrs into the pool's storage.
As a corollary of my previous 4 comments, we can see that an `IGPUDescriptorSet` must hold the following:
core::smart_refctd_ptr<IGPUDescriptorSetLayout> m_layout; // both to keepalive and to query offsets into storage
core::smart_refctd_ptr<IGPUDescriptorPool> m_pool; // both to keepalive and to obtain memory for bindings
// offsets of the set's allocations within the pool to apply as a shift to all layout precomputed offsets
uint32_t m_combinedSamplerStorageOffset;
uint32_t m_imageViewStorageOffset;
uint32_t m_UTB_STBStorageOffset;
uint32_t m_UBO_SSBOStorageOffset;
// if you have some other arrays per-backed with extra data, then store more offsets in derived class
P.S. It's best if you precompute the offsets per-binding and store them in the DS layout; you only need to do it once.
Required (and Foreseeable) Modifications to the OpenGL Backend
1. Currently, the storage of `smart_refctd_ptr<IDescriptor>` comes from `COpenGLDescriptorSet` in the form of `refctd_dynamic_array`s for each descriptor type. This will be removed as we're adding this storage in `IDescriptorPool` so that it remains common for all backends. All of [this](https://github.com/Devsh-Graphics-Programming/Nabla/blob/c009091c572c4beaa587f4dff7b343a7ad67a843/src/nbl/video/COpenGLDescriptorSet.h#L53-L90) should move to the common base `IGPUDescriptorSet`.
You need similar code in `IDescriptorSetLayout`, but this OpenGL code actually computes slightly different offsets. The reason is that OpenGL groups some descriptor types together (UTB and Combined Samplers share slot address space).
Actually, the main class providing the slots and the mapping to `core::smart_refctd_ptr` is `IEmulatedDescriptorSet`. So we've always had a "working" version of DLT in OpenGL, just not in VK.
Also, that code should move from `COpenGLDescriptorSet` to `COpenGLDescriptorSetLayout`.
2. Storage for [these](https://github.com/Devsh-Graphics-Programming/Nabla/blob/c009091c572c4beaa587f4dff7b343a7ad67a843/src/nbl/video/COpenGLDescriptorSet.h#L92-L128) OpenGL specific arrays should come from a (currently non-existing) `COpenGLDescriptorPool`?
Yes.
That means we pool the allocation for these arrays, for all `COpenGLDescriptorSet`s that will ever be allocated from this pool. Again, this can be done because we know, beforehand, the exact number of descriptors of each type that will be allocated from this pool. Allocation from the pool's dynamic arrays/memory block should follow the same policies regarding allocation failure due to fragmentation as mentioned above.
Good conclusion, as you can see that my last 5 comments concur with your line of reasoning.
The `SMultibindParams` will be initialized in the `COpenGLDescriptorSet` constructor as it is happening currently. `COpenGLDescriptorSet` could also store offsets/raw pointers into these OpenGL specific arrays if required.
Yes, here's the huge pitfall we need to avoid (smart pointer circular references):
The Descriptor Pool only allocates and hands out raw bytes.
It's the Descriptor Set that must call placement new and the destructor to initialize/deinitialize the arrays of descriptor binding data.
In the Constructor:
const uint32_t bufferBindingCount = m_layout->getBufferBindingCount();
uint32_t offset = m_pool->allocateBufferBindings(bufferBindingCount);
SBufferBinding* bufferBindings = new(m_pool->getBufferBindings()+offset) SBufferBinding[bufferBindingCount];
In the Destructor:
for (auto i=0u; i<m_layout->getBufferBindingCount(); i++)
(bufferBindings+i)->~SBufferBinding(); // destroy each element, not just the first
m_pool->freeBufferBindings(bufferBindingCount);
Basically it's as if the Set is the `std::vector` and the Pool is the `std::allocator`.
3. _TODO: Is there anything special that needs to be done to handle dynamic offsets?_
AFAIK no, only a per-non-Vulkan-backend class to emulate/track them.
See the OpenGL Descriptor Set: it stores a little bit of extra data per descriptor to track what arguments need to be passed to the multibind functions.
4. _TODO: Things from `IEmulatedDescriptorSet` can be hoisted to `IDescriptorSetLayout`_?
YES, please do.
It's a great idea, as `IEmulatedDescriptorSet` already provides similar functionality to Descriptor Lifetime Tracking for `ICPUDescriptorSet`.
But of course in the `nbl::asset` namespace we don't do any pooling (the backing memory for binding records is in the actual `ICPUDescriptorSet`, not any Pool) because we want to be more agile with our resources.
Special Flags (TODO)
My idea is to have a revision atomic counter which is incremented each time a descriptor binding with neither the AFTER_BIND nor the WHILE_PENDING flag is updated.
First of all, to correctly invalidate a command buffer after a "forbidden" descriptor set update, we need to know the version of the descriptor set at the time of first bind, so we need to store an `unordered_map<IDescriptorSet,IDescriptorSet::revision_counters_v>` as a member in `IGPUCommandBuffer` so we can do some loose validation at submit time (or whenever we check for implicit invalidation).
We can also optimize and not put any DS with that flag on the validation map above, if and only if all of its bindings have the flags that skip the revision counter update.
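A possible shape of that mechanism (all names here are hypothetical, this is just to pin down the idea, not a final interface):

```cpp
// Bindings without UPDATE_AFTER_BIND / UPDATE_UNUSED_WHILE_PENDING bump this
// counter on update; command buffers snapshot it at first bind.
class IGPUDescriptorSet /* ... */
{
    public:
        using revision_counter_t = uint64_t;
        inline revision_counter_t getRevision() const { return m_revision.load(); }
        inline void incrementRevision() { m_revision.fetch_add(1u); }
    private:
        std::atomic<revision_counter_t> m_revision{0u};
};

// IGPUCommandBuffer (sketch):
//   on bindDescriptorSets: m_boundDSRevisions[ds] = ds->getRevision();
//   at submit-time validation:
//     for (const auto& [ds, revisionAtBind] : m_boundDSRevisions)
//         if (ds->getRevision() != revisionAtBind)
//             markInvalid(); // a "forbidden" update happened after first bind
```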
* `VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT`
This is trivial to support; we don't even validate that all bindings are bound (updated at least once) before we bind or use the descriptor set. If we start validating that all bindings are populated, then we should take the flag into account.
* `VK_DESCRIPTOR_BINDING_UPDATE_UNUSED_WHILE_PENDING_BIT`
We would simply not increment the revision counter when updating any binding with this flag set.
Because this flag only allows us to update bindings which are not used by the command buffer (depending on the presence of `PARTIALLY_BOUND` the definition of "used" changes from static to dynamic usage, but we don't check that so we don't care), we don't need to worry about a previously bound descriptor's reference count dropping to zero while a submit might need it to stay alive, because it's not supposed to be used anyway.
* `VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT`
The support for this flag is a must; too much code and too many examples rely on this behaviour, which on OpenGL works out of the box.
To facilitate update-after-bind, we don't increment the revision counter when updating any binding with that flag set.
Because `IGPUCommandBuffer` will hold a reference to any bound `IGPUDescriptorSet`, and thanks to the fact that updating a descriptor set currently in use by a pending command buffer is not valid usage (an explicit error), we don't need to worry about "backing up" the old bindings somewhere else.
OpenGL already has such a system and it uses it to detect when it needs to do a rebind (call multibind again) even if the bound DS has not changed. We can leave OpenGL with this default "Update after Bind always on" because the hoisted validation will throw errors when we attempt to use this feature without declaring all the flags (this prevents unintended divergence in behaviour between backends while giving no errors w.r.t. valid Nabla usage).
Two things to watch out for:
Multiple descriptors with this flag set can be updated concurrently in different threads, though the same descriptor must not be updated concurrently by two threads.
Yet another reason to precompute every offset in the Descriptor Set Layout and make offset-getting a `const` method which performs a binary search.
Also, this constrains us to allocating all storage for a pool up front and not reallocating it during its lifetime.
Descriptors with this flag set can be updated concurrently with the set being bound to a command buffer in another thread
Can't think of any caveats right now.
Until we update a descriptor set we don't have the actual ref_ctd_ptr to resources that need to go into the pool's dynamic arrays, that means we need a way to mark the space reserved for a descriptor set until its updated so that other descriptor sets don't try to claim that space in the meanwhile. If we use nullptr to indicate empty space what is a good default value for reserved space?
Well yes, after allocating the space for structs that may include a `smart_refctd_ptr`, you need to run placement new on the storage (right after allocation, in the descriptor set constructor) to call the constructors, which initialize any `smart_refctd_ptr` member to `nullptr`.
Otherwise you end up with random jank in the `smart_refctd_ptr`, and when you try to assign to it, it will try to drop the object pointed to by its internal raw pointer, whose value would be jank.
B. Ditch `refctd_dynamic_array` entirely and have a fixed-size memory block for each descriptor type instead. Then either use a `GeneralpurposeAddressAllocator` or `LinearAddressAllocator` depending upon whether freeing of descriptor sets is allowed or not. This will most likely require us to do this (private Discord link). If we do end up going this route, I will remove the private link and update the comment with the exact thing that needs to happen.
Do B, and forget about A.
However the discord link doesn't quite work for me.
Let's use Structure of Arrays for the Combined Image Sampler
Since we're using SoA for OpenGL descriptor binding metadata, we can use it everywhere consistently.
Make your Address Space count in bindings not bytes
This way we have a single index for accessing all SoA data for a binding (even in derived classes like `COpenGLDescriptorSet`).
Abandon SCombinedImageSampler
This means using a slightly different `SBindingInfo` for CombinedImageSampler bindings (two separate offsets to keep: one for the image views and one for the samplers, since you want to skip tracking immutable samplers, which are already tracked by the layout).
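Something along these lines (the struct and field names are illustrative, not an existing Nabla type):

```cpp
// Per-binding record for COMBINED_IMAGE_SAMPLER once SCombinedImageSampler storage
// is dropped: image views and mutable samplers live in separate SoA arrays.
struct SCombinedImageSamplerBindingInfo
{
    uint32_t binding;
    uint32_t count;                       // descriptors in this binding's array
    uint32_t imageViewStorageOffset;      // into the pool's image-view array
    uint32_t mutableSamplerStorageOffset; // ~0u when the binding uses immutable samplers (tracked by the layout)
};
```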
Let's keep a count of how many descriptors of a given type there are in a Descriptor Set Layout
This makes certain operations easier, such as allocation and iteration.
Use Dependency Injection to get the backing memory in IDescriptorSet
In software engineering, dependency injection is a design pattern in which an object or function receives other objects or functions that it depends on.
Make the `IDescriptorSet` declare some virtuals which it does not care about the implementation of:
protected:
virtual core::smart_refctd_ptr<IDescriptor>* getDescriptorRefcountingStorage(const E_DESCRIPTOR_TYPE type) = 0;
virtual core::smart_refctd_ptr<ISampler>* getSamplerRefcountingStorage() = 0;
then use these throughout the common hoisted code to deal with lifetime tracking, i.e.
protected:
// gets called by the derived class constructor as soon as the setup necessary for `get***RefcountingStorage` is complete
inline void createDescriptors()
{
allocateDescriptors();
std::uninitialized_default_construct_n(getSamplerRefcountingStorage(),m_layout->getMutableSamplerCount());
for (auto type=0u; type<asset::EDT_COUNT; type++)
std::uninitialized_default_construct_n(getDescriptorRefcountingStorage(type),m_layout->getTotalDescriptorCount(type));
}
~IDescriptorSet()
{
std::destroy_n(getSamplerRefcountingStorage(),m_layout->getMutableSamplerCount());
for (auto type=0u; type<asset::EDT_COUNT; type++)
std::destroy_n(getDescriptorRefcountingStorage(type),m_layout->getTotalDescriptorCount(type));
}
The `IGPUDescriptorSet` can allocate its offsets into `IDescriptorPool`'s storage in its `static` creation-helper method, pass them to its own constructor, and implement the `get**RefcountingStorage` methods as adding these offsets to the `m_pool->m_descriptorTracking[type]` arrays.
While `ICPUDescriptorSet` can allocate some `unique_ptr<core::StorageTrivializer<core::smart_refctd_ptr<IDescriptor>>[]>` arrays in its static `create` method and r-value move these to its constructor.
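For the CPU side, an illustrative (not actual) implementation of those virtuals could then be as simple as:

```cpp
// ICPUDescriptorSet owns its backing storage directly, no pool involved.
class ICPUDescriptorSet /* : public asset::IDescriptorSet<...> */
{
    protected:
        core::smart_refctd_ptr<IDescriptor>* getDescriptorRefcountingStorage(const E_DESCRIPTOR_TYPE type) override
        {
            return reinterpret_cast<core::smart_refctd_ptr<IDescriptor>*>(m_descriptorStorage[type].get());
        }
        core::smart_refctd_ptr<ISampler>* getSamplerRefcountingStorage() override
        {
            return reinterpret_cast<core::smart_refctd_ptr<ISampler>*>(m_mutableSamplerStorage.get());
        }
    private:
        // allocated in the static create() method and moved into the constructor
        std::unique_ptr<core::StorageTrivializer<core::smart_refctd_ptr<IDescriptor>>[]> m_descriptorStorage[EDT_COUNT];
        std::unique_ptr<core::StorageTrivializer<core::smart_refctd_ptr<ISampler>>[]> m_mutableSamplerStorage;
};
```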
`IDescriptorPool` needs to declare `IGPUDescriptorSet` a `friend`
In order for the above to work, the GPU Descriptor Set needs to be able to access internal arrays of `IDescriptorPool`; it's bad to declare public methods that give access to the bindings to just anyone, so friendship is preferred.
And while we're at it, we can make the `allocate` and `free` methods private.
All descriptors inherit from `IDescriptor`, which inherits from `IReferenceCounted`
So all your DLT functions can deal with `core::smart_refctd_ptr<IDescriptor>` (and `core::smart_refctd_ptr<ISampler>` for samplers), instead of reinterpret_cast-ing to and from `uint8_t*`.
But of course please keep separate arrays (and start counting from 0) for separate descriptor types, so that SoA metadata can be attached by derived classes.
Wouldn't every command be just fixed size? [...] Is this not the ideal way to go about this? (quoting the question above)
This is the correct way: https://github.com/Devsh-Graphics-Programming/Nabla/blob/915dbc4f3374389ba27b1fe40631f206b91d574a/include/nbl/video/IGPUCommandPool.h#L386, but obviously a bad example because the engine's max descriptor set count is hardcoded to 4.
Commands affected by this: `CBindDescriptorSetsCmd`, `CPipelineBarrierCmd`, `CNamedBufferSubDataCmd`, ... (quoting the list above)
Why would `CNamedBufferSubDataCmd` be affected?
but obviously a bad example because engine max descriptor set count is hardcoded to 4.
Okay I will change this to a flat array.
Why would `CNamedBufferSubDataCmd` be affected?
It is because of this member: https://github.com/Devsh-Graphics-Programming/Nabla/blob/046481ee80c8906170deb001cdb9a8b7932553b7/src/nbl/video/COpenGLCommandPool.h#L652
Ah ok, you meant `COpenGLCommandPool::CNamedBufferSubDataCmd`, not a generic tape command.
There isn't a generic `CNamedBufferSubDataCmd`. I don't think it's required, because whichever `IGPUCommandBuffer` method eventually emplaces this command on the backend-specific tape would have already saved a reference to the ref-counted resource on the generic tape.
Stuff Carried over from #408
https://github.com/Devsh-Graphics-Programming/Nabla/pull/408#discussion_r940636518
https://github.com/Devsh-Graphics-Programming/Nabla/pull/408#discussion_r1040830105
https://github.com/Devsh-Graphics-Programming/Nabla/pull/408#discussion_r940693463
Some ideas to reduce complexity of Segment List of Commands
Now that we have explored the problem space more thoroughly, allow me to suggest a better design, which I believe makes things a lot cleaner without adding significant overhead (if any):
1. Get rid of `CCommandSegment::Iterator` entirely, and add `~CCommandSegment`
Make `CCommandSegment` responsible for freeing all of its commands by adding a destructor, like:
~CCommandSegment()
{
auto* cmd = reinterpret_cast<ICommand*>(m_data);
while ((cmd != segment_end()) && (cmd->getSize() != 0)) // stop at the segment end or at a zero-size sentinel
{
auto* nextCmd = reinterpret_cast<ICommand*>(reinterpret_cast<uint8_t*>(cmd) + cmd->getSize());
cmd->~ICommand();
cmd = nextCmd;
}
}
Note: this can be extended easily. For example, the GL backend is required to execute commands, so we could've added a `CCommandSegment::execute` which just executes all of the commands in the segment.
This will make the `deleteCommandSegmentList` routine as simple as:
void deleteCommandSegmentList(CCommandSegment*& head, CCommandSegment*& tail)
{
if (!head)
return;
auto freeSegment = [this](CCommandSegment*& segment)
{
if (segment)
{
segment->~CCommandSegment();
m_pool.deallocate(segment, COMMAND_SEGMENT_SIZE);
segment = nullptr;
}
};
for (auto& segment = head; segment;)
{
auto nextSegment = segment->getNext();
freeSegment(segment);
segment = nextSegment;
}
// No need to nullify head.
if (tail)
tail = nullptr;
}
2. The List-of-List data structure to facilitate Command Pool reset
As mentioned here https://github.com/Devsh-Graphics-Programming/Nabla/pull/345#discussion_r1041104593, we make a doubly linked list of segment list heads to facilitate resetting of the command pool. Including this comment as well https://github.com/Devsh-Graphics-Programming/Nabla/pull/345#discussion_r1040939380, I think we can make a reusable data structure as follows:
class CCommandListPool
{
public:
CCommandListPool() : m_pool(COMMAND_SEGMENTS_PER_BLOCK*COMMAND_SEGMENT_SIZE, 0u, MAX_COMMAND_SEGMENT_BLOCK_COUNT, MIN_POOL_ALLOC_SIZE) {}
bool appendToList(CCommandSegment*& head, CCommandSegment*& tail)
{
auto segment = m_pool.emplace<CCommandSegment>();
if (!segment)
{
assert(false);
return false;
}
if (tail)
{
tail->setNext(segment);
tail = tail->getNext();
}
else
{
assert(!head && "List should've been empty.");
tail = segment;
head = segment;
// Add to the list to pool of lists.
{
if (!m_head && !m_tail)
{
m_head = segment;
m_tail = segment;
}
else
{
assert(m_head && m_tail);
m_tail->setNextHead(segment);
segment->setPrevHead(m_tail);
m_tail = m_tail->getNextHead();
}
}
}
return bool(segment);
}
void deleteList(CCommandSegment*& head, CCommandSegment*& tail)
{
if (!head)
return;
// Remove the list from pool of lists.
{
assert(m_head && m_tail && "Cannot have a freely dangling command list.");
auto oneBefore = head->getPrevHead();
auto oneAfter = head->getNextHead();
if (oneAfter)
oneAfter->setPrevHead(oneBefore);
if (oneBefore)
oneBefore->setNextHead(oneAfter);
// update the pool's list-of-lists head/tail if the deleted list was at either end
if (head == m_head)
m_head = oneAfter;
if (head == m_tail)
m_tail = oneBefore;
}
auto freeSegment = [this](CCommandSegment*& segment)
{
if (segment)
{
segment->~CCommandSegment();
m_pool.deallocate(segment, COMMAND_SEGMENT_SIZE);
segment = nullptr;
}
};
for (auto& segment = head; segment;)
{
auto nextSegment = segment->getNext();
freeSegment(segment);
segment = nextSegment;
}
// No need to nullify head.
if (tail)
tail = nullptr;
}
void clear()
{
CCommandSegment* dummy = nullptr;
for (auto* currHead = m_head; currHead;)
{
auto* nextHead = currHead->getNextHead();
// We don't (and also can't) nullify the tail here because when the command buffer detects that its parent pool has been reset
// it nullifies both head and tail itself.
deleteList(currHead, dummy);
currHead = nextHead;
}
m_head = nullptr;
m_tail = nullptr;
}
private:
CCommandSegment* m_head = nullptr;
CCommandSegment* m_tail = nullptr;
core::CMemoryPool<core::PoolAddressAllocator<uint32_t>, core::default_aligned_allocator, false, uint32_t> m_pool;
};
One `CCommandListPool` per `IGPUCommandPool`, which will allocate `CCommandList`s (essentially a pair of `CCommandSegment*`), per `IGPUCommandBuffer`.
After the changes are made, the `IGPUCommandPool::emplace` and `IGPUCommandPool::deleteCommandSegmentList` methods become trivial:
template <typename Cmd, typename... Args>
Cmd* emplace(CCommandSegment*& segmentListHead, CCommandSegment*& segmentListTail, Args&&... args)
{
if (!segmentListTail && !m_commandListPool.appendToList(segmentListHead, segmentListTail))
return nullptr;
auto newCmd = [&]() -> Cmd*
{
void* cmdMem = segmentListTail->allocate<Cmd>(args...);
if (!cmdMem)
return nullptr;
return new (cmdMem) Cmd(std::forward<Args>(args)...);
};
auto cmd = newCmd();
if (!cmd)
{
if (!m_commandListPool.appendToList(segmentListHead, segmentListTail))
return nullptr;
cmd = newCmd();
if (!cmd)
{
assert(false);
return nullptr;
}
}
return cmd;
}
void deleteCommandSegmentList(CCommandSegment*& segmentListHead, CCommandSegment*& segmentListTail)
{
m_commandListPool.deleteList(segmentListHead, segmentListTail);
}
So we might as well move them to `IGPUCommandBuffer`, and make it a `friend` of `IGPUCommandPool` so it can access the `CCommandListPool` data structure.
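For example, the command-buffer-side helper could then boil down to something like this (a sketch; it assumes the head/tail iterators already live in `IGPUCommandBuffer` as proposed in #408):

```cpp
// IGPUCommandBuffer is a friend of IGPUCommandPool, so it can reach the
// CCommandListPool through the pool's otherwise non-public emplace.
template <typename Cmd, typename... Args>
Cmd* IGPUCommandBuffer::emplace(Args&&... args)
{
    // head/tail live here, so C{Vulkan/OpenGL}CommandBuffer methods no longer
    // need to pass them explicitly on every call
    return m_cmdpool->emplace<Cmd>(m_segmentListHeadItr, m_segmentListTail, std::forward<Args>(args)...);
}
```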
The function to which most of `~IGPUDescriptorSet` needs to move, `IDescriptorPool::deleteSetStorage`, can be called from any thread.
A similar concern exists for `IGPUCommandBuffer::releaseResourcesBackToPool()`, which may be called by `~IGPUCommandBuffer`.
This should be easy to solve with a mutex in the interface class, but only after both functions are written, so we know what to protect (probably only the `vk*` calls using the pool as an argument, plus the allocators).
We can still run all constructors, initializers and destructors of the allocated storage without a mutex.
Note that this is only limited thread-safety, not a full-on "provide all external synchronization guarantees for every method as required by the Vulkan spec".
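A minimal sketch of that limited protection, assuming hypothetical helpers for the two phases (the names `destroyAllDescriptorRefs` and `freeSetOffsets` are made up):

```cpp
// Only the shared, pool-level state gets the lock; per-set construction and
// destruction of the refcounted storage can stay lock-free.
void IDescriptorPool::deleteSetStorage(IGPUDescriptorSet* set)
{
    set->destroyAllDescriptorRefs(); // hypothetical: runs the smart-pointer destructors, no lock needed

    std::lock_guard<std::mutex> lock(m_mutex);
    freeSetOffsets(set); // hypothetical: returns the set's offsets to the per-type allocators
    // any vkFreeDescriptorSets-style call using the pool would also go under this lock
}
```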
P.S. Also, why doesn't your `CVulkanCommandBuffer` call `vkFreeCommandBuffers`!?