Bindless support
Bindless is quite messy in every API, so Tempest needs a clean top-level API with a reasonable underlying implementation.
GLSL
GLSL is the main shading language in Tempest, so it gets a dedicated section. GLSL offers 2 ways to go bindless:
- Unbound array of descriptors: nice and easy to use.
- Device address: not portable to Metal; hazards are hard to track.
layout(binding = 0) uniform sampler2D tex[]; // unbound array of textures
layout(binding = 2) uniform sampler2D img[]; // another unbound array of textures (binding must not collide with ssbo)
layout(binding = 1, std140) readonly buffer Input {
vec4 val[];
} ssbo[]; // unbound array of buffers
Engine-side
std::vector<const Tempest::Texture2d*> ptex(tex.size());
for(size_t i=0; i<tex.size(); ++i)
ptex[i] = &tex[i];
auto desc = device.descriptors(pso);
desc.set(0,ptex); // taking vector or c-array
This does not fit the engine perfectly: support for separate (non-combined) samplers and textures has to be added on top of it.
Vulkan
Caps-list:
VkPhysicalDeviceDescriptorIndexingFeatures::runtimeDescriptorArray; // support for unbound array declaration (tex[])
// Support of nonuniformEXT, per resource-type
VkPhysicalDeviceDescriptorIndexingFeatures::shaderUniformBufferArrayNonUniformIndexing;
VkPhysicalDeviceDescriptorIndexingFeatures::shaderSampledImageArrayNonUniformIndexing;
VkPhysicalDeviceDescriptorIndexingFeatures::shaderStorageBufferArrayNonUniformIndexing;
VkPhysicalDeviceDescriptorIndexingFeatures::shaderStorageImageArrayNonUniformIndexing;
VK_DESCRIPTOR_BINDING_VARIABLE_DESCRIPTOR_COUNT_BIT can be used (in theory), but only for the very last binding in a descriptor set, which does not fit the GLSL side.
Alternatively, it is sufficient to use VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT_EXT with a very large descriptor array. The size of the array has to be defined in C++ upfront, at VkDescriptorSetLayout creation.
The current implementation of Tempest can recreate VkDescriptorSetLayout and VkDescriptorSet on the fly if the preallocated array is not big enough. However, that also requires recreating VkPipeline at runtime, based on the descriptor set size, which is hard to implement without extra performance cost.
VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT is useless by itself, but the spec defines special behavior for this type of descriptor:
... layouts which may be much higher than the pre-existing limits. The old limits only count descriptors in non-updateAfterBind descriptor set layouts, and the new limits count descriptors in all descriptor set layouts in the pipeline layout.
maxUpdateAfterBindDescriptorsInAllPools = 500,000+ // Eh, probably can't do anything sensible about it
maxPerStageUpdateAfterBindResources = 500,000+
maxPerStageDescriptorUpdateAfterBindSamplers = 500,000+
maxPerStageDescriptorUpdateAfterBindUniformBuffers = 12+
maxPerStageDescriptorUpdateAfterBindStorageBuffers = 500,000+
maxPerStageDescriptorUpdateAfterBindSampledImages = 500,000+
maxPerStageDescriptorUpdateAfterBindStorageImages = 500,000+
maxPerStageDescriptorUpdateAfterBindAccelerationStructures = 500,000+
maxDescriptorSetUpdateAfterBindSamplers = 500,000+
maxDescriptorSetUpdateAfterBindUniformBuffers = 72+ // n × PerStage
maxDescriptorSetUpdateAfterBindStorageBuffers = 500,000+
maxDescriptorSetUpdateAfterBindSampledImages = 500,000+
maxDescriptorSetUpdateAfterBindStorageImages = 500,000+
maxDescriptorSetUpdateAfterBindAccelerationStructures = 500,000+
Since there is only a single descriptor set, it is enough to take the min of the PerStage and DescriptorSet limits.
Other limits to consider (obsolete):
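The min-of-limits rule can be sketched as below. This is only an illustration: `UpdateAfterBindLimits` and `BindlessLimits` are hypothetical plain structs (the field names mirror a subset of `VkPhysicalDeviceDescriptorIndexingProperties`, so the sketch compiles without Vulkan headers), and `effectiveLimits` is not an existing Tempest function.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical plain struct; field names mirror a subset of
// VkPhysicalDeviceDescriptorIndexingProperties, so no Vulkan headers are needed.
struct UpdateAfterBindLimits {
  uint32_t maxPerStageDescriptorUpdateAfterBindSamplers       = 0;
  uint32_t maxPerStageDescriptorUpdateAfterBindStorageBuffers = 0;
  uint32_t maxPerStageDescriptorUpdateAfterBindSampledImages  = 0;
  uint32_t maxPerStageDescriptorUpdateAfterBindStorageImages  = 0;
  uint32_t maxDescriptorSetUpdateAfterBindSamplers            = 0;
  uint32_t maxDescriptorSetUpdateAfterBindStorageBuffers      = 0;
  uint32_t maxDescriptorSetUpdateAfterBindSampledImages       = 0;
  uint32_t maxDescriptorSetUpdateAfterBindStorageImages       = 0;
};

// Hypothetical result: how many descriptors of each type a single set may hold.
struct BindlessLimits {
  uint32_t samplers       = 0;
  uint32_t storageBuffers = 0;
  uint32_t sampledImages  = 0;
  uint32_t storageImages  = 0;
};

// With one descriptor set, every descriptor counts against both the per-stage
// and the per-set limit, so the usable size is the smaller of the two.
inline BindlessLimits effectiveLimits(const UpdateAfterBindLimits& l) {
  BindlessLimits r;
  r.samplers       = std::min(l.maxPerStageDescriptorUpdateAfterBindSamplers,
                              l.maxDescriptorSetUpdateAfterBindSamplers);
  r.storageBuffers = std::min(l.maxPerStageDescriptorUpdateAfterBindStorageBuffers,
                              l.maxDescriptorSetUpdateAfterBindStorageBuffers);
  r.sampledImages  = std::min(l.maxPerStageDescriptorUpdateAfterBindSampledImages,
                              l.maxDescriptorSetUpdateAfterBindSampledImages);
  r.storageImages  = std::min(l.maxPerStageDescriptorUpdateAfterBindStorageImages,
                              l.maxDescriptorSetUpdateAfterBindStorageImages);
  return r;
}
```

In a real backend the input would presumably be filled from the properties queried from the physical device; that wiring is left out here.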
VkPhysicalDeviceLimits::maxPerStageDescriptorSamplers = 16+;
VkPhysicalDeviceLimits::maxPerStageDescriptorUniformBuffers = 12+;
VkPhysicalDeviceLimits::maxPerStageDescriptorStorageBuffers = 4+;
VkPhysicalDeviceLimits::maxPerStageDescriptorSampledImages = 16+;
VkPhysicalDeviceLimits::maxPerStageDescriptorStorageImages = 4+;
VkPhysicalDeviceLimits::maxPerStageResources = 128+;
VkPhysicalDeviceLimits::maxDescriptorSetSamplers = 96+; // n × PerStage
VkPhysicalDeviceLimits::maxDescriptorSetUniformBuffers = 72+; // n × PerStage
VkPhysicalDeviceLimits::maxDescriptorSetStorageBuffers = 24+; // n × PerStage
VkPhysicalDeviceLimits::maxDescriptorSetSampledImages = 96+; // n × PerStage
VkPhysicalDeviceLimits::maxDescriptorSetStorageImages = 24+; // n × PerStage
With such limits, reallocation has to manage the per-stage, per-resource-type and per-set limits somehow.
DirectX12
Note: Tempest uses spirv-cross to generate HLSL, but the produced HLSL is not valid:
// error: more than one unbounded resource (ssbo and tex) in space 0
ByteAddressBuffer ssbo[] : register(t1, space0);
Texture2D<float4> tex[] : register(t0, space0);
SamplerState _tex_sampler[] : register(s0, space0);
RWTexture2D<unorm float4> ret : register(u2, space0);
Apparently spirv-cross follows the VARIABLE_DESCRIPTOR_COUNT workflow. This maps directly to D3D12_DESCRIPTOR_RANGE::NumDescriptors = -1, with the same limitation of only one runtime array per set. In theory, this can be worked around by instrumenting the SPIR-V:
OpDecorate %tex DescriptorSet 0 -> OpDecorate %tex DescriptorSet UNIQ_SPACE
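A minimal sketch of the planning step behind that rewrite, assuming the runtime-array variables have already been collected from the SPIR-V module. The `Binding` struct and `assignUniqueSpaces` are hypothetical names, not part of spirv-cross or Tempest; the actual patching of the `OpDecorate` instructions is not shown.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical record for one resource variable found in the SPIR-V module.
struct Binding {
  uint32_t id;            // SPIR-V result id of the variable, e.g. %tex
  uint32_t set;           // value of its DescriptorSet decoration
  bool     runtimeArray;  // true if the variable is an OpTypeRuntimeArray
};

// Gives every runtime array its own, previously unused descriptor set index,
// so the generated HLSL ends up with at most one unbounded resource per space.
inline void assignUniqueSpaces(std::vector<Binding>& bindings) {
  uint32_t next = 0;
  for(const Binding& b : bindings)
    next = std::max(next, b.set + 1); // first set index not used by the shader
  for(Binding& b : bindings)
    if(b.runtimeArray)
      b.set = next++;                 // the UNIQ_SPACE for the rewritten OpDecorate
}
```

The engine side would then have to mirror the same remapping when it builds the root signature, otherwise the spaces in the shader and in the root signature would no longer match.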
Limits:
| Resources Available to the Pipeline | Tier 1 | Tier 2 | Tier 3 |
|---|---|---|---|
| Feature levels | 11.0+ | 11.0+ | 11.1+ |
| Maximum number of descriptors in a CBV/SRV/UAV heap used for rendering | 1,000,000 | 1,000,000 | 1,000,000+ |
| Maximum number of CBV in all descriptor tables per shader stage | 14 | 14 | full heap |
| Maximum number of SRV in all descriptor tables per shader stage | 128 | full heap | full heap |
| Maximum number of UAV in all descriptor tables per shader stage | 64 for feature levels 11.1+, 8 for feature level 11 | 64 | full heap |
| Maximum number of Samplers in all descriptor tables per shader stage | 16 | 2048 | 2048 |
ID3D12GraphicsCommandList::SetDescriptorHeaps
Only one descriptor heap of each type can be set at one time, which means a maximum of 2 heaps (one sampler, one CBV/SRV/UAV) can be set at one time.
DX12 is a bit awkward, because the limit is shared by all descriptor types except samplers. Probably the heap can "just" be split into equal partitions.
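Splitting into equal partitions could look roughly like this. `HeapRange` and `heapPartition` are hypothetical helpers for illustration only; they compute offsets inside one CBV/SRV/UAV heap and do not call any D3D12 API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of one partition inside a CBV/SRV/UAV heap.
struct HeapRange {
  uint32_t offset; // index of the first descriptor of the partition
  uint32_t count;  // number of descriptors in the partition
};

// Splits a heap of heapSize descriptors into `parts` equal partitions and
// returns partition number `index`. Any remainder at the end stays unused.
inline HeapRange heapPartition(uint32_t heapSize, uint32_t parts, uint32_t index) {
  assert(parts > 0 && index < parts);
  const uint32_t count = heapSize / parts;
  return HeapRange{index * count, count};
}
```

For example, a heap of 1,000,000 descriptors split in 2 gives one partition at offset 0 and one at offset 500,000, each 500,000 descriptors long.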
Metal [3]
Limits (per-app resources available at any given time):
| Resources Available to the Pipeline | Tier 1 (iOS) | Tier 1 | Tier 2 |
|---|---|---|---|
| Buffers (and TLASes) | 31 | 64 | 500,000 |
| Textures | 31 | 128 | 500,000 |
| Samplers | 16 | 16 | 2048 |
For both tiers, the maximum number of argument buffer entries in each function argument table is 8.
*Writable textures aren’t supported within an argument buffer. Tier 1 argument buffers can’t be accessed through pointer indexing, nor can they include pointers to other argument buffers. Tier 2 argument buffers can be accessed through pointer indexing, as shown in the following example.
T1 argument buffers are practically the same as descriptor sets in Vulkan and offer nothing useful here. T2 allows pointer indexing and can be leveraged for a bindless array.
Sources:
- https://gist.github.com/DethRaid/0171f3cfcce51950ee4ef96c64f59617
- https://docs.microsoft.com/en-us/windows/win32/api/d3d12/ns-d3d12-d3d12_descriptor_range
- https://learn.microsoft.com/en-us/windows/win32/direct3d12/hardware-support?redirectedfrom=MSDN
- https://developer.apple.com/documentation/metal/buffers/about_argument_buffers
- https://developer.apple.com/documentation/metal/buffers/managing_groups_of_resources_with_argument_buffers
GLSL
An unbound array of descriptors has 2 meanings:
Base spec:
uniform sampler2D tex[] -> OpTypeArray %8 %uint_1
The size of the array depends on the highest index that is used in the code.
GL_EXT_nonuniform_qualifier:
May work the same as the base spec if no runtime index is used; otherwise:
uniform sampler2D tex[] -> OpTypeRuntimeArray %8 // legal only if the driver supports descriptor indexing
Engine side
[wip] Generally, a Metal-like model is a good middle ground:
maxUAV = 500'000; // ssbo + tlas + imageStore
maxTextures = 500'000;
maxSamplers = 2048;
// can skip maxUbo - hard in Vulkan and not very useful
// combined image consumes both Texture and Samplers limits
In DX, the UAV/texture split can be achieved by splitting the heap into 2 parts. In Vulkan, the UAV limit is probably the min over all applicable resource types.
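One possible way to derive such a budget per backend is sketched below. Everything here is an assumption made for illustration: `BindlessBudget`, `VulkanLimits`, `budgetFromVulkan` and `budgetFromDx12` are hypothetical names, and clamping the device values to the Metal-like targets is a design guess rather than something the notes above settle.

```cpp
#include <algorithm>
#include <cstdint>

// Metal-like budget from the notes above; defaults are the target values.
struct BindlessBudget {
  uint32_t maxUAV      = 500000; // ssbo + tlas + imageStore
  uint32_t maxTextures = 500000;
  uint32_t maxSamplers = 2048;
};

// Hypothetical Vulkan input: one already-reduced limit per resource type.
struct VulkanLimits {
  uint32_t storageBuffers;
  uint32_t storageImages;
  uint32_t accelerationStructures;
  uint32_t sampledImages;
  uint32_t samplers;
};

// Vulkan: UAV is the min over every resource type that counts as a UAV.
inline BindlessBudget budgetFromVulkan(const VulkanLimits& v) {
  const BindlessBudget target{};
  BindlessBudget b;
  b.maxUAV      = std::min({target.maxUAV, v.storageBuffers,
                            v.storageImages, v.accelerationStructures});
  b.maxTextures = std::min(target.maxTextures, v.sampledImages);
  b.maxSamplers = std::min(target.maxSamplers, v.samplers);
  return b;
}

// DX12: one CBV/SRV/UAV heap split into 2 halves; samplers use their own heap.
inline BindlessBudget budgetFromDx12(uint32_t resourceHeapSize, uint32_t samplerHeapSize) {
  const BindlessBudget target{};
  BindlessBudget b;
  b.maxUAV      = std::min(target.maxUAV,      resourceHeapSize / 2);
  b.maxTextures = std::min(target.maxTextures, resourceHeapSize / 2);
  b.maxSamplers = std::min(target.maxSamplers, samplerHeapSize);
  return b;
}
```

Note that a combined image sampler would still have to be counted against both `maxTextures` and `maxSamplers` by whatever code consumes this budget.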
~~TODO~~, for DX12:
- handle the case when only a sampler is in the descriptor set (pDescriptorHeaps[0]==nullptr)
error: number of textures with read_write access exceeds maximum supported (8)
This limit is apparently undocumented. MoltenVK allows 500k if argument buffers tier 2 is supported (why?), and 8 otherwise.
New Mac/iOS feature to track residency of resources: https://developer.apple.com/documentation/metal/resource_fundamentals/simplifying_gpu_resource_management_with_residency_sets?language=objc
According to Apple:
You don’t need to call the following methods for any allocation in a residency set that you associate with the command buffer: useResource, useHeap