Tempest Bindless support

Bindless is quite messy in every API, so Tempest needs a clean top-level API with a reasonable underlying implementation.

GLSL

GLSL is the main shading language in Tempest, so it gets a dedicated section. GLSL offers 2 approaches:

Unbound array of descriptors - nice and easy to use
Device address - not portable to Metal; hard to track hazards

layout(binding = 0) uniform sampler2D tex[]; // unbound array of textures
layout(binding = 1) uniform sampler2D img[]; // another unbound array of textures
layout(binding = 2, std140) readonly buffer Input { // binding 1 is already taken by img[]
  vec4 val[];
  } ssbo[]; // unbound array of buffers

Engine-side

std::vector<const Tempest::Texture2d*> ptex(tex.size());
for(size_t i=0; i<tex.size(); ++i)
  ptex[i] = &tex[i];
auto desc = device.descriptors(pso);
desc.set(0,ptex); // taking vector or c-array

This doesn't fit the engine perfectly: support for separate (non-combined) samplers and textures has to be added on top of it.

Vulkan

Caps-list:

VkPhysicalDeviceDescriptorIndexingFeatures::runtimeDescriptorArray; // support for unbound array declaration (tex[])
// Support of nonuniformEXT, per resource-type 
VkPhysicalDeviceDescriptorIndexingFeatures::shaderUniformBufferArrayNonUniformIndexing;
VkPhysicalDeviceDescriptorIndexingFeatures::shaderSampledImageArrayNonUniformIndexing;
VkPhysicalDeviceDescriptorIndexingFeatures::shaderStorageBufferArrayNonUniformIndexing;
VkPhysicalDeviceDescriptorIndexingFeatures::shaderStorageImageArrayNonUniformIndexing;

VK_DESCRIPTOR_BINDING_VARIABLE_DESCRIPTOR_COUNT_BIT can be used in theory, but only for the very last binding in a descriptor set, which doesn't fit the GLSL side. Alternatively, it's sufficient to use VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT_EXT with a very large descriptor array. The size of the array has to be defined in C++ upfront, at VkDescriptorSetLayout creation. The current implementation of Tempest can recreate VkDescriptorSetLayout and VkDescriptorSet on the fly if the preallocated array is not big enough. However, this also requires recreating VkPipeline at runtime, based on the descriptor set size, which is hard to implement without extra performance cost.
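
The reallocation idea above can be sketched as a small capacity helper. This is only an illustration, not Tempest's actual code: `nextCapacity` and its parameters are hypothetical names. It assumes geometric growth, so that layout, set and pipeline recreation happens rarely, and it clamps the result to the device limit.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not part of Tempest): picks the array size to declare
// when the descriptor set layout has to be (re)created.
//   current   - array size the layout was created with
//   required  - number of descriptors the application wants to bind now
//   deviceMax - applicable device limit for this resource type
// Returns false if `required` cannot fit into `deviceMax`.
// Otherwise writes the chosen size to `out`; out != current means "recreate".
inline bool nextCapacity(uint32_t current, uint32_t required,
                         uint32_t deviceMax, uint32_t& out) {
  if(required > deviceMax)
    return false;
  if(required <= current) {
    out = current; // preallocated array is big enough, nothing to recreate
    return true;
  }
  uint64_t cap = std::max<uint64_t>(current, 1);
  while(cap < required)
    cap *= 2;      // grow by doubling to keep recreations rare
  out = uint32_t(std::min<uint64_t>(cap, deviceMax));
  return true;
}
```

Doubling trades some wasted descriptor slots for fewer pipeline rebuilds; with PARTIALLY_BOUND the unused tail of the array does not need valid descriptors.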

VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT is useless by itself, but the spec describes special behavior for this type of descriptor:

... layouts which may be much higher than the pre-existing limits. The old limits only count descriptors in non-updateAfterBind descriptor set layouts, and the new limits count descriptors in all descriptor set layouts in the pipeline layout.

maxUpdateAfterBindDescriptorsInAllPools = 500,000+ // Eh, probably can't do anything sensible about it
maxPerStageUpdateAfterBindResources   = 500,000+

maxPerStageDescriptorUpdateAfterBindSamplers = 500,000+
maxPerStageDescriptorUpdateAfterBindUniformBuffers = 12+
maxPerStageDescriptorUpdateAfterBindStorageBuffers = 500,000+
maxPerStageDescriptorUpdateAfterBindSampledImages = 500,000+
maxPerStageDescriptorUpdateAfterBindStorageImages = 500,000+
maxPerStageDescriptorUpdateAfterBindAccelerationStructures = 500,000+

maxDescriptorSetUpdateAfterBindSamplers = 500,000+
maxDescriptorSetUpdateAfterBindUniformBuffers = 72+ // n × PerStage
maxDescriptorSetUpdateAfterBindStorageBuffers = 500,000+
maxDescriptorSetUpdateAfterBindSampledImages = 500,000+
maxDescriptorSetUpdateAfterBindStorageImages = 500,000+
maxDescriptorSetUpdateAfterBindAccelerationStructures = 500,000+

Since there is only a single descriptor set, it's enough to take the min of the PerStage and DescriptorSet limits.
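
A minimal sketch of that rule follows. `TypeLimit` and `effectiveLimit` are hypothetical names; in real code the two values would be read from the update-after-bind limits listed above for one resource type.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical pair of limits for one resource type.
struct TypeLimit {
  uint32_t perStage; // e.g. maxPerStageDescriptorUpdateAfterBindSampledImages
  uint32_t perSet;   // e.g. maxDescriptorSetUpdateAfterBindSampledImages
};

// With a single descriptor set, the usable array size for a resource type
// is bounded by the smaller of the two limits.
inline uint32_t effectiveLimit(const TypeLimit& l) {
  return std::min(l.perStage, l.perSet);
}
```

For example, the uniform buffer minimums listed above (12 per stage, 72 per set) give an effective limit of 12.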

Other limits of concern (obsolete):

VkPhysicalDeviceLimits::maxPerStageDescriptorSamplers = 16+;
VkPhysicalDeviceLimits::maxPerStageDescriptorUniformBuffers = 12+;
VkPhysicalDeviceLimits::maxPerStageDescriptorStorageBuffers = 4+;
VkPhysicalDeviceLimits::maxPerStageDescriptorSampledImages = 16+;
VkPhysicalDeviceLimits::maxPerStageDescriptorStorageImages = 4+;
VkPhysicalDeviceLimits::maxPerStageResources = 128+;

VkPhysicalDeviceLimits::maxDescriptorSetSamplers = 96+;
VkPhysicalDeviceLimits::maxDescriptorSetUniformBuffers = 72+;
VkPhysicalDeviceLimits::maxDescriptorSetStorageBuffers = 24+;
VkPhysicalDeviceLimits::maxDescriptorSetSampledImages = 96+;
VkPhysicalDeviceLimits::maxDescriptorSetStorageImages = 24+;

With such limits, reallocation has to manage per-stage, per-resource and per-set limits somehow.

DirectX12

Note: Tempest uses spirv-cross to generate HLSL, but the HLSL it produces for this case is not valid:

// error: more than one unbounded resource (ssbo and tex) in space 0
ByteAddressBuffer         ssbo[]        : register(t1, space0);
Texture2D<float4>         tex[]         : register(t0, space0);
SamplerState             _tex_sampler[] : register(s0, space0);
RWTexture2D<unorm float4> ret           : register(u2, space0);

Apparently spirv-cross follows the VARIABLE_DESCRIPTOR_COUNT workflow. This maps directly to D3D12_DESCRIPTOR_RANGE::NumDescriptors = -1, with the same limitation of only one runtime array per set. In theory this can be worked around by instrumenting the SPIR-V: OpDecorate %tex DescriptorSet 0 -> OpDecorate %tex DescriptorSet UNIQ_SPACE
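
The instrumentation could look roughly like the sketch below. It is an assumption about how such a pass might be written, not existing Tempest or spirv-cross code. `remapDescriptorSets` is a hypothetical name. Finding the ids of the runtime-array variables and choosing each UNIQ_SPACE value is left to the caller, and the root signature would have to be built with the same spaces.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Rewrites `OpDecorate %id DescriptorSet N` in a SPIR-V word stream.
// spaceForId maps the result-id of a variable to its new descriptor set
// (which spirv-cross then emits as the HLSL register space).
// Returns the number of patched decorations.
inline size_t remapDescriptorSets(std::vector<uint32_t>& spirv,
                                  const std::unordered_map<uint32_t, uint32_t>& spaceForId) {
  const uint32_t OpDecorate              = 71;
  const uint32_t DecorationDescriptorSet = 34;
  const size_t   headerSize              = 5; // magic, version, generator, bound, schema

  size_t patched = 0;
  for(size_t i = headerSize; i < spirv.size(); ) {
    const uint32_t opcode    = spirv[i] & 0xFFFFu;
    const uint32_t wordCount = spirv[i] >> 16;
    if(wordCount == 0 || i + wordCount > spirv.size())
      break; // malformed stream, stop patching
    // OpDecorate layout: [header, target id, decoration, literal]
    if(opcode == OpDecorate && wordCount == 4 && spirv[i + 2] == DecorationDescriptorSet) {
      auto it = spaceForId.find(spirv[i + 1]);
      if(it != spaceForId.end()) {
        spirv[i + 3] = it->second;
        ++patched;
      }
    }
    i += wordCount;
  }
  return patched;
}
```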

Limits:

Resources Available to the Pipeline	Tier 1	Tier 2	Tier 3
Feature levels	11.0+	11.0+	11.1+
Maximum number of descriptors in a CBV/SRV/UAV heap used for rendering	1,000,000	1,000,000	1,000,000+
Maximum number of CBV in all descriptor tables per shader stage	14	14	full heap
Maximum number of SRV in all descriptor tables per shader stage	128	full heap	full heap
Maximum number of UAV in all descriptor tables per shader stage	64 for feature levels 11.1+; 8 for feature level 11.0	64	full heap
Maximum number of Samplers in all descriptor tables per shader stage	16	2048	2048

ID3D12GraphicsCommandList::SetDescriptorHeaps: only one descriptor heap of each type can be set at one time, which means a maximum of 2 heaps (one sampler, one CBV/SRV/UAV) can be set at one time. DX12 is a bit awkward, because the limit is shared between all descriptor types except samplers. Probably the heap can "just" be split into equal partitions.
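
A minimal sketch of the equal-partition idea, using plain integers instead of D3D12 handles. `HeapPartition` and `splitHeap` are hypothetical names, and the choice to leave any remainder descriptors unused is an assumption.

```cpp
#include <cstdint>
#include <vector>

// One contiguous slice of the shared CBV/SRV/UAV heap.
struct HeapPartition {
  uint32_t offset; // index of the first descriptor in the heap
  uint32_t count;  // number of descriptors in this slice
};

// Splits a heap of `heapSize` descriptors into `parts` equal slices.
// Descriptors left over after integer division stay unused at the end.
inline std::vector<HeapPartition> splitHeap(uint32_t heapSize, uint32_t parts) {
  std::vector<HeapPartition> ret;
  if(parts == 0)
    return ret;
  const uint32_t each = heapSize / parts;
  for(uint32_t i = 0; i < parts; ++i)
    ret.push_back(HeapPartition{i * each, each});
  return ret;
}
```

Splitting a 1,000,000-entry heap in two gives 500,000 descriptors per partition.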

Metal [3]

Limits (per-app resources available at any given time):

Resources Available to the Pipeline	Tier1(ios)	Tier1	Tier2
Buffers(and TLAS'es)	31	64	500,000
Textures	31	128	500,000
Samplers	16	16	2048

For both tiers, the maximum number of argument buffer entries in each function argument table is 8.

Writable textures aren’t supported within an argument buffer. Tier 1 argument buffers can’t be accessed through pointer indexing, nor can they include pointers to other argument buffers. Tier 2 argument buffers can be accessed through pointer indexing.

Tier 1 argument buffers are practically the same as descriptor sets in Vulkan and offer nothing useful here. Tier 2 allows pointer indexing and can be leveraged for a bindless array.

Sources:
https://gist.github.com/DethRaid/0171f3cfcce51950ee4ef96c64f59617
https://docs.microsoft.com/en-us/windows/win32/api/d3d12/ns-d3d12-d3d12_descriptor_range
https://learn.microsoft.com/en-us/windows/win32/direct3d12/hardware-support?redirectedfrom=MSDN
https://developer.apple.com/documentation/metal/buffers/about_argument_buffers
https://developer.apple.com/documentation/metal/buffers/managing_groups_of_resources_with_argument_buffers

GLSL

Unbound array of descriptors has 2 meanings:

Base spec: uniform sampler2D tex[] -> OpTypeArray %8 %uint_1; the size of the array depends on the highest index used in the code.

GL_EXT_nonuniform_qualifier: may work the same as the base spec if no runtime index is used; otherwise: uniform sampler2D tex[] -> OpTypeRuntimeArray %8 // legal only if driver supports descriptor-indexing

Engine side

[wip] Generally, a Metal-like model is a good middle ground:

maxUAV      = 500'000; // ssbo + tlas + imageStore
maxTextures = 500'000;
maxSamplers = 2048;
// can skip maxUbo - hard in vulkan and not very useful
// combined image consumes both Texture and Samplers limits

In DX, the UAV/Tex split can be achieved by splitting the heap into 2 parts. In Vulkan, the UAV limit is probably the min over all applicable resource types.
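
For the Vulkan side, the mapping onto the three engine-level limits might be derived as in the sketch below. All names (`VulkanTypeLimits`, `EngineLimits`, `deriveEngineLimits`) are hypothetical. The input values are assumed to be already reduced to what a single descriptor set can hold, and the caps are the Metal-like numbers listed above.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical per-type limits usable by a single descriptor set on Vulkan.
struct VulkanTypeLimits {
  uint32_t samplers;
  uint32_t sampledImages;
  uint32_t storageBuffers;
  uint32_t storageImages;
  uint32_t accelerationStructures; // assumption: UINT32_MAX if ray tracing is not used
};

struct EngineLimits {
  uint32_t maxUAV;      // ssbo + tlas + imageStore
  uint32_t maxTextures;
  uint32_t maxSamplers;
};

inline EngineLimits deriveEngineLimits(const VulkanTypeLimits& vk) {
  EngineLimits l;
  // UAV covers several Vulkan descriptor types, so the tightest one wins.
  l.maxUAV      = std::min({vk.storageBuffers, vk.storageImages,
                            vk.accelerationStructures, uint32_t(500000)});
  l.maxTextures = std::min(vk.sampledImages, uint32_t(500000));
  l.maxSamplers = std::min(vk.samplers, uint32_t(2048));
  return l;
}
```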

Jun 19 '22 12:06 Try

~~TODO~~, for DX12:

handle case when only sampler is in descriptor-set (pDescriptorHeaps[0]==nullptr)
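
One possible way to handle this case, assuming the intent is to avoid passing a null entry to SetDescriptorHeaps, is to compact the heap array first. The sketch keeps `Heap` as a template parameter standing in for ID3D12DescriptorHeap so that it does not depend on d3d12.h. `compactHeaps` is a hypothetical name.

```cpp
#include <cstddef>

// Copies the non-null entries of `in` to `out`, preserving their order,
// and returns how many were copied. `out` must have room for `count` entries.
template<class Heap>
inline size_t compactHeaps(Heap* const* in, size_t count, Heap** out) {
  size_t n = 0;
  for(size_t i = 0; i < count; ++i)
    if(in[i] != nullptr)
      out[n++] = in[i];
  return n;
}
```

The returned count and the compacted array would then be passed as NumDescriptorHeaps and ppDescriptorHeaps, with the call skipped entirely when the count is zero.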

Mar 28 '23 19:03 Try

error: number of textures with read_write access exceeds maximum supported (8)

This limit is apparently undocumented. MoltenVK allows 500k if argument buffers tier 2 is supported (why?) and 8 otherwise.

Apr 22 '24 19:04 Try

New Mac/iOS feature to track residency of resources: https://developer.apple.com/documentation/metal/resource_fundamentals/simplifying_gpu_resource_management_with_residency_sets?language=objc

According to Apple: You don’t need to call the following methods for any allocation in a residency set that you associate with the command buffer: useResource, useHeap

Jul 02 '24 10:07 Try