
Renderers that can assume modern GPU hardware, and especially those doing ray tracing, may want to use “bindless” idioms. Slang can/should simplify this task with a carefully designed language feature.

What is Bindless?

The key to any bindless approach is that there exists some kind of handle type that can in principle refer to (more or less) any texture/buffer/resource at the use site. This allows references to resources to be encoded as “plain old data,” which can improve the simplicity and flexibility of engine code for passing data to the GPU.

OpenGL has long supported bindless directly via extensions, with the handles for typed resources being opaque 64-bit values, and the handles for flat buffers being GPU virtual addresses. This amounts to deciding by fiat that the Texture2D type is 64 bits of plain old data, and subsequently allowing it to be packed in buffers however users like. This strictly increases the flexibility of GLSL by allowing types that used to be illegal (e.g., a struct with a texture in it).

What’s the problem?

In contrast to OpenGL with extensions, achieving a “bindless” approach on D3D12 or Vulkan requires a lot more explicit work on the user’s part, and tends to have a large impact on shader code. A typical D3D12 approach would be to maintain a single large descriptor heap for resource views, and to represent a “handle” to a view as an index (potentially just 32 bits) into that heap.

A D3D12 application then needs to bind a monolithic descriptor table that spans the entire heap, and declare various unbounded-size arrays in shader code (e.g., Texture2D gAllMyTextures[];, RWStructuredBuffer<Foo> gAllMyRWStructuredBuffersOfFoo[];, etc.) that will be backed by that table. The existence of parameterized types like RWStructuredBuffer<_> means that the number of such declarations required is unbounded and depends on the shader code that gets authored. Any shader code will need to explicitly use, e.g., uint instead of Texture2D in data structures, and then manually indirect through the appropriate global array when it needs to get from the handle to the resource it references.
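
As a rough illustration of the manual pattern described above, hand-written bindless code tends to look something like the following. This is only a sketch: the names gSampler, MaterialParams, gMaterial and shadeMaterial, and all of the register/space assignments, are placeholders rather than anything this proposal prescribes.

```slang
// Unbounded array backed by the application's whole-heap descriptor table.
// The register and space values are arbitrary placeholders.
Texture2D gAllMyTextures[] : register(t0, space1);
SamplerState gSampler : register(s0);

// The data structure has to store an index, not a Texture2D.
struct MaterialParams
{
    uint   albedoTexture; // handle: index into gAllMyTextures
    float4 tint;
};
ConstantBuffer<MaterialParams> gMaterial : register(b0);

float4 shadeMaterial(float2 uv)
{
    // Manual indirection from the handle to the resource at the use site.
    Texture2D albedo = gAllMyTextures[gMaterial.albedoTexture];
    return albedo.Sample(gSampler, uv) * gMaterial.tint;
}
```

Every additional resource type used this way needs its own global array and its own indirection code, which is the boilerplate the rest of this proposal tries to remove.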

(The above ignores a bunch of API-specific details which make it harder to write code that works seamlessly across D3D12 and Vulkan)

All of this complexity makes it hard to write a re-usable module that might include data types mixing resources and uniform data, if we want the module to work under both traditional binding models and application-emulated bindless. Basic questions like “should this struct field be a uint or a Texture2D?” can’t be answered without knowing how your module will be used.

Ray Tracing Requirements

Feature-complete cross compilation of HLSL/Slang ray tracing shaders for Vulkan requires some level of compiler support for bindless, making it desirable to expose the implementation mechanisms in a more broadly useful fashion.

The DirectX Ray Tracing (DXR) interface includes a notion of a “local root signature” that allows shader parameters (including resources and parameter blocks) to be bound for use by a specific shader table entry. The matching feature in the Vulkan ray tracing extension (the "shader record") only supports plain old data, so that a faithful translation of HLSL/Slang code would need to replace resources in the local root signature with bindless handles in the output SPIR-V.

Proposal

We propose to add a new built-in type constructor Bindless<T> that can wrap any type T. The only operation Bindless<T> will support is (implicit) dereference to get a value of type T (similar to how ConstantBuffer<T> and ParameterBlock<T> work).

No matter what T is, Bindless<T> will be plain old data. The exact translation may depend on the target, but for D3D12 and Vulkan:

Bindless<Texture2D> will translate to uint, and similarly for any resource types
Bindless<float4> will translate to float4, and similarly for any type that is already plain old data.
- As a corollary of the above, Bindless<Bindless<X>> will translate the same as Bindless<X>.
Bindless<X[N]> will translate the same as Bindless<X>[N].
Bindless<S> where S is a struct type that transitively contains resource types will translate to a new struct S_B where for each field F f; in S there is a field Bindless<F> f; in S_B.
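
To make the struct rule concrete, here is a sketch of the lowering for one made-up type. The names Material and Material_B are assumptions for illustration, and the second struct is what the compiler would generate on D3D12/Vulkan, not something a user would write:

```slang
// User-declared type mixing resources and uniform data.
struct Material
{
    Texture2D albedo;      // resource type
    Texture2D normals[4];  // array of a resource type
    float4    tint;        // already plain old data
};

// Approximate translation of Bindless<Material>:
struct Material_B
{
    uint   albedo;      // Bindless<Texture2D>    -> uint
    uint   normals[4];  // Bindless<Texture2D[4]> -> Bindless<Texture2D>[4] -> uint[4]
    float4 tint;        // Bindless<float4>       -> float4
};
```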

The compiler can then synthesize a function to translate a Bindless<X> into an X for any type X. This operation can be defined inductively based on X, with the only interesting case being for resource types like Texture2D where the compiler should synthesize (on demand) a global array of the given resource type to be used for indexing operation.
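
A minimal sketch of what that synthesized code could look like for one struct is shown below. All of the names here (Params, Params_B, _bindlessTexture2D, unwrapTexture2D, unwrapParams) are hypothetical; the proposal does not fix a naming scheme, and the real operation would be generated by the compiler rather than written by hand.

```slang
struct Params   { Texture2D t; float4 v; };  // user type X
struct Params_B { uint      t; float4 v; };  // lowered Bindless<X>

// Global array synthesized on demand, because Texture2D is unwrapped somewhere.
Texture2D _bindlessTexture2D[];

// Base case: a resource type is fetched by indexing its global array.
Texture2D unwrapTexture2D(uint handle)
{
    return _bindlessTexture2D[handle];
}

// Inductive case: a struct is unwrapped field by field.
Params unwrapParams(Params_B b)
{
    Params p;
    p.t = unwrapTexture2D(b.t); // resource field: indirect through the array
    p.v = b.v;                  // plain-old-data field: copied unchanged
    return p;
}
```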

Interaction with Ray Tracing Cross-Compilation

It is probably obvious by this point, but when translating a DXR ray-tracing shader that puts data of type X in its local root signature, the equivalent Vulkan shader record should contain data of type Bindless<X>. Locations in the code that reference parameters in the local root signature should instead apply the Bindless<T>-to-T translation to the matching field in the shader record.

The behavior of the cross-compilation should be documented in terms of this translation, so that it doesn't come across as special-case magic.

Example Usage

With this Bindless<T> functionality in place, a user can easily opt in to using bindless for any module/feature as a late binding decision. For example, given a module with code like:

// MyFeature.slang
struct MyFeatureParams { Texture2D t; float4 v; ... }
float4 computeStuff( MyFeatureParams p, float2 uv, ... ) { ... }

A shader entry point can do either of the following with equal ease:

import MyFeature;

// Traditional "bind-full" usage
ParameterBlock<MyFeatureParams> gParams;
float4 main(...) { return computeStuff(gParams, ...); }

// Alternative bindless usage:
ConstantBuffer<Bindless<MyFeatureParams>> gParams;
float4 main(...) { return computeStuff(gParams, ...); }

Challenge: reflection and binding for the big arrays

The main challenge that emulated "bindless" on top of current D3D/VK creates is that there is still a lot of binding going on for the big arrays of resources. Whatever code Slang generates needs to be able to mesh with application-side policy for how it wants to set up and bind those big arrays (which will tend to vary between D3D12 and Vulkan).

In the D3D12 case, each of the implicitly-synthesized unbounded arrays needs to have a distinct space allocated for it. The application will need to map their whole-heap descriptor table(s) (potentially one for CBVs/SRVs/UAVs and another for samplers) to those spaces, but the total number of spaces that need to be covered can, in general, depend on how many resource types the user code accesses.

In the Vulkan case, all of the unbounded arrays can belong to a single set, and arrays of compatible resource types (those that map to the same VkDescriptorType) can share the same binding. Ideally we should ensure that any arrays for a specific descriptor type use a fixed binding (and ideally this should match the VkDescriptorType enumerant value in the case of core Vulkan descriptor types). Unfortunately, once extensions are in the mix it is hard to guarantee a fixed and stable mapping of descriptor type to binding, so this becomes something the application would need to query via reflection.
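
The following sketch shows one possible outcome of that allocation. The array names and the starting space of 100 are assumptions, and the Vulkan bindings in the comments simply follow the "binding equals VkDescriptorType value" idea for core descriptor types described above.

```slang
// D3D12: each synthesized unbounded array gets its own register space.
// Vulkan (shown in comments): one set, with the binding chosen per descriptor type.
Texture2D           _bindlessTexture2D[]   : register(t0, space100); // VK binding 2 (SAMPLED_IMAGE)
Texture3D           _bindlessTexture3D[]   : register(t0, space101); // VK binding 2 (SAMPLED_IMAGE, shared)
RWTexture2D<float4> _bindlessRWTexture2D[] : register(u0, space102); // VK binding 3 (STORAGE_IMAGE)
SamplerState        _bindlessSampler[]     : register(s0, space103); // VK binding 0 (SAMPLER)
```

Note that the number of spaces consumed on D3D12 grows with the number of distinct resource types the shader touches, while the Vulkan bindings stay fixed per descriptor type.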

In both cases, the application can't create the host-side API objects that match up to these unbounded arrays without some knowledge that results from shader compilation, and furthermore that information may depend on specialization decisions so that it would only be query-able on output kernels and not based solely on front-end compilation (where most of the reflection information comes from).

It is reasonable for the user to want to specify the set that bindless access should use in Vulkan, or the starting space to use in D3D12 (with the understanding that the compiler will reserve all spaces from there to infinity for bindless), and possibly also the mapping from descriptor types to bindings they would like to assume for Vulkan. All of this would create a lot of messy requirements at the command-line or API interface.

Alternatives and Open Issues

An alternative to the messiness of specifying this information via API would be to support an explicit notion of a BindlessHeap type in shader code, which can be bound to an explicit space/set, and such that translation from Bindless<T> to T is provided by a BindlessHeap rather than being an ambient capability. I suspect that this would complicate things over-much for typical renderers, but it is worth consideration.

We should try to be careful about how we define the "handle" type for Bindless<T>. Most near-term users will want 32-bit handles, because that is the most D3D12 descriptor heaps can support, but 64-bit handles seem like a foregone conclusion in the long run if we ever get API-supported bindless in D3D and Vulkan.

Once an engine moves to a bindless model, they will increasingly want to be able to load data in a more random-access fashion. It is an orthogonal feature to this proposal, but it would be nice to have support for a Load<T> operation on *ByteAddressBuffer that can load any plain-old-data type T from an arbitrary (but suitably aligned) address. Any Bindless<X> type should support this operation automatically (without users needing to write extra code).
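
For illustration, usage of such an operation might look like the sketch below. It is modeled on the templated Load<T> that DXC accepts on ByteAddressBuffer; the names Particle, gHeap and loadParticle are made up for this example.

```slang
// 16 bytes of plain old data.
struct Particle
{
    float3 position;
    float  radius;
};

ByteAddressBuffer gHeap;

Particle loadParticle(uint byteOffset)
{
    // byteOffset is assumed to be suitably aligned for Particle.
    return gHeap.Load<Particle>(byteOffset);
}
```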

A missing feature here is a notion equivalent to a "flat" pointer into GPU memory, which is what we'd ideally want, e.g., a Bindless<ByteAddressBuffer> to translate into. We could consider trying to emulate a pointer as a combination of a 32-bit buffer handle and a 32-bit offset, but that seems less than ideal in practice (because we'd have to deal with SRV-vs-UAV cases, etc.). It would be better to wait for direct API support for pointers, and then figure out how to cleanly pick the right translation based on declared API support in a target.
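
To show why the emulation is awkward, here is a rough sketch of the handle-plus-offset encoding. EmulatedPtr, _bindlessByteAddressBuffer and loadUint are hypothetical names, and this version only covers the read-only (SRV) case.

```slang
// Global array of read-only buffers, indexed by buffer handle.
ByteAddressBuffer _bindlessByteAddressBuffer[];

// A 64-bit "pointer" emulated as two 32-bit halves.
struct EmulatedPtr
{
    uint buffer; // handle: index into _bindlessByteAddressBuffer
    uint offset; // byte offset within that buffer (a multiple of 4)
};

uint loadUint(EmulatedPtr p)
{
    return _bindlessByteAddressBuffer[p.buffer].Load(p.offset);
}
```

A writable pointer would need a second array of RWByteAddressBuffer and some way to tell the two kinds of handle apart, which is the SRV-vs-UAV complication mentioned above.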

A major alternative to consider is to have bindless-ness be a global switch, rather than something explicitly and locally declared. A user might like to declare that they want to use bindless globally across all their shaders and have it Just Work even for existing entry points. The biggest impediment to this is that it opens up a lot of questions about how global shader parameters are to be communicated, and all of these amount to policy decisions that would require fine-grained agreement between the application and compiler.

Giving the user access to a single mechanism, rather than a sweeping global policy, allows the compiler to avoid as many policy choices as possible (though by no means all of them).

Jan 05 '19 20:01 tangent-vector

In the example given the Bindless transformation is enabled by wrapping the type passed as cb with the Bindless type transformation.

Would it be desirable/possible to make bindless or not an external decision communicated to the compiler via the API? Doing so would mean the shader code can be written once for use in both scenarios, so that a renderer can decide how it wants to communicate with the shader.

Jan 09 '19 14:01 jsmall-zzz

Would it be desirable/possible to make bindless or not an external decision communicated to the compiler via the API?

I mention this briefly (the second-to-last paragraph). I agree that it seems attractive since shaders can be written independent of the engine policy around bindless. The catch seems to be in determining what an engine wants when they decide to bindless-ify an ordinary shader. I'm going to do a stream of consciousness here just to see whether I run into a roadblock, or if this is actually easier than I assume.

The current Slang semantics are, more or less, to act as if all the global-scope shader parameters were actually declared in an implicit struct G, and then to compute layout for ParameterBlock<G>. Explicit bindings mess all of that up, but it is still an instructive mental model.

Note 1: this mental model Just Works even with nested ParameterBlocks because layout works in terms of recursively "flattening" out any parameter blocks or constant buffers that contain resources, so the result of implicit layout will be a first (optional) parameter block for any global declarations that didn't go into an explicit block, followed by any of the explicit blocks nested inside of it.

Note 2: in the case of D3D12 the compiler policy for layout does not interfere with the application's ability to customize the root signature for performance. They can change a descriptor table into a bunch of root descriptors, or a constant buffer into a bunch of root constants. This means that a bunch of performance-critical decisions around the shader parameter passing don't require the shader compiler to get involved (aside: there are fewer such degrees of freedom in Vulkan, so that the direct mapping of ParameterBlocks to descriptor sets is about as good as can be done).

The naive approach when a user says "bindless-ify my shader" would be to instead compute the layout as for ParameterBlock<Bindless<G>>, which would be more or less equivalent to ConstantBuffer<G>. The resulting layout would amount to a single descriptor table/set with a single constant buffer (followed by whatever data is required to feed the bindless machinery for the chosen API).

A D3D programmer would almost certainly want to map that layout to use root constants instead. A Vulkan programmer might like to do the same, but that would then require compiler involvement (Slang would need to emit the appropriate layout decoration). That means we already have one additional degree of freedom that the API would need to support: do you want Bindless<G> or RootConstant<Bindless<G>> (pretending that we also add the latter)?

One thing that this discussion is bringing up for me is that I left out a major detail when I described what Bindless<_> does.

It seems clear that Bindless<ConstantBuffer<T>> should map to a single bindless handle, which indicates the index of a ConstantBuffer<T>. There's a subtle (and annoying) question, though, of whether that should really be an index of a ConstantBuffer<Bindless<T>>, which of course only matters when T contains any resources.

So if we have:

struct X { float4 a; Texture2D b; }

We need to pick one of three options:

(1) We can decide that Bindless<ConstantBuffer<X>> amounts to two indices: one that indicates the index of the ConstantBuffer<X_stripped> that contains the a field, and one that indicates the index for the b texture. This keeps the in-memory constant buffer layout consistent with the non-bindless case, but makes it more complicated to fill in memory that refers to such a buffer.
(2) We can decide that Bindless<ConstantBuffer<X>> amounts to a single index for a ConstantBuffer<Bindless<X>>, which simplifies the logic for emitting a reference to such a buffer, but makes it impossible to fill in a constant buffer of an X value that can be used in both bindless and non-bindless contexts.

All of this applies directly to Bindless<ParameterBlock<X>>. We can't form a bindless index for a descriptor table/set because the APIs don't allow indexing over tables/sets, so it needs to translate in a way that more closely matches a constant buffer.

The parameter block case adds a third option, though:

(3) We can decide that Bindless<ParameterBlock<X>> translates to a single index for a constant buffer that contains the data from option (1): the index of a constant buffer holding the "stripped" uniform data for X plus the index for its b texture. The benefit of this choice is that it preserves compatibility with the non-bindless memory layout, while still keeping the result as a single index.
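
Written out as plain-old-data layouts, the three options for the X above look roughly like this. Every struct name here is invented for the comparison; only the field layouts matter.

```slang
// Option (1): the handle is two indices; the buffer keeps its non-bindless layout.
struct X_stripped { float4 a; };           // contents of the constant buffer
struct Handle_Option1
{
    uint uniformData; // index of a ConstantBuffer<X_stripped>
    uint b;           // index of the b texture
};

// Option (2): the handle is one index; the buffer itself holds Bindless<X>.
struct X_B { float4 a; uint b; };          // contents of the constant buffer
struct Handle_Option2
{
    uint buffer;      // index of a ConstantBuffer<X_B>
};

// Option (3): the handle is one index of a small buffer that holds
// the two indices from option (1).
struct Handle_Option3
{
    uint indices;     // index of a ConstantBuffer<Handle_Option1>
};
```

Option (3) costs an extra dependent load compared with option (2), which is the price of keeping the X_stripped buffer usable in non-bindless code.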

Option (2) seems to be the place we obviously want to go in the long run, so maybe it is best to just deal with whatever compatibility hurdles it creates in the interim. And if we are doing option (2) for Bindless<ParameterBlock<X>> then we probably need to do it for Bindless<ConstantBuffer<X>> as well.

Still, I've just described a policy decision that an application might want to override...

Coming back to the original point: if we mechanically apply the chosen rules to compute Bindless<G> for the global scope, will that always produce what the user wants, or are there variations on the layout that I'm not considering?

(One big policy thing that it will be hard for Slang to help with is that in many cases a user switching to bindless might prefer to work with handles that are offsets into particular known *ByteAddressBuffers instead of indices of whole buffers, so that they don't have to bother creating whole views/descriptors for small or transient allocations. This amounts to encoding pointers as integers and having a monolithic global buffer to represent "the heap," but that is definitely a case that feels like too much policy creeping into the language, and I might prefer to hold off until the APIs grow up to have direct pointer support rather than emulate it.)

Jan 09 '19 18:01 tangent-vector

Hi, I hope you do not find my asking inappropriate.

This issue was last updated in 2019. Is bindless support as written above still a planned addition? Bindless support is one of the few remaining features that I would consider crucial to my usage.

Oct 03 '21 11:10 miguel-petersen

+1 to this issue. Lack of support for either builtin or manual bindless idioms is a blocker for me using slang.

Take for example a simple bindless buffer heap, and a nice user-space generic buffer. This is something I'd expect to work, and is something I can't do in hlsl due to the lack of [] overloading for assignment.

// test.slang

// Global "descriptor heap" code

[[vk::binding(0, 1)]] RWByteAddressBuffer bufs[];

struct Buf<T> {
    uint handle;

    __subscript(int i) -> T
    {
        get { return bufs[handle].Load<T>(__sizeOf<T>() * i); }
        set { bufs[handle].Store<T>(__sizeOf<T>() * i, newValue); }
    }
}

// User space shader code

struct MyData {
    Buf<float> buffer0;
    Buf<float> buffer1;
    Buf<float> result;
}
[[vk::push_constant]] ConstantBuffer<MyData> pc;

[shader("compute")]
[numthreads(1, 1, 1)]
void computeMain(int3 threadId: SV_DispatchThreadID) {
    int index = threadId.x;
    pc.result[index] = pc.buffer0[index] + pc.buffer1[index];
}

Compiling this produces the two errors shown below. The extension error is likely resolvable on my end (and is a bit of a glslang quirk I've encountered elsewhere), but the type conversion error is not.

slangc.exe -target spirv -entry computeMain -o test.spv test.slang
glslang:  test.slang(22): error :  'variable index' : required extension not requested: GL_EXT_nonuniform_qualifier
glslang:  test.slang(22): error :  '=' :  cannot convert from 'layout( binding=0 row_major std430) temp block{layout( row_major std430 offset=0) buffer unsized 1-element array of highp float _data}' to ' temp highp float'
glslang:  test.slang(22): error :  '' : compilation terminated

As for textures, slang's deterministic bindings are a bit of a non-feature for pure-bindless engines since typically I'll have a single descriptor set binding that will need to alias different kinds of textures. Putting aside the details of writing a generic image type for now, let's just try the same thing with multiple aliased arrays of textures, which works fine in hlsl (and makes sense generally, because Vulkan descriptor set arrays allow you to have many storage images of differing dimensions, datatype, etc. in the same descriptor set binding):

// test.slang

// Global "descriptor heap" code

[[vk::binding(0, 0)]] RWTexture2D<float4> img4s[];
[[vk::binding(0, 0)]] RWTexture2D<float2> img2s[];
[[vk::binding(0, 0)]] RWTexture2D<float> img1s[];

struct Img {
    uint handle;
    __subscript(int2 i) -> float4
    {
        get { return img4s[handle][i]; }
        set { img4s[handle][i] = newValue; }
    }
}

// User space shader code

struct MyData {
    Img input;
    Img output;
}

[[vk::push_constant]] ConstantBuffer<MyData> pc;

[shader("compute")]
[numthreads(8, 8, 1)]
void computeMain(int2 threadId: SV_DispatchThreadID) {
    pc.output[threadId] = pc.input[threadId];
}

Annoyingly, slang disallows these bindings to alias.

test.slang(6): warning 39001: explicit binding for parameter 'img2s' overlaps with parameter 'img4s'
[[vk::binding(0, 0)]] RWTexture2D<float2> img2s[];
                                          ^~~~~
// ...etc

If these two main issues were addressed I could use Slang for everything, since neither HLSL nor GLSL offer the ergonomics that this approach would enable.

Feb 08 '23 06:02 cshenton

As a related note, some sort of support for VK_KHR_buffer_device_address like DXC has would solve about half of these problems (though I personally prefer a descriptor heap of buffers, since that lets me use smaller 32bit handles in my push_constants).

DXC has vk::RawBufferLoad<T>(address, align) and vk::RawBufferStore<T>(address, value, align) which ergonomically aren't great, but with slang's support for user-space overloading of assign [] could be used to build a lovely user-space buffer type.

Feb 08 '23 06:02 cshenton

Okay. So credit where it's due, it does seem possible to do bindless from slang, so long as the engine code is willing to bend over backwards a bit.

tl;dr: you have to have one descriptor set per distinct texture dimension + type, to overcome slang's aliasing complaints, and you need to pool your buffers, to overcome slang's buffer array bugs and lack of support for device address. If you want more than ~2GB of buffer address space you'll need a handful of hardcoded buffer bindings, but you likely wanted that for multiple GPU heaps anyway.

Here's some example code which demonstrates the general idea. Explicit vk::binding must be used as slang generates incorrect bindings by default (it tries to make multiple variable length descriptor arrays into a single descriptor set, which is not allowed).

// Descriptor heap code, probably imported

[[vk::binding(0, 0)]] RWTexture2D<float4> img4s[];
[[vk::binding(0, 1)]] RWTexture2D<float2> img2s[];
[[vk::binding(0, 2)]] RWByteAddressBuffer bufs;

struct Buf<T, let Stride: int> {
    int offset;

    __subscript(int i)->T
    {
        get { return bufs.Load<T>(offset + Stride * i); }
        set { bufs.Store<T>(offset + Stride * i, newValue); }
    }
}

struct Img2 {
    int handle;
    __subscript(int2 i)->float2
    {
        get { return img2s[handle][i]; }
        set { img2s[handle][i] = newValue; }
    }
}

struct Img4 {
    int handle;
    __subscript(int2 i) -> float4
    {
        get { return img4s[handle][i]; }
        set { img4s[handle][i] = newValue; }
    }
}

// User space shader code

struct MyData {
    Img2 img_input_a;
    Img2 img_input_b;
    Img4 img_output;
    Buf<float2, 8> buf_input_a;
    Buf<float2, 8> buf_input_b;
    Buf<float4, 16> buf_output;
}

[[vk::push_constant]] ConstantBuffer<MyData> pc;

[shader("compute")]
[numthreads(32, 32, 1)]
void imgMain(int2 ij: SV_DispatchThreadID) {
    pc.img_output[ij] = float4(pc.img_input_a[ij], pc.img_input_b[ij]);
}

[shader("compute")]
[numthreads(512, 1, 1)]
void bufMain(int i: SV_DispatchThreadID) {
    pc.buf_output[i] = float4(pc.buf_input_a[i], pc.buf_input_b[i]);
}

Since all this host complexity will be hidden from anyone writing userspace host / device code, I'm pretty happy with this. Will try and get this integrated and working tomorrow!

Feb 08 '23 08:02 cshenton

Thank you for following up on this. There are some nice low-hanging-fruit items in your example that are genuine bugs in the Slang compiler, so ideally we should work on getting those fixed (e.g., the way Slang auto-assigns bindings to global-scope arrays of unbounded size).

We are still interested in having a better solution to "bindless" at the language level. Unfortunately, all of the full-time contributors to Slang are busy with other priorities, so we haven't been able to get to implementation work for something in this space.

We would happily take contributions that help pave the way for a better bindless solution, or that add features like the vk::RawBufferLoad<>() operation that DXC supports. If anybody is interested in taking on those tasks, please let me know and I can try to help them get started.

Feb 08 '23 18:02 tangent-vector

For now I've got a solution which fulfills my needs, but if I hit another snag I may look at making a contribution. Language level support for bindless would be fantastic, and a big differentiator, as bindless shader code can get pretty ugly, which is annoying.

Feb 09 '23 00:02 cshenton

Thank you for following up on this. There are some nice low-hanging-fruit items in your example that are genuine bugs in the Slang compiler, so ideally we should work on getting those fixed (e.g., the way Slang auto-assigns bindings to global-scope arrays of unbounded size).

We are still interested in having a better solution to "bindless" at the language level. Unfortunately, all of the full-time contributors to Slang are busy with other priorities, so we haven't been able to get to implementation work for something in this space.

We would happily take contributions that help pave the way for a better bindless solution, or that add features like the vk::RawBufferLoad<>() operation that DXC supports. If anybody is interested in taking on those tasks, please let me know and I can try to help them get started.

Hey Tess!

I actually contributed the vk::RawBufferStore instruction to DXC. I might be interning at NVIDIA sometime soon. Perhaps I could make a push for a bindless syntax like this while I was there?

For Vulkan at least, I have a framework which makes extensive use of these features in DXC, so I could definitely do some local development there.

Aug 23 '23 17:08 natevm

Add a Bindless<T> type
