DirectXShaderCompiler
Compile time goes up when the "this" keyword is used as an argument to a global function.
Description When a global function is called with "this" from a member function, compilation is slower than expected and the generated binary is unusually large.
Steps to Reproduce When I compile the following HLSL without "USE_MEMBER", compilation takes much longer than with "USE_MEMBER". The generated DXIL binary is also larger than expected without "USE_MEMBER".
#ifndef LOOP_COUNT
#define LOOP_COUNT 1000
#endif

RWByteAddressBuffer buffer;

struct MyStruct;
void MyStruct_SetValue(inout MyStruct o, int i, float v);

struct MyStruct
{
    float values[LOOP_COUNT];

    void SetValue(int i, float v)
    {
        values[i] = v;
    }

    void Init(float v)
    {
        for (int i = 0; i < LOOP_COUNT; ++i)
        {
#ifdef USE_MEMBER
            this.SetValue(i, v);
#else
            MyStruct_SetValue(this, i, v);
#endif
        }
    }
};

void MyStruct_SetValue(inout MyStruct o, int i, float v)
{
    o.values[i] = v;
}

[numthreads(4, 1, 1)]
void MainCS(uint2 gid : SV_GroupID, uint gidx : SV_GroupIndex)
{
    MyStruct src;
    src.Init(0.f);

    // For some reason, this copy construction is required to
    // reproduce the issue
    MyStruct dst = src;

    for (int i = 0; i < LOOP_COUNT; i++)
        buffer.Store<float>(i * 4, dst.values[i]);
}
You can compile it with the following commands:
dxc.exe -E MainCS -T cs_6_6 -DUSE_MEMBER test.hlsl -Fo with_USE_MEMBER.out
dxc.exe -E MainCS -T cs_6_6 test.hlsl -Fo without_USE_MEMBER.out
Actual Behavior The following files are generated. They are expected to be the same or similar in size.
-rwxrwxrwx 1 jkwak jkwak 3124 Apr 9 00:28 with_USE_MEMBER.out
-rwxrwxrwx 1 jkwak jkwak 31072 Apr 9 00:28 without_USE_MEMBER.out
Compile time was 0.1 seconds for with_USE_MEMBER.out and 0.4 seconds for without_USE_MEMBER.out.
Environment
- DXC version : dxcompiler.dll: 1.7 - 1.7.0.4152 (93ad5b313)
- Host Operating System : Windows 11
The slow compile time is a product of a known bug in LLVM 3.7 where propagation of alias analysis information is really slow. Fixing that bug is not easy. The fixes that went into LLVM 4.0 are massive and complicated, which makes backporting them to DXC infeasible.
As to why you see this with an `inout` parameter but not the implicit `this` parameter, that is how HLSL is defined. In HLSL, function parameters are always passed with value semantics. For `inout` parameters that means the argument is copied in and copied back out. We are tracking a feature proposal to add references in the future, but that change is not coming soon due to technical debt in DXC. You don't see this problem with member functions because the implicit object parameter (`this`) is defined to be a true reference, so it isn't copied in and out.
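To make the copy-in/copy-out rule concrete, here is a minimal sketch written by hand. It only illustrates the semantics described above and is not the IR that DXC actually emits. The names `SetValueInout` and `tmp`, and the array size of 4, are made up for this example.

```hlsl
RWByteAddressBuffer buffer;

struct MyStruct
{
    float values[4];
};

// A global function taking the struct as an inout parameter.
void SetValueInout(inout MyStruct o, int i, float v)
{
    o.values[i] = v;
}

[numthreads(1, 1, 1)]
void MainCS(uint gidx : SV_GroupIndex)
{
    MyStruct s = (MyStruct)0;

    // What the user writes:
    SetValueInout(s, 1, 2.0f);

    // What such a call means under value semantics, spelled out by hand
    // (illustration only, not the compiler's actual lowering):
    MyStruct tmp = s;      // copy in: the whole struct, including the array
    tmp.values[2] = 3.0f;  // the callee body operates on the temporary
    s = tmp;               // copy out: the whole struct is written back

    buffer.Store<float>(0, s.values[1] + s.values[2]);
}
```

With a large `values` array and the call placed inside a loop, each iteration conceptually carries two full-struct copies, which is the work the optimizer has to remove.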
LLVM can generally optimize away the copies at the cost of compile time. Your example here is demonstrating a case where LLVM 3.7 is clearly failing to fully optimize away the copies. This is caused by the loop unroll limit being set to 50. If the size is less than 50, we optimize the copies away; if it is greater than 50, we don't. You can see a succinct demonstration of that here.
With a slight change to your code to add the `[unroll]` attribute before the loop, the compiler can handle any size loop as demonstrated here. This does come at a cost of compile time due to the aforementioned bug in alias analysis.
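The linked demonstration is not reproduced in this thread, so the following is only a sketch of what that change might look like. It assumes the loop in question is the one in `Init` that calls `MyStruct_SetValue`. The rest is the original repro with the `USE_MEMBER` branch removed.

```hlsl
#ifndef LOOP_COUNT
#define LOOP_COUNT 1000
#endif

RWByteAddressBuffer buffer;

struct MyStruct;
void MyStruct_SetValue(inout MyStruct o, int i, float v);

struct MyStruct
{
    float values[LOOP_COUNT];

    void Init(float v)
    {
        // Assumption: this is the loop the [unroll] suggestion refers to.
        [unroll]
        for (int i = 0; i < LOOP_COUNT; ++i)
        {
            MyStruct_SetValue(this, i, v);
        }
    }
};

void MyStruct_SetValue(inout MyStruct o, int i, float v)
{
    o.values[i] = v;
}

[numthreads(4, 1, 1)]
void MainCS(uint2 gid : SV_GroupID, uint gidx : SV_GroupIndex)
{
    MyStruct src;
    src.Init(0.f);

    MyStruct dst = src;

    for (int i = 0; i < LOOP_COUNT; i++)
        buffer.Store<float>(i * 4, dst.values[i]);
}
```

Because `LOOP_COUNT` is guarded by `#ifndef`, it can be overridden with the same `-D` option used for `USE_MEMBER` (for example `-DLOOP_COUNT=40`) to compare sizes on either side of the threshold of 50 mentioned above.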
We could try a timeboxed investigation into removing the propagation of alias information during inlining, with the understanding that the alias analysis will need to be recomputed, but that might be a net win if it avoids this performance cliff.
Thank you for the explanation. That makes a lot of sense. And I am fine with closing the issue unless you guys want to do more with it.
Allowing general references in user code, especially in return values, may not be a good idea as it opens up the ability to do arbitrary pointer logic in arbitrary memory space, which can be a big can of worms on its own (e.g. passing l-values from buffers, textures, local or global variables to `f(a, b)` that may return either `a` or `b`, and now the address space of your return reference is no longer compile time determinable). It is a lot of work, creates a lot of issues, and has little return.
Instead, given how DXC internally lowers member methods to LLVM, I wonder if it is possible for HLSL to accept something more lightweight than full references, by just allowing by-ref parameter passing like:
void memberMethod(__ref MyType this) { ... }
> Allowing general references in user code, especially in return values, may not be a good idea as it opens up the ability to do arbitrary pointer logic in arbitrary memory space, which can be a big can of worms on its own (e.g. passing l-values from buffers, textures, local or global variables to `f(a, b)` that may return either `a` or `b`, and now the address space of your return reference is no longer compile time determinable). It is a lot of work, creates a lot of issues, and has little return.
This is kinda out-of-scope for an issue discussion, but I disagree. Introducing references to the language also requires supporting explicit address space annotations because casting addresses between address spaces needs to be illegal (something we kinda gloss over today in DXC's ASTs). This will also allow us to fix some of the places in the language where we require addresses but don't actually have a way to represent the address or address space.
> Instead, given how DXC internally lowers member methods to LLVM, I wonder if it is possible for HLSL to accept something more lightweight than full references, by just allowing by-ref parameter passing like:
> void memberMethod(__ref MyType this) { ... }
I don't see how this solves the problem. Spelling it as `__ref` doesn't solve any complication over just spelling it as `MyType&`. We have an explicit goal not to introduce syntax that diverges from C/C++ unless there is a really good reason.
The way DXC handles references today is really really broken (especially for the implicit `this` object). In most cases we ignore address spaces entirely and just hope the IR optimizer cleans it all up by illegally converting pointer types.
I haven't yet spec'd all the details, but my proposal is to add explicit address space annotations for `device` (1), `constant` (2) and `node` (6), as well as utilities to create explicit temporary expiring references that behave like `inout` to map from a non-thread address to a thread address.
It will also be possible to use templates to instantiate per-address-space variants of functions if users wish to avoid copies and not need to duplicate functions by address space.