vkoverhead
vkoverhead copied to clipboard
Optimize benchmark loops
Hoist the load from dispatch table out of the draw loop. The compiler can't do this because it doesn't know that the dispatch table can't change. This avoids one cache miss per iteration.
Benefits:
- With less overhead the driver should matter more
- Numbers go vroom
Drawbacks:
- Only converts one test
- Breaks the nice macro abstraction
20 runs of vkoverhead -test 0 -output-only
, alternating between old/new to reduce boost and thermal effects
On AMD 5950x with Nvidia driver:
N Min Max Median Avg Stddev
x 20 82285 84369 83911 83753.7 579.60087
+ 20 81658 85747 84899 84472.8 1106.8505
Difference at 95.0% confidence
719.1 +/- 565.464
0.858589% +/- 0.675151%
(Student's t, pooled s = 883.475)
No significant change on Intel i5-2500 with RADV driver. Either the effect is too small to measure or Intel CPU can speculate past this.
N Min Max Median Avg Stddev
x 20 15003 15300 15177 15165.85 66.840167
+ 20 14964 15349 15185 15175.4 90.242014
No difference proven at 95.0% confidence
Hm interesting. It seems like this could be solved by making the dev->vk
vtable const? This should enable the compiler to avoid cache misses, or so I'd imagine.
Unfortunately not. In C/C++ const
means "can't be modified through this pointer". Aliasing pointers are free to modify memory contents and even const
can be cast away, though that might be UB in some cases. Since the function pointer is in a library outside the current program the compiler can't prove it safe and can't do this optimization.
Optimized more loops in the same way, made the function pointer load a bit nicer and added a comment.
Tested only partially because of #7. Crashes in the same places before and after.
I'm still thinking on this; I like the effect, but I don't like the complexity it requires from each test case. If I can't come up with an alternative in a day or two, we'll have to go with it.
Can you test with changing the declaration of struct vk_device *dev;
to const struct vk_device *restrict dev;
?
I think a combination of const+restrict should achieve the same value here with less code complexity.
No change in the generated code for function draw
I think it should be possible to do this, it's just a question of how far down the const restrict
needs to go. Might be the case that it has to be piped all the way down through vk_dispatch_table_gen.py
, though you could probably just edit the generated code and see.
vk_dispatch_table
member in vk_device
can't be made const without some refactoring.
I think refactoring there for init is going to be easier in the long run than writing/rewriting every test case to fool the compiler into generating the same code.
After looking at this in godbolt for a while, it looks like declaring dev
as struct vk_device *const dev
will generate the same code as this intends. I think we can just do that and keep everything simple.
That does not compile:
../src/vkoverhead.c: In function ‘main’:
../src/vkoverhead.c:2558:8: error: assignment of read-only variable ‘dev’
2558 | dev = vk_device_create();
Which is a shame because this is actually a better way since it would apply to the renderpass helper methods as well.
It would require some small tweaks to device creation; something like
static struct vk_device device;
struct vk_device *const dev = &device;
as the variable declarations, and then changing vk_device_create
to initialize &device
should work.
Are you going to do this or should I attempt it?
Up to you. If you're interested, have at it, otherwise I can take it.
I'll see about it tomorrow. If you can do it before that then go ahead.
I'm not in a rush
It does not work. Change here: https://github.com/turol/vkoverhead/tree/const
The loop on main branch:
0x0000000000022f88 <+216>: mov rax,QWORD PTR [rip+0xbcb621] # 0xbee5b0 <dev>
0x0000000000022f8f <+223>: add ebp,0x1
0x0000000000022f92 <+226>: xor r9d,r9d
0x0000000000022f95 <+229>: xor r8d,r8d
0x0000000000022f98 <+232>: mov rdi,QWORD PTR [rip+0x59539] # 0x7c4d8 <cmdbuf>
0x0000000000022f9f <+239>: xor ecx,ecx
0x0000000000022fa1 <+241>: mov edx,0x1
0x0000000000022fa6 <+246>: mov esi,0x3
0x0000000000022fab <+251>: call QWORD PTR [rax+0x1d08]
0x0000000000022fb1 <+257>: add QWORD PTR [rip+0x558a7],0x1 # 0x78860 <count>
0x0000000000022fb9 <+265>: cmp ebp,ebx
0x0000000000022fbb <+267>: jne 0x22f88 <draw+216>
The loop on my const branch:
0x0000000000022c88 <+216>: add ebp,0x1
0x0000000000022c8b <+219>: mov rdi,QWORD PTR [rip+0x58846] # 0x7b4d8 <cmdbuf>
0x0000000000022c92 <+226>: xor r9d,r9d
0x0000000000022c95 <+229>: xor r8d,r8d
0x0000000000022c98 <+232>: xor ecx,ecx
0x0000000000022c9a <+234>: mov edx,0x1
0x0000000000022c9f <+239>: mov esi,0x3
0x0000000000022ca4 <+244>: call QWORD PTR [rip+0xbcc61e] # 0xbef2c8 <device+7432>
0x0000000000022caa <+250>: add QWORD PTR [rip+0x54bae],0x1 # 0x77860 <count>
0x0000000000022cb2 <+258>: cmp ebp,ebx
0x0000000000022cb4 <+260>: jne 0x22c88 <draw+216>
As you can see the call is still using a memory load. It's avoiding loading the dev pointer however. Adding restrict
does not help. clang produces similar code.
Contrast this branch (vroom):
0x0000000000023268 <+232>: add ebp,0x1
0x000000000002326b <+235>: mov rdi,QWORD PTR [rip+0x59266] # 0x7c4d8 <cmdbuf>
0x0000000000023272 <+242>: xor r9d,r9d
0x0000000000023275 <+245>: xor r8d,r8d
0x0000000000023278 <+248>: xor ecx,ecx
0x000000000002327a <+250>: mov edx,0x1
0x000000000002327f <+255>: mov esi,0x3
0x0000000000023284 <+260>: call r12
0x0000000000023287 <+263>: add QWORD PTR [rip+0x555d1],0x1 # 0x78860 <count>
0x000000000002328f <+271>: cmp ebp,ebx
0x0000000000023291 <+273>: jne 0x23268 <draw+232>
You can see the function address is kept in a register. The Vulkan driver probably spills it but that goes on the stack and stays either in cache or maybe even a CPU internal buffer.
The loop still contains reload of cmdbuf
and increment of count
. With those fixed in https://github.com/turol/vkoverhead/tree/extreme :
0x0000000000022fa0 <+240>: add ebp,0x1
0x0000000000022fa3 <+243>: xor r9d,r9d
0x0000000000022fa6 <+246>: xor r8d,r8d
0x0000000000022fa9 <+249>: xor ecx,ecx
0x0000000000022fab <+251>: mov edx,0x1
0x0000000000022fb0 <+256>: mov esi,0x3
0x0000000000022fb5 <+261>: mov rdi,r13
0x0000000000022fb8 <+264>: call r12
0x0000000000022fbb <+267>: cmp ebp,ebx
0x0000000000022fbd <+269>: jne 0x22fa0 <draw+240>
All memory access in the loop has gone away. This version produces a +2.5% improvement on AMD 5950X and a statistically significant +0.6% improvement on Intel i5-2500.
Nice findings.
https://github.com/turol/vkoverhead/tree/extreme doesn't seem to have any new commits though?
Fixed extreme branch
Alright, after more consideration, I think maybe we're going about this the wrong way. Instead of trying to solve this at the test level, which complicates every test, we should be trying to solve this at the function pointer level.
I think what we really want to do here is use ifuncs:
- modify the dispatch table codegen
- for every function, generate both a function declaration (e.g., VKCmdDraw) and a resolver function (e.g., resolve_vkCmdDraw`
- replace all VK(func) occurrences with VKfunc
So this would look like:
void VKCmdDraw(VkCommandBuffer cmdbuf, uint32_t vertexCount, uint32_t instanceCount, uint32_t firstVertex, uint32_t firstInstance) __attribute__((ifunc ("resolve_vkCmdDraw")));
static void *resolve_vkCmdDraw(void)
{
return vkGetDeviceProcAddr(dev->dev, "vkCmdDraw");
}
The resolver function would be evaluated the first call, and then each ifunc function would "be" the Vulkan call for the remainder of the program. No extra code needed in the tests, same performance as your extreme branch.
I don't know if that does the same thing. The description says that it patches the PLT so it sounds like it would still be loaded from memory. Unless you have tested that this is a load that the compiler is allowed to hoist?
Is this supported by clang? It definitely will not be supported on Windows but I don't know if you care about portability that much.
In any case since it appears to be a change to the loader generator I don't want to do it. Do you intend to do it?
Alright, so after exploring ifuncs further, this isn't a tractable solution because of the GetProcAddress requirement.
I'm still not super enthusiastic about having to change every single test (and every future test) to handle this. Wouldn't having codegen just generate functions as global pointers solve the same issue?
I think the compiler would still consider those clobbered by the function call but I have not tested that.
Would be worth testing. If it works, the dispatch table can just be made a global, and everything becomes much simpler and less work.
I wrote a quick test case https://godbolt.org/z/PxYPdarqv
typedef void (*func_type)(int, int, int, int);
static
func_type global;
void set(func_type f) {
global = f;
}
void use_global() {
for (int i = 0; i < 100; i++) {
global(1, 2, 3, 4);
}
}
void use_local() {
func_type local = global;
for (int i = 0; i < 100; i++) {
local(1, 2, 3, 4);
}
}
Even with the function as a global pointer it will only be kept in a register if explicitly loaded into a local variable.
Rebased and optimized the vkCmdSetDescriptorBufferOffsetsEXT loops.
Hi, I think optimizing the loops is fine as long as it doesn't hurt the readability and maintainability of the individual tests. In other words, I'm not a fan of having to load the individual function pointers on each test.
vkoverhead will inevitably have application-side overhead; but every driver is paying the same cost. It doesn't really matter for measuring performance changes or doing comparisons.
I don't think "every driver is paying the same cost". It depends heavily on the size of the CPU caches and how much the driver pollutes it. Depending on which cache level the vkoverhead data is evicted to there can be sudden jumps or drops in performance numbers.
I've taken some time over my sabbatical to consider this, and I think I'm in agreement with @rlocatti-nv now. While I certainly want to make vkoverhead as fast as possible, I don't think I can justify complex, test-specific changes like these which will both reduce maintainability and make it more cumbersome to add new tests.