vkoverhead Optimize benchmark loops

Hoist the load from dispatch table out of the draw loop. The compiler can't do this because it doesn't know that the dispatch table can't change. This avoids one cache miss per iteration.

Benefits:

With less overhead the driver should matter more
Numbers go vroom

Drawbacks:

Only converts one test
Breaks the nice macro abstraction

20 runs of vkoverhead -test 0 -output-only, alternating between old/new to reduce boost and thermal effects On AMD 5950x with Nvidia driver:

    N           Min           Max        Median           Avg        Stddev
x  20         82285         84369         83911       83753.7     579.60087
+  20         81658         85747         84899       84472.8     1106.8505
Difference at 95.0% confidence
	719.1 +/- 565.464
	0.858589% +/- 0.675151%
	(Student's t, pooled s = 883.475)

No significant change on Intel i5-2500 with RADV driver. Either the effect is too small to measure or Intel CPU can speculate past this.

    N           Min           Max        Median           Avg        Stddev
x  20         15003         15300         15177      15165.85     66.840167
+  20         14964         15349         15185       15175.4     90.242014
No difference proven at 95.0% confidence

Sep 16 '22 08:09 turol

Hm interesting. It seems like this could be solved by making the dev->vk vtable const? This should enable the compiler to avoid cache misses, or so I'd imagine.

Sep 16 '22 17:09 zmike

Unfortunately not. In C/C++ const means "can't be modified through this pointer". Aliasing pointers are free to modify memory contents and even const can be cast away, though that might be UB in some cases. Since the function pointer is in a library outside the current program the compiler can't prove it safe and can't do this optimization.

Sep 16 '22 18:09 turol

Optimized more loops in the same way, made the function pointer load a bit nicer and added a comment.

Tested only partially because of #7. Crashes in the same places before and after.

Sep 19 '22 12:09 turol

I'm still thinking on this; I like the effect, but I don't like the complexity it requires from each test case. If I can't come up with an alternative in a day or two, we'll have to go with it.

Sep 19 '22 13:09 zmike

Can you test with changing the declaration of struct vk_device *dev; to const struct vk_device *restrict dev; ?

I think a combination of const+restrict should achieve the same value here with less code complexity.

Sep 19 '22 16:09 zmike

No change in the generated code for function draw

Sep 19 '22 16:09 turol

I think it should be possible to do this, it's just a question of how far down the const restrict needs to go. Might be the case that it has to be piped all the way down through vk_dispatch_table_gen.py, though you could probably just edit the generated code and see.

Sep 19 '22 16:09 zmike

vk_dispatch_table member in vk_device can't be made const without some refactoring.

Sep 19 '22 16:09 turol

I think refactoring there for init is going to be easier in the long run than writing/rewriting every test case to fool the compiler into generating the same code.

Sep 19 '22 17:09 zmike

After looking at this in godbolt for a while, it looks like declaring dev as struct vk_device *const dev will generate the same code as this intends. I think we can just do that and keep everything simple.

Sep 20 '22 14:09 zmike

That does not compile:

../src/vkoverhead.c: In function ‘main’:
../src/vkoverhead.c:2558:8: error: assignment of read-only variable ‘dev’
 2558 |    dev = vk_device_create();

Which is a shame because this is actually a better way since it would apply to the renderpass helper methods as well.

Sep 20 '22 14:09 turol

It would require some small tweaks to device creation; something like

static struct vk_device device;
struct vk_device *const dev = &device;

as the variable declarations, and then changing vk_device_create to initialize &device should work.

Sep 20 '22 15:09 zmike

Are you going to do this or should I attempt it?

Sep 20 '22 15:09 turol

Up to you. If you're interested, have at it, otherwise I can take it.

Sep 20 '22 16:09 zmike

I'll see about it tomorrow. If you can do it before that then go ahead.

Sep 20 '22 16:09 turol

I'm not in a rush

Sep 20 '22 16:09 zmike

It does not work. Change here: https://github.com/turol/vkoverhead/tree/const

The loop on main branch:

   0x0000000000022f88 <+216>:	mov    rax,QWORD PTR [rip+0xbcb621]        # 0xbee5b0 <dev>
   0x0000000000022f8f <+223>:	add    ebp,0x1
   0x0000000000022f92 <+226>:	xor    r9d,r9d
   0x0000000000022f95 <+229>:	xor    r8d,r8d
   0x0000000000022f98 <+232>:	mov    rdi,QWORD PTR [rip+0x59539]        # 0x7c4d8 <cmdbuf>
   0x0000000000022f9f <+239>:	xor    ecx,ecx
   0x0000000000022fa1 <+241>:	mov    edx,0x1
   0x0000000000022fa6 <+246>:	mov    esi,0x3
   0x0000000000022fab <+251>:	call   QWORD PTR [rax+0x1d08]
   0x0000000000022fb1 <+257>:	add    QWORD PTR [rip+0x558a7],0x1        # 0x78860 <count>
   0x0000000000022fb9 <+265>:	cmp    ebp,ebx
   0x0000000000022fbb <+267>:	jne    0x22f88 <draw+216>

The loop on my const branch:

   0x0000000000022c88 <+216>:	add    ebp,0x1
   0x0000000000022c8b <+219>:	mov    rdi,QWORD PTR [rip+0x58846]        # 0x7b4d8 <cmdbuf>
   0x0000000000022c92 <+226>:	xor    r9d,r9d
   0x0000000000022c95 <+229>:	xor    r8d,r8d
   0x0000000000022c98 <+232>:	xor    ecx,ecx
   0x0000000000022c9a <+234>:	mov    edx,0x1
   0x0000000000022c9f <+239>:	mov    esi,0x3
   0x0000000000022ca4 <+244>:	call   QWORD PTR [rip+0xbcc61e]        # 0xbef2c8 <device+7432>
   0x0000000000022caa <+250>:	add    QWORD PTR [rip+0x54bae],0x1        # 0x77860 <count>
   0x0000000000022cb2 <+258>:	cmp    ebp,ebx
   0x0000000000022cb4 <+260>:	jne    0x22c88 <draw+216>

As you can see the call is still using a memory load. It's avoiding loading the dev pointer however. Adding restrict does not help. clang produces similar code.

Contrast this branch (vroom):

   0x0000000000023268 <+232>:	add    ebp,0x1
   0x000000000002326b <+235>:	mov    rdi,QWORD PTR [rip+0x59266]        # 0x7c4d8 <cmdbuf>
   0x0000000000023272 <+242>:	xor    r9d,r9d
   0x0000000000023275 <+245>:	xor    r8d,r8d
   0x0000000000023278 <+248>:	xor    ecx,ecx
   0x000000000002327a <+250>:	mov    edx,0x1
   0x000000000002327f <+255>:	mov    esi,0x3
   0x0000000000023284 <+260>:	call   r12
   0x0000000000023287 <+263>:	add    QWORD PTR [rip+0x555d1],0x1        # 0x78860 <count>
   0x000000000002328f <+271>:	cmp    ebp,ebx
   0x0000000000023291 <+273>:	jne    0x23268 <draw+232>

You can see the function address is kept in a register. The Vulkan driver probably spills it but that goes on the stack and stays either in cache or maybe even a CPU internal buffer.

The loop still contains reload of cmdbuf and increment of count. With those fixed in https://github.com/turol/vkoverhead/tree/extreme :

   0x0000000000022fa0 <+240>:	add    ebp,0x1
   0x0000000000022fa3 <+243>:	xor    r9d,r9d
   0x0000000000022fa6 <+246>:	xor    r8d,r8d
   0x0000000000022fa9 <+249>:	xor    ecx,ecx
   0x0000000000022fab <+251>:	mov    edx,0x1
   0x0000000000022fb0 <+256>:	mov    esi,0x3
   0x0000000000022fb5 <+261>:	mov    rdi,r13
   0x0000000000022fb8 <+264>:	call   r12
   0x0000000000022fbb <+267>:	cmp    ebp,ebx
   0x0000000000022fbd <+269>:	jne    0x22fa0 <draw+240>

All memory access in the loop has gone away. This version produces a +2.5% improvement on AMD 5950X and a statistically significant +0.6% improvement on Intel i5-2500.

Sep 21 '22 12:09 turol

Nice findings.

https://github.com/turol/vkoverhead/tree/extreme doesn't seem to have any new commits though?

Sep 21 '22 12:09 zmike

Fixed extreme branch

Sep 21 '22 12:09 turol

Alright, after more consideration, I think maybe we're going about this the wrong way. Instead of trying to solve this at the test level, which complicates every test, we should be trying to solve this at the function pointer level.

I think what we really want to do here is use ifuncs:

modify the dispatch table codegen
- for every function, generate both a function declaration (e.g., VKCmdDraw) and a resolver function (e.g., resolve_vkCmdDraw`
replace all VK(func) occurrences with VKfunc

So this would look like:

void VKCmdDraw(VkCommandBuffer cmdbuf, uint32_t vertexCount, uint32_t instanceCount, uint32_t firstVertex, uint32_t firstInstance) __attribute__((ifunc ("resolve_vkCmdDraw")));

static void *resolve_vkCmdDraw(void)
{
   return vkGetDeviceProcAddr(dev->dev, "vkCmdDraw");
}

The resolver function would be evaluated the first call, and then each ifunc function would "be" the Vulkan call for the remainder of the program. No extra code needed in the tests, same performance as your extreme branch.

Sep 22 '22 14:09 zmike

I don't know if that does the same thing. The description says that it patches the PLT so it sounds like it would still be loaded from memory. Unless you have tested that this is a load that the compiler is allowed to hoist?

Is this supported by clang? It definitely will not be supported on Windows but I don't know if you care about portability that much.

In any case since it appears to be a change to the loader generator I don't want to do it. Do you intend to do it?

Sep 22 '22 16:09 turol

Alright, so after exploring ifuncs further, this isn't a tractable solution because of the GetProcAddress requirement.

I'm still not super enthusiastic about having to change every single test (and every future test) to handle this. Wouldn't having codegen just generate functions as global pointers solve the same issue?

Sep 22 '22 17:09 zmike

I think the compiler would still consider those clobbered by the function call but I have not tested that.

Sep 22 '22 17:09 turol

Would be worth testing. If it works, the dispatch table can just be made a global, and everything becomes much simpler and less work.

Sep 23 '22 12:09 zmike

I wrote a quick test case https://godbolt.org/z/PxYPdarqv

typedef void (*func_type)(int, int, int, int);


static
func_type global;


void set(func_type f) {
	global = f;
}


void use_global() {
	for (int i = 0; i < 100; i++) {
		global(1, 2, 3, 4);
	}
}


void use_local() {
	func_type local = global;

	for (int i = 0; i < 100; i++) {
		local(1, 2, 3, 4);
	}
}

Even with the function as a global pointer it will only be kept in a register if explicitly loaded into a local variable.

Nov 22 '22 14:11 turol

Rebased and optimized the vkCmdSetDescriptorBufferOffsetsEXT loops.

Jan 05 '23 11:01 turol

Hi, I think optimizing the loops is fine as long as it doesn't hurt the readability and maintainability of the individual tests. In other words, I'm not a fan of having to load the individual function pointers on each test.

vkoverhead will inevitably have application-side overhead; but every driver is paying the same cost. It doesn't really matter for measuring performance changes or doing comparisons.

Jan 05 '23 21:01 rlocatti-nv

I don't think "every driver is paying the same cost". It depends heavily on the size of the CPU caches and how much the driver pollutes it. Depending on which cache level the vkoverhead data is evicted to there can be sudden jumps or drops in performance numbers.

Jan 10 '23 14:01 turol

I've taken some time over my sabbatical to consider this, and I think I'm in agreement with @rlocatti-nv now. While I certainly want to make vkoverhead as fast as possible, I don't think I can justify complex, test-specific changes like these which will both reduce maintainability and make it more cumbersome to add new tests.

Jan 10 '23 14:01 zmike

vkoverhead vkoverhead copied to clipboard

Optimize benchmark loops

vkoverhead
vkoverhead copied to clipboard