vkoverhead icon indicating copy to clipboard operation
vkoverhead copied to clipboard

Optimize benchmark loops

Open turol opened this issue 2 years ago • 29 comments

Hoist the load from dispatch table out of the draw loop. The compiler can't do this because it doesn't know that the dispatch table can't change. This avoids one cache miss per iteration.

Benefits:

  1. With less overhead the driver should matter more
  2. Numbers go vroom

Drawbacks:

  1. Only converts one test
  2. Breaks the nice macro abstraction

20 runs of vkoverhead -test 0 -output-only, alternating between old/new to reduce boost and thermal effects On AMD 5950x with Nvidia driver:

    N           Min           Max        Median           Avg        Stddev
x  20         82285         84369         83911       83753.7     579.60087
+  20         81658         85747         84899       84472.8     1106.8505
Difference at 95.0% confidence
	719.1 +/- 565.464
	0.858589% +/- 0.675151%
	(Student's t, pooled s = 883.475)

No significant change on Intel i5-2500 with RADV driver. Either the effect is too small to measure or Intel CPU can speculate past this.

    N           Min           Max        Median           Avg        Stddev
x  20         15003         15300         15177      15165.85     66.840167
+  20         14964         15349         15185       15175.4     90.242014
No difference proven at 95.0% confidence

turol avatar Sep 16 '22 08:09 turol

Hm interesting. It seems like this could be solved by making the dev->vk vtable const? This should enable the compiler to avoid cache misses, or so I'd imagine.

zmike avatar Sep 16 '22 17:09 zmike

Unfortunately not. In C/C++ const means "can't be modified through this pointer". Aliasing pointers are free to modify memory contents and even const can be cast away, though that might be UB in some cases. Since the function pointer is in a library outside the current program the compiler can't prove it safe and can't do this optimization.

turol avatar Sep 16 '22 18:09 turol

Optimized more loops in the same way, made the function pointer load a bit nicer and added a comment.

Tested only partially because of #7. Crashes in the same places before and after.

turol avatar Sep 19 '22 12:09 turol

I'm still thinking on this; I like the effect, but I don't like the complexity it requires from each test case. If I can't come up with an alternative in a day or two, we'll have to go with it.

zmike avatar Sep 19 '22 13:09 zmike

Can you test with changing the declaration of struct vk_device *dev; to const struct vk_device *restrict dev; ?

I think a combination of const+restrict should achieve the same value here with less code complexity.

zmike avatar Sep 19 '22 16:09 zmike

No change in the generated code for function draw

turol avatar Sep 19 '22 16:09 turol

I think it should be possible to do this, it's just a question of how far down the const restrict needs to go. Might be the case that it has to be piped all the way down through vk_dispatch_table_gen.py, though you could probably just edit the generated code and see.

zmike avatar Sep 19 '22 16:09 zmike

vk_dispatch_table member in vk_device can't be made const without some refactoring.

turol avatar Sep 19 '22 16:09 turol

I think refactoring there for init is going to be easier in the long run than writing/rewriting every test case to fool the compiler into generating the same code.

zmike avatar Sep 19 '22 17:09 zmike

After looking at this in godbolt for a while, it looks like declaring dev as struct vk_device *const dev will generate the same code as this intends. I think we can just do that and keep everything simple.

zmike avatar Sep 20 '22 14:09 zmike

That does not compile:

../src/vkoverhead.c: In function ‘main’:
../src/vkoverhead.c:2558:8: error: assignment of read-only variable ‘dev’
 2558 |    dev = vk_device_create();

Which is a shame because this is actually a better way since it would apply to the renderpass helper methods as well.

turol avatar Sep 20 '22 14:09 turol

It would require some small tweaks to device creation; something like

static struct vk_device device;
struct vk_device *const dev = &device;

as the variable declarations, and then changing vk_device_create to initialize &device should work.

zmike avatar Sep 20 '22 15:09 zmike

Are you going to do this or should I attempt it?

turol avatar Sep 20 '22 15:09 turol

Up to you. If you're interested, have at it, otherwise I can take it.

zmike avatar Sep 20 '22 16:09 zmike

I'll see about it tomorrow. If you can do it before that then go ahead.

turol avatar Sep 20 '22 16:09 turol

I'm not in a rush

zmike avatar Sep 20 '22 16:09 zmike

It does not work. Change here: https://github.com/turol/vkoverhead/tree/const

The loop on main branch:

   0x0000000000022f88 <+216>:	mov    rax,QWORD PTR [rip+0xbcb621]        # 0xbee5b0 <dev>
   0x0000000000022f8f <+223>:	add    ebp,0x1
   0x0000000000022f92 <+226>:	xor    r9d,r9d
   0x0000000000022f95 <+229>:	xor    r8d,r8d
   0x0000000000022f98 <+232>:	mov    rdi,QWORD PTR [rip+0x59539]        # 0x7c4d8 <cmdbuf>
   0x0000000000022f9f <+239>:	xor    ecx,ecx
   0x0000000000022fa1 <+241>:	mov    edx,0x1
   0x0000000000022fa6 <+246>:	mov    esi,0x3
   0x0000000000022fab <+251>:	call   QWORD PTR [rax+0x1d08]
   0x0000000000022fb1 <+257>:	add    QWORD PTR [rip+0x558a7],0x1        # 0x78860 <count>
   0x0000000000022fb9 <+265>:	cmp    ebp,ebx
   0x0000000000022fbb <+267>:	jne    0x22f88 <draw+216>

The loop on my const branch:

   0x0000000000022c88 <+216>:	add    ebp,0x1
   0x0000000000022c8b <+219>:	mov    rdi,QWORD PTR [rip+0x58846]        # 0x7b4d8 <cmdbuf>
   0x0000000000022c92 <+226>:	xor    r9d,r9d
   0x0000000000022c95 <+229>:	xor    r8d,r8d
   0x0000000000022c98 <+232>:	xor    ecx,ecx
   0x0000000000022c9a <+234>:	mov    edx,0x1
   0x0000000000022c9f <+239>:	mov    esi,0x3
   0x0000000000022ca4 <+244>:	call   QWORD PTR [rip+0xbcc61e]        # 0xbef2c8 <device+7432>
   0x0000000000022caa <+250>:	add    QWORD PTR [rip+0x54bae],0x1        # 0x77860 <count>
   0x0000000000022cb2 <+258>:	cmp    ebp,ebx
   0x0000000000022cb4 <+260>:	jne    0x22c88 <draw+216>

As you can see the call is still using a memory load. It's avoiding loading the dev pointer however. Adding restrict does not help. clang produces similar code.

Contrast this branch (vroom):

   0x0000000000023268 <+232>:	add    ebp,0x1
   0x000000000002326b <+235>:	mov    rdi,QWORD PTR [rip+0x59266]        # 0x7c4d8 <cmdbuf>
   0x0000000000023272 <+242>:	xor    r9d,r9d
   0x0000000000023275 <+245>:	xor    r8d,r8d
   0x0000000000023278 <+248>:	xor    ecx,ecx
   0x000000000002327a <+250>:	mov    edx,0x1
   0x000000000002327f <+255>:	mov    esi,0x3
   0x0000000000023284 <+260>:	call   r12
   0x0000000000023287 <+263>:	add    QWORD PTR [rip+0x555d1],0x1        # 0x78860 <count>
   0x000000000002328f <+271>:	cmp    ebp,ebx
   0x0000000000023291 <+273>:	jne    0x23268 <draw+232>

You can see the function address is kept in a register. The Vulkan driver probably spills it but that goes on the stack and stays either in cache or maybe even a CPU internal buffer.

The loop still contains reload of cmdbuf and increment of count. With those fixed in https://github.com/turol/vkoverhead/tree/extreme :

   0x0000000000022fa0 <+240>:	add    ebp,0x1
   0x0000000000022fa3 <+243>:	xor    r9d,r9d
   0x0000000000022fa6 <+246>:	xor    r8d,r8d
   0x0000000000022fa9 <+249>:	xor    ecx,ecx
   0x0000000000022fab <+251>:	mov    edx,0x1
   0x0000000000022fb0 <+256>:	mov    esi,0x3
   0x0000000000022fb5 <+261>:	mov    rdi,r13
   0x0000000000022fb8 <+264>:	call   r12
   0x0000000000022fbb <+267>:	cmp    ebp,ebx
   0x0000000000022fbd <+269>:	jne    0x22fa0 <draw+240>

All memory access in the loop has gone away. This version produces a +2.5% improvement on AMD 5950X and a statistically significant +0.6% improvement on Intel i5-2500.

turol avatar Sep 21 '22 12:09 turol

Nice findings.

https://github.com/turol/vkoverhead/tree/extreme doesn't seem to have any new commits though?

zmike avatar Sep 21 '22 12:09 zmike

Fixed extreme branch

turol avatar Sep 21 '22 12:09 turol

Alright, after more consideration, I think maybe we're going about this the wrong way. Instead of trying to solve this at the test level, which complicates every test, we should be trying to solve this at the function pointer level.

I think what we really want to do here is use ifuncs:

  • modify the dispatch table codegen
    • for every function, generate both a function declaration (e.g., VKCmdDraw) and a resolver function (e.g., resolve_vkCmdDraw`
  • replace all VK(func) occurrences with VKfunc

So this would look like:

void VKCmdDraw(VkCommandBuffer cmdbuf, uint32_t vertexCount, uint32_t instanceCount, uint32_t firstVertex, uint32_t firstInstance) __attribute__((ifunc ("resolve_vkCmdDraw")));

static void *resolve_vkCmdDraw(void)
{
   return vkGetDeviceProcAddr(dev->dev, "vkCmdDraw");
}

The resolver function would be evaluated the first call, and then each ifunc function would "be" the Vulkan call for the remainder of the program. No extra code needed in the tests, same performance as your extreme branch.

zmike avatar Sep 22 '22 14:09 zmike

I don't know if that does the same thing. The description says that it patches the PLT so it sounds like it would still be loaded from memory. Unless you have tested that this is a load that the compiler is allowed to hoist?

Is this supported by clang? It definitely will not be supported on Windows but I don't know if you care about portability that much.

In any case since it appears to be a change to the loader generator I don't want to do it. Do you intend to do it?

turol avatar Sep 22 '22 16:09 turol

Alright, so after exploring ifuncs further, this isn't a tractable solution because of the GetProcAddress requirement.

I'm still not super enthusiastic about having to change every single test (and every future test) to handle this. Wouldn't having codegen just generate functions as global pointers solve the same issue?

zmike avatar Sep 22 '22 17:09 zmike

I think the compiler would still consider those clobbered by the function call but I have not tested that.

turol avatar Sep 22 '22 17:09 turol

Would be worth testing. If it works, the dispatch table can just be made a global, and everything becomes much simpler and less work.

zmike avatar Sep 23 '22 12:09 zmike

I wrote a quick test case https://godbolt.org/z/PxYPdarqv

typedef void (*func_type)(int, int, int, int);


static
func_type global;


void set(func_type f) {
	global = f;
}


void use_global() {
	for (int i = 0; i < 100; i++) {
		global(1, 2, 3, 4);
	}
}


void use_local() {
	func_type local = global;

	for (int i = 0; i < 100; i++) {
		local(1, 2, 3, 4);
	}
}

Even with the function as a global pointer it will only be kept in a register if explicitly loaded into a local variable.

turol avatar Nov 22 '22 14:11 turol

Rebased and optimized the vkCmdSetDescriptorBufferOffsetsEXT loops.

turol avatar Jan 05 '23 11:01 turol

Hi, I think optimizing the loops is fine as long as it doesn't hurt the readability and maintainability of the individual tests. In other words, I'm not a fan of having to load the individual function pointers on each test.

vkoverhead will inevitably have application-side overhead; but every driver is paying the same cost. It doesn't really matter for measuring performance changes or doing comparisons.

rlocatti-nv avatar Jan 05 '23 21:01 rlocatti-nv

I don't think "every driver is paying the same cost". It depends heavily on the size of the CPU caches and how much the driver pollutes it. Depending on which cache level the vkoverhead data is evicted to there can be sudden jumps or drops in performance numbers.

turol avatar Jan 10 '23 14:01 turol

I've taken some time over my sabbatical to consider this, and I think I'm in agreement with @rlocatti-nv now. While I certainly want to make vkoverhead as fast as possible, I don't think I can justify complex, test-specific changes like these which will both reduce maintainability and make it more cumbersome to add new tests.

zmike avatar Jan 10 '23 14:01 zmike