blade icon indicating copy to clipboard operation
blade copied to clipboard

Should `descriptor_pool_info` include `FREE_DESCRIPTOR_SET` flag?

Open dannycoates opened this issue 1 year ago • 1 comments

From https://github.com/zed-industries/zed/issues/10018#issuecomment-2107354282

dannycoates avatar May 14 '24 13:05 dannycoates

VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT specifies that descriptor sets can return their individual allocations to the pool, i.e. all of vkAllocateDescriptorSets, vkFreeDescriptorSets, and vkResetDescriptorPool are allowed. Otherwise, descriptor sets allocated from the pool must not be individually freed back to the pool, i.e. only vkAllocateDescriptorSets and vkResetDescriptorPool are allowed.

We aren't doing any vkFreeDescriptorSets so this flag is meaningless. In Blade, descriptor pools are assigned to command encoders, and are reset when the encoders are reset.

This looks like one of the following things to me:

  1. A driver bug. Intel Vulkan drivers aren't expecting our usage pattern. We'll need to reach out to them and possibly land a vendor-specific workaround in Blade, such as specifying VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT
  2. We are doing something wrong. Could you run with Vulkan Configurator that enables the Validation and see if anything pops up when it runs out of pool memory?

kvark avatar May 15 '24 05:05 kvark

After several hours of debugging, I can confirm that this issue only occurs with Intel's drivers.

filename: igvk64.dll
fileversion: 31.0.101.4502
filedescription: Vulkan(R) Driver for Intel(R) Graphics Accelerator
__int64 __fastcall vkDescriptorAllocatePoolSet(__int64 a1, __int64 a2, __int64 a3)
{
	...
	// Check descriptor_pool->dword424 is bigger than set.
	// the throw VK_ERROR_OUT_OF_POOL_MEMORY error.
	//
	// When this error occurs.
	// 0:000> dd 0x00000212bdec4878+424 l1
	// 00000212`bdec4c9c  0000ea60
	// 0:000> dd 0x00000212bdec4878+420 l1
	// 00000212`bdec4c98  0000ea60
	//
	if ( v11 && (unsigned int)(descriptor_pool->dword424 + 1) > descriptor_pool->dword420 )
        {
            sub_180470550(&v36, 20LL);
            v37[0] = v36;
            CxxThrowException(v37, (_ThrowInfo *)&_TI1_AUException_VK__);
        }
	...

	// here inc dword424 
        *(_DWORD *)(v14 + 68) = descriptor_pool->dword424++;
        v30 = v12;
        v20 = v11;
        v21 = playout;
}
__int64 __fastcall vkResetDescriptorPool(__int64 hDevice, struct_descriptor_pool *descriptor_pool)
{
	// 
	// VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT = 1
	// defalut descriptor_pool->flags == 0
	// 
	if ( descriptor_pool->flags ){
		...
	
		//
		//  void __fastcall sub_18039C3B0(struct_descriptor_pool *descriptor_pool, __int64 a2)
		//  {
		//	 ...
		//	 // ONLY this way to dec descriptor_pool->dword424
		//      descriptor_pool->dword424 -= *(_DWORD *)(a2 + 68) != -1;
  		//      v16 = *(unsigned int *)(a2 + 68);
		//      ...
		// 	}
		//
	
		sub_18039C3B0(descriptor_pool, v11);
		...
	}else{
		// NERVER DEC descriptor_pool->dword424 
	}
}

progmboy avatar May 27 '24 11:05 progmboy

nice work, @progmboy ! Would you be able to file this upstream to Intel folks? Seems rather serious.

side note, things are going to be slightly better with #118 now

kvark avatar May 28 '24 00:05 kvark

@kvark Does adding FREE_DESCRIPTOR_SET flag affect other platforms? From Vulkan specification:

VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT specifies that descriptor sets can return their individual allocations to the pool, i.e. all of vkAllocateDescriptorSets, vkFreeDescriptorSets, and vkResetDescriptorPool are allowed. Otherwise, descriptor sets allocated from the pool must not be individually freed back to the pool, i.e. only vkAllocateDescriptorSets and vkResetDescriptorPool are allowed.

Even though blade only releases entire pools, it seems like this flag only extends you to allow to release each set individually. If this fixes the Intel driver bug, and doesn't meddle with other graphics cards, I think it'd be a good solution.

ske2004 avatar Jun 08 '24 11:06 ske2004

I suspect it may change the allocation strategy for the descriptors, thus making it slower for us to allocate new ones. And this would be potentially significant since Blade relies on allocating descriptors many times per frame, potentially thousands of times. For this reason, and the fact that the current code agrees with Vulkan spec, I wouldn't want to put this workaround unconditionally. I am, however, open to put it explicitly as a workaround for a specific platform bug. Looks to be affecting Windows Intel 11th gen? That's a pretty popular config...

kvark avatar Jun 09 '24 07:06 kvark

Yeah sounds good.

ske2004 avatar Jun 09 '24 11:06 ske2004

Interestingly, someone else is having this issue but on an AMD GPU: https://github.com/zed-industries/zed/issues/12561#issuecomment-2156365422 I'm not sure if these are related.

ske2004 avatar Jun 09 '24 11:06 ske2004

image

more about the GPU it's an RX 580 8G with the latest driver of 24.3.1

image

shenjackyuanjie avatar Jun 09 '24 15:06 shenjackyuanjie

Uhh, that's rather unfortunate. Maybe it's just a Windows thing in general?

kvark avatar Jun 10 '24 04:06 kvark

Uhh, that's rather unfortunate. Maybe it's just a Windows thing in general?

but there is some good news, on zed's last commit, it looks fine

e829a8c3b0564bc902cc0d0a530099be7f0f036e

shenjackyuanjie avatar Jun 10 '24 04:06 shenjackyuanjie

ah, ok, phew, so it was addressed by https://github.com/kvark/blade/pull/118

kvark avatar Jun 10 '24 05:06 kvark

Closed by #129

kvark avatar Jul 09 '24 15:07 kvark