Deadlock on device destruction due to mutex
Hey there, this is a bit of an odd one. I've written a Vulkan layer which requires a separate Vulkan device to be created internally. This is where I'm encountering an issue.
It seems that when the layer gets unloaded as part of the application closing, a mutex inside the loader is still locked, so my library's teardown is unable to destroy the internally created devices.
Creating the internal device via the next layer's function is not viable. I've thought about this for days and couldn't find a solution.
Source is here, though I hope this issue is understandable without having to read through my code (please ask follow-up questions!).
I wouldn't call this a bug, but I do really need help to work around this, thanks!
So I can immediately tell you to NOT call vkCreateDevice from inside of vkCreateDevice. What do I mean? DO call down the vkCreateDevice chain as you normally would. Do not call vkCreateDevice a second time to create a separate device.
Rather, you must call pfnLayerCreateDevice, which is in the VkLayerInstanceCreateInfo chain. The technical details are in the PR that added it (#220), but suffice it to say that when the application calls vkCreateDevice, certain steps are taken which must happen in order for the device to be created successfully.
I looked through the docs and found nothing, which means the bug is the lack of documentation of that requirement.
> It seems that when the layer gets unloaded as part of the application closing
That shouldn't happen. Or well, I think everyone would like that to not happen, and if there is something the loader could do during shutdown to help, I'm all ears: calling vkDestroyDevice/vkDestroyInstance on behalf of the app, for example.
After a bit of headscratching in direct messages with @charles-lunarg we've found a good chunk of information and I will summarize it here.
I will be using the terms "application", "layer" and "loader" very carefully here.
First of all, recreating the issue:
- A Vulkan layer may choose to create a secondary Vulkan device or perhaps even instance for one reason or another.
- If this creation (or the destruction later on) happens during the layer's intercepted vk{Create,Destroy}{Device,Instance}() call, the application will freeze.
Tracing the call, this is what happens:
- The application calls the trampoline function vkCreateInstance (or similar) in the loader.
- The loader locks an internal mutex called loader_lock in trampoline.c, as it is about to modify global structures.
- The loader then calls the chain of layers through to the driver, which at some point calls the problematic layer's vkCreateInstance function.
- The problematic layer then calls vkCreateInstance, which will call the trampoline function in the loader again.
- The loader tries to lock the mutex again, but since the same thread already holds it, the call blocks forever.
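The trace above can be reproduced in miniature without Vulkan. This is only a sketch: `sketch_loader_lock`, `sketch_trampoline` and the two layer functions are hypothetical stand-ins, not loader code. It also assumes an error-checking POSIX mutex, so that the second lock attempt on the same thread reports `EDEADLK` instead of hanging the way a normal mutex would.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the loader's global lock. */
static pthread_mutex_t sketch_loader_lock;

/* Use an error-checking mutex so a relock by the owning thread returns
 * EDEADLK. With a default mutex the same call would simply never return. */
static void sketch_init_lock(void) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&sketch_loader_lock, &attr);
    assert(rc == 0);
    (void)rc;
    pthread_mutexattr_destroy(&attr);
}

typedef int (*sketch_layer_fn)(void);

/* Trampoline: take the lock, call down the chain, release the lock.
 * Returns 0 on success or the error code of the failed step. */
static int sketch_trampoline(sketch_layer_fn next) {
    int rc = pthread_mutex_lock(&sketch_loader_lock);
    if (rc != 0) {
        return rc; /* EDEADLK when re-entered on the same thread */
    }
    int result = (next != NULL) ? next() : 0;
    pthread_mutex_unlock(&sketch_loader_lock);
    return result;
}

/* A layer that only does its own work and returns. */
static int sketch_well_behaved_layer(void) {
    return 0;
}

/* A layer that calls back into the trampoline while the lock is held. */
static int sketch_problematic_layer(void) {
    return sketch_trampoline(NULL);
}
```

After `sketch_init_lock()`, calling `sketch_trampoline(sketch_problematic_layer)` returns `EDEADLK`, which marks the exact point where the real application would freeze.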
A solution to this has kind of been implemented before. The pNext chain of the create info passed to the layer's vkCreateInstance contains a VkLayerInstanceCreateInfo entry with the members pfnLayer{Create,Destroy}Device, which are set to the internal loader functions, skipping the trampoline and thus skipping locking the mutex.
There are no equivalent members for creating an instance, and even the device-level functions are problematic, as they cannot easily be used from multiple threads (at all?).
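For anyone who hasn't walked a pNext chain for these callbacks before, here is a rough sketch of the lookup. All `Sketch*` types and `SKETCH_*` constants are simplified, hypothetical stand-ins so the snippet compiles on its own. In a real layer you would use `VkLayerInstanceCreateInfo` from `vk_layer.h`, which, as far as I know, is identified by `VK_STRUCTURE_TYPE_LOADER_INSTANCE_CREATE_INFO` together with `function == VK_LOADER_LAYER_CREATE_DEVICE_CALLBACK` and keeps the two pointers under `u.layerDevice`. Please check the header rather than trusting this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real Vulkan enums. */
typedef enum {
    SKETCH_STYPE_OTHER = 0,
    SKETCH_STYPE_LOADER_INSTANCE_CREATE_INFO = 1
} SketchStructureType;

typedef enum {
    SKETCH_LAYER_LINK_INFO = 0,
    SKETCH_LOADER_DATA_CALLBACK = 1,
    SKETCH_LOADER_LAYER_CREATE_DEVICE_CALLBACK = 2
} SketchLayerFunction;

/* Placeholder signatures; the real callbacks take Vulkan handles. */
typedef int (*SketchCreateDeviceFn)(void);
typedef void (*SketchDestroyDeviceFn)(void);

/* Common header that every struct in a pNext chain starts with. */
typedef struct SketchBaseInStructure {
    SketchStructureType sType;
    const void *pNext;
} SketchBaseInStructure;

/* Simplified stand-in for the loader's instance create info entry. */
typedef struct SketchLayerInstanceCreateInfo {
    SketchStructureType sType;
    const void *pNext;
    SketchLayerFunction function;
    struct {
        SketchCreateDeviceFn pfnLayerCreateDevice;
        SketchDestroyDeviceFn pfnLayerDestroyDevice;
    } layerDevice;
} SketchLayerInstanceCreateInfo;

/* Walk the chain and return the entry carrying the device callbacks,
 * or NULL if the loader did not provide one. */
static const SketchLayerInstanceCreateInfo *
sketch_find_create_device_callbacks(const void *pNext) {
    for (const SketchBaseInStructure *it = pNext; it != NULL; it = it->pNext) {
        if (it->sType != SKETCH_STYPE_LOADER_INSTANCE_CREATE_INFO) {
            continue; /* unrelated extension struct */
        }
        const SketchLayerInstanceCreateInfo *info =
            (const SketchLayerInstanceCreateInfo *)it;
        if (info->function == SKETCH_LOADER_LAYER_CREATE_DEVICE_CALLBACK) {
            return info;
        }
    }
    return NULL;
}
```

A layer would store the two function pointers it finds and use them instead of the exported vkCreateDevice/vkDestroyDevice, and it should handle the NULL case, since an older loader may not supply the entry at all.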
Here is a real stacktrace from my application:
Thread 15 (Thread 0x7fffabbff6c0 (LWP 5559) "vo"):
#0 0x00007ffff3ee12a0 in ?? () from /usr/lib/libc.so.6
#1 0x00007ffff3ee74e2 in pthread_mutex_lock () from /usr/lib/libc.so.6
#2 0x00007ffff40fb3dc in vkDestroyDevice () from /usr/lib/libvulkan.so.1
#3 0x00007fffaae468d0 in LSFG::Core::Device::Device(LSFG::Core::Instance const&, unsigned long)::$_0::operator()(VkDevice_T**) const (this=0x7fffa42ba110, device=0x7fffa67aba20) at /home/pancake/dev/c++/lsfg-vk/lsfg-vk-common/src/core/device.cpp:111
(... more destructor stuff ...)
#16 0x00007ffff3e91671 in __cxa_finalize () from /usr/lib/libc.so.6
#17 0x00007fffaad240c8 in ?? () from /home/pancake/.local/share/vulkan/implicit_layer.d/../../../lib/liblsfg-vk.so
#18 0x00007fffa4256350 in ?? ()
#19 0x00007ffff7fc7fc2 in ?? () from /lib64/ld-linux-x86-64.so.2
#20 0x00007ffff7fc844e in _dl_catch_exception () from /lib64/ld-linux-x86-64.so.2
#21 0x00007ffff7fc8b95 in ?? () from /lib64/ld-linux-x86-64.so.2
#22 0x00007ffff7fc946a in ?? () from /lib64/ld-linux-x86-64.so.2
#23 0x00007ffff7fc83c1 in _dl_catch_exception () from /lib64/ld-linux-x86-64.so.2
#24 0x00007ffff7fc84e3 in ?? () from /lib64/ld-linux-x86-64.so.2
#25 0x00007ffff3ee00a7 in ?? () from /usr/lib/libc.so.6
#26 0x00007ffff3edfdd6 in dlclose () from /usr/lib/libc.so.6
#27 0x00007ffff40f5619 in ?? () from /usr/lib/libvulkan.so.1
#28 0x00007ffff40fae31 in vkDestroyInstance () from /usr/lib/libvulkan.so.1
#29 0x00007ffff6389fa8 in pl_vk_inst_destroy () from /usr/lib/libplacebo.so.351
#30 0x00005555556cc927 in ?? ()
#31 0x00005555556ccf4e in ?? ()
#32 0x0000555555648d6a in ?? ()
#33 0x000055555566c07e in ?? ()
#34 0x00007ffff3ee40b9 in ?? () from /usr/lib/libc.so.6
#35 0x00007ffff3f6438c in ?? () from /usr/lib/libc.so.6
What is noteworthy about this stack trace is that the issue only becomes apparent on cleanup, because instance and device creation is deferred to the first swapchain creation, where the mutex is not held (this was the original workaround on my side). In my example, the application's call to vkDestroyInstance unloads all loaded layers, which runs the C++ destructors of my device and instance. Those destructors call vkDestroyDevice through the trampoline, which freezes because vkDestroyInstance still holds the mutex.
The Vulkan specification for vkDestroyInstance only states "Host access to instance must be externally synchronized".
Perhaps the mutex should be unlocked during calls to foreign code (this being the layer functions and potentially more, I'm not sure). Or perhaps the mutex should notice that it is being called on the same thread and allow re-entry?
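To illustrate the second idea, here is a minimal sketch of a mutex that allows re-entry from the thread that already owns it, using a POSIX recursive mutex. The names `sketch_recursive_lock` and `sketch_reenter` are made up for the example, and I am not claiming the loader could switch to this safely: a re-entrant call would still see whatever global state the outer call was in the middle of modifying.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a re-entrant version of the loader lock. */
static pthread_mutex_t sketch_recursive_lock;

static void sketch_init_recursive_lock(void) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    /* A recursive mutex keeps an ownership count per thread. */
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    assert(rc == 0);
    rc = pthread_mutex_init(&sketch_recursive_lock, &attr);
    assert(rc == 0);
    (void)rc;
    pthread_mutexattr_destroy(&attr);
}

/* Acquire the lock, re-enter `depth - 1` more times on the same thread,
 * and return how many nested acquisitions succeeded. */
static int sketch_reenter(int depth) {
    if (pthread_mutex_lock(&sketch_recursive_lock) != 0) {
        return 0;
    }
    int acquired = 1;
    if (depth > 1) {
        acquired += sketch_reenter(depth - 1);
    }
    /* Each lock needs a matching unlock before other threads can enter. */
    pthread_mutex_unlock(&sketch_recursive_lock);
    return acquired;
}
```

With this, `sketch_reenter(3)` returns 3 instead of blocking on the second acquisition, and the lock is fully released afterwards because every lock is paired with an unlock.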