Renderer GPU(vulkan): stops rendering after window resize
With the following error log:
ERROR: Failed to acquire swapchain texture:
ERROR: Failed to acquire swapchain texture:
ERROR: Failed to acquire swapchain texture:
The rest of the application continues ~fine~; SDL events seem to stop.
On https://github.com/libsdl-org/SDL/releases/tag/preview-3.1.3 (9dd8859240703d886941733ad32c1dc6f50d64f0) it still worked fine.
edit: this seems to have gradually degraded
- 9dd8859240703d886941733ad32c1dc6f50d64f0 still worked fine
- sometime after, it started to print "ERROR: Failed to acquire swapchain texture:" while resizing and look funny, but still kept working afterwards
- afdf325fb4090e93a124519d1a3bc1fbe0ba9025 breaks it totally
edit2: this is on a linux x11 NVIDIA device (555.58.02)
Force quitting hung the whole x11 session for 1sec.
$ git bisect good
afdf325fb4090e93a124519d1a3bc1fbe0ba9025 is the first bad commit
commit afdf325fb4090e93a124519d1a3bc1fbe0ba9025
Author: Evan Hemsley <[email protected]>
Date: Mon Sep 30 10:23:19 2024 -0700
GPU: Add swapchain dimension out params (#11003)
include/SDL3/SDL_gpu.h | 22 ++-
src/dynapi/SDL_dynapi_procs.h | 2 +-
src/gpu/SDL_gpu.c | 12 +-
src/gpu/SDL_sysgpu.h | 4 +-
src/gpu/d3d11/SDL_gpu_d3d11.c | 23 ++-
src/gpu/d3d12/SDL_gpu_d3d12.c | 26 ++-
src/gpu/metal/SDL_gpu_metal.m | 16 +-
src/gpu/vulkan/SDL_gpu_vulkan.c | 492 +++++++++++++++++++++++------------------------
src/render/gpu/SDL_render_gpu.c | 19 +-
test/testgpu_simple_clear.c | 2 +-
test/testgpu_spinning_cube.c | 6 +-
11 files changed, 343 insertions(+), 281 deletions(-)
#11003
A few things that'll help us diagnose:
- Is there a specific test app that exhibits this behavior?
- Does this also happen via Xwayland?
- Do the Vulkan validation layers point to anything in particular? The SDL examples should enable them in debug mode, provided the system has them installed.
A few things that'll help us diagnose:
- Is there a specific test app that exhibits this behavior?
The test/testgpu_spinning_cube simply exits as soon as the first "Failed to acquire swapchain texture:" is encountered. I checked, and be401dd1e35c08baaf44000f031b81951698fc10 introduced this behavior. This seems to be intended, but I am not sure that what is reported is actually an error.
The test/testnative executable, however, exhibits my issue perfectly. Just resize it until it hangs the screen or stops rendering (but keeps running).
https://github.com/user-attachments/assets/4fe02b0f-dda4-40f0-a620-41b24e9da039
(includes a lack of frames at the x11(?) freeze)
- Does this also happen via Xwayland?
- Do the Vulkan validation layers point to anything in particular? The SDL examples should enable them in debug mode, provided the system has them installed.
Not sure how to enable the validation layers, but I will keep trying. On my x11-nvidia nixos setup I am not comfortable switching to wayland yet, however that is on my longterm todo list :)
Curious if this is possibly related to #9698
On an AMD card on X11 I get errors and sometimes a validation layer message when resizing, then application continues fine:
ERROR: vkQueuePresentKHR VK_SUBOPTIMAL_KHR
ERROR: vkQueuePresentKHR VK_SUBOPTIMAL_KHR
ERROR: vkQueuePresentKHR VK_SUBOPTIMAL_KHR
VUID-VkSwapchainCreateInfoKHR-pNext-07781(ERROR / SPEC): msgNum: 1284057537 - Validation Error: [ VUID-VkSwapchainCreateInfoKHR-pNext-07781 ] | MessageID = 0x4c8929c1 | vkCreateSwapchainKHR(): pCreateInfo->imageExtent (width = 545, height = 462), which is outside the bounds returned by vkGetPhysicalDeviceSurfaceCapabilitiesKHR(): currentExtent = (width = 553, height = 468), minImageExtent = (width = 553, height = 468), maxImageExtent = (width = 553, height = 468). The Vulkan spec states: If a VkSwapchainPresentScalingCreateInfoEXT structure was not included in the pNext chain, or it is included and VkSwapchainPresentScalingCreateInfoEXT::scalingBehavior is zero then imageExtent must be between minImageExtent and maxImageExtent, inclusive, where minImageExtent and maxImageExtent are members of the VkSurfaceCapabilitiesKHR structure returned by vkGetPhysicalDeviceSurfaceCapabilitiesKHR for the surface (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-VkSwapchainCreateInfoKHR-pNext-07781)
Objects: 0
ERROR: vkQueuePresentKHR VK_SUBOPTIMAL_KHR
ERROR: vkQueuePresentKHR VK_SUBOPTIMAL_KHR
I think we ended up removing the extent checks because we thought the window events covered it, but it seems X11 has other ideas - I think all we need to revert from the bad commits is the removal of min/max size checks and this will work again.
This may have been fixed by https://github.com/libsdl-org/SDL/commit/6ae5666acf911d924e8deb6d5dba87c27a71f46c. Someone who can repro will have to confirm.
@thatcosmonaut I did check yesterday, but no change.
@Green-Sky Could you try testing this PR: https://github.com/libsdl-org/SDL/pull/11139
@thatcosmonaut the pr does not change the behavior.
Can confirm, issue is persisting for me on PopOS 22.04/Kernel 6.9.3-76060903-generic/X11/NVIDIA 560.35.03
We may need additional help with this one as I'm pretty sure all of us are on Wayland systems at this point, and I haven't seen this with Xwayland or Wayland in my own testing of FNA's swapchains. If any X-perts want to volunteer we'd really like to reassign this so cosmonaut can focus on threading and fragment storage writes.
I can reproduce with spinning cube in my Debian VM (which I don't think is using Wayland). After a few resizes it segfaults and my compositor seems to restart (the screen goes black and the journalctl log shows a bunch of XCB errors plus a bunch of hardware info dumps from kwin_x11).
testnative is fine though no matter how many times I resize it.
valgrind shows some errors:
==127318== Invalid read of size 8
==127318== at 0x4A10D7D: VULKAN_INTERNAL_DefragmentMemory (SDL_gpu_vulkan.c:10641)
==127318== by 0x4A10D7D: VULKAN_Submit (SDL_gpu_vulkan.c:10569)
==127318== by 0x10EEB1: Render (testgpu_spinning_cube.c:457)
==127318== by 0x10EEB1: loop (testgpu_spinning_cube.c:677)
==127318== by 0x10EEB1: loop (testgpu_spinning_cube.c:666)
==127318== by 0x10DDC4: main (testgpu_spinning_cube.c:745)
==127318== Address 0x5cf1480 is 16 bytes after a block of size 16 alloc'd
==127318== at 0x48407B4: malloc (vg_replace_malloc.c:381)
==127318== by 0x4929F92: SDL_malloc_REAL (SDL_malloc.c:6452)
==127318== by 0x4A0BDDB: VULKAN_INTERNAL_CreateUniformBuffer (SDL_gpu_vulkan.c:6802)
==127318== by 0x4A0BDDB: VULKAN_CreateDevice (SDL_gpu_vulkan.c:11681)
==127318== by 0x48CAAE4: SDL_CreateGPUDeviceWithProperties_REAL (SDL_gpu.c:529)
==127318== by 0x48CAB50: SDL_CreateGPUDevice_REAL (SDL_gpu.c:507)
==127318== by 0x10E5BF: init_render_state (testgpu_spinning_cube.c:514)
==127318== by 0x10DDA0: main (testgpu_spinning_cube.c:734)
==127318==
==127318== Invalid read of size 1
==127318== at 0x4846782: strlen (vg_replace_strmem.c:494)
==127318== by 0x121B61BE: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan_lvp.so)
==127318== by 0x579587B: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan.so.1.3.239)
==127318== by 0x577F577: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan.so.1.3.239)
==127318== by 0x4A072DE: VULKAN_INTERNAL_CreateBuffer (SDL_gpu_vulkan.c:4166)
==127318== by 0x4A10D95: VULKAN_INTERNAL_DefragmentMemory (SDL_gpu_vulkan.c:10641)
==127318== by 0x4A10D95: VULKAN_Submit (SDL_gpu_vulkan.c:10569)
==127318== by 0x10EEB1: Render (testgpu_spinning_cube.c:457)
==127318== by 0x10EEB1: loop (testgpu_spinning_cube.c:677)
==127318== by 0x10EEB1: loop (testgpu_spinning_cube.c:666)
==127318== by 0x10DDC4: main (testgpu_spinning_cube.c:745)
==127318== Address 0x81 is not stack'd, malloc'd or (recently) free'd
==127318==
==127318==
==127318== Process terminating with default action of signal 11 (SIGSEGV)
==127318== Access not within mapped region at address 0x81
==127318== at 0x4846782: strlen (vg_replace_strmem.c:494)
==127318== by 0x121B61BE: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan_lvp.so)
==127318== by 0x579587B: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan.so.1.3.239)
==127318== by 0x577F577: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan.so.1.3.239)
==127318== by 0x4A072DE: VULKAN_INTERNAL_CreateBuffer (SDL_gpu_vulkan.c:4166)
==127318== by 0x4A10D95: VULKAN_INTERNAL_DefragmentMemory (SDL_gpu_vulkan.c:10641)
==127318== by 0x4A10D95: VULKAN_Submit (SDL_gpu_vulkan.c:10569)
==127318== by 0x10EEB1: Render (testgpu_spinning_cube.c:457)
==127318== by 0x10EEB1: loop (testgpu_spinning_cube.c:677)
==127318== by 0x10EEB1: loop (testgpu_spinning_cube.c:666)
==127318== by 0x10DDC4: main (testgpu_spinning_cube.c:745)
==127318== If you believe this happened as a result of a stack
==127318== overflow in your program's main thread (unlikely but
==127318== possible), you can try to increase the size of the
==127318== main thread stack using the --main-stacksize= flag.
==127318== The main thread stack size used in this run was 8388608.
==127318==
==127318== HEAP SUMMARY:
==127318== in use at exit: 108,576,693 bytes in 9,778 blocks
==127318== total heap usage: 47,601 allocs, 37,823 frees, 189,414,856 bytes allocated
==127318==
==127318== LEAK SUMMARY:
==127318== definitely lost: 48 bytes in 1 blocks
==127318== indirectly lost: 0 bytes in 0 blocks
==127318== possibly lost: 814,704 bytes in 2,347 blocks
==127318== still reachable: 107,761,941 bytes in 7,430 blocks
==127318== suppressed: 0 bytes in 0 blocks
==127318== Rerun with --leak-check=full to see details of leaked memory
==127318==
==127318== For lists of detected and suppressed errors, rerun with: -s
==127318== ERROR SUMMARY: 6 errors from 4 contexts (suppressed: 0 from 0)
testnative is fine though no matter how many times I resize it.
Yes, SDL_Render no longer defaults to the SDL_GPU backend.
I worked with some of the devs on discord to dig in a little further. A few discoveries:
- There's a "// little hack for defrag" which uses ->container to smuggle a VulkanUniformBuffer *, which is technically maybe almost sort of safe except not really. Replacing that with a proper implementation gets to a segfault inside of defragmentation.
- The segfault inside of defragmentation is here inside of DefragmentMemory:

And based on prodding it with gdb, it seems like the container is a bad pointer while other fields, e.g. the buffer, are valid. I suspect this is related to how the window's containers are managed specially vs. other containers; perhaps it's a dangling pointer to containers from the window before the resize that are no longer a live allocation. I've handed this off to the others to dig in further.
Thread 1 "testgpu_spinnin" received signal SIGSEGV, Segmentation fault.
VULKAN_INTERNAL_DefragmentMemory (renderer=0x55555568f8a0) at /home/kate/Projects/SDL/src/gpu/vulkan/SDL_gpu_vulkan.c:10648
10648 newBuffer = VULKAN_INTERNAL_CreateBuffer(
(gdb) info locals
allocation = 0x555555847e80
currentRegion = 0x555555a95280
newBuffer = 0x7ffff7ee8d7d <VULKAN_INTERNAL_PerformPendingDestroys+1527>
newTexture = 0x555cdd84
bufferCopy = {srcOffset = 140737488346280, dstOffset = 140737352865038, size = 140737488344816}
imageCopy = {srcSubresource = {aspectMask = 42893, mipLevel = 0, baseArrayLayer = 4113016360, layerCount = 0}, srcOffset = {x = -10656, y = 32767, z = -135368401}, dstSubresource = {
aspectMask = 32767, mipLevel = 1434737952, baseArrayLayer = 21845, layerCount = 1432942752}, dstOffset = {x = 21845, y = -9048, z = 32767}, extent = {width = 4159477006, height = 32767,
depth = 1434689552}}
commandBuffer = 0x555555b889e0
srcSubresource = 0x0
dstSubresource = 0x555555a46840
i = 0
subresourceIndex = 4294967295
__func__ = "VULKAN_INTERNAL_DefragmentMemory"
(gdb) p $_siginfo._sifields._sigfault.si_addr
$4 = (void *) 0x555000fc09dc
(gdb) print *currentRegion->vulkanBuffer->container
Cannot access memory at address 0x555000fc09bc
(gdb) print currentRegion->vulkanBuffer->buffer
$5 = (VkBuffer) 0x5555558375f0
(gdb) print *currentRegion->vulkanBuffer->buffer
$6 = <incomplete type>
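As a rough illustration of why smuggling one pointer type through a field typed as another is fragile, and of one way a back-pointer can be made checkable, here is a small sketch in C. This is not the fix that landed in SDL, and none of these names exist in SDL: `Buffer`, `BufferContainer`, `UniformBuffer`, `OwnerKind` and the helper functions are all hypothetical. The back-pointer is tagged with the kind of owner it refers to, and it is cleared when the owner goes away, so a pass such as defragmentation can test for a live container instead of dereferencing whatever happens to be stored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical owner types; stand-ins for a buffer container and a
 * uniform buffer wrapper. */
typedef struct BufferContainer { int id; } BufferContainer;
typedef struct UniformBuffer { int id; } UniformBuffer;

/* Which kind of object, if any, the back-pointer refers to. */
typedef enum {
    OWNER_NONE,
    OWNER_CONTAINER,
    OWNER_UNIFORM
} OwnerKind;

/* Hypothetical buffer with a tagged back-pointer to its owner. */
typedef struct Buffer {
    OwnerKind ownerKind;
    union {
        BufferContainer *container;
        UniformBuffer *uniform;
    } owner;
} Buffer;

static void buffer_set_container(Buffer *b, BufferContainer *c)
{
    b->ownerKind = OWNER_CONTAINER;
    b->owner.container = c;
}

static void buffer_set_uniform(Buffer *b, UniformBuffer *u)
{
    b->ownerKind = OWNER_UNIFORM;
    b->owner.uniform = u;
}

/* Call this before the owner is destroyed so no dangling pointer remains. */
static void buffer_clear_owner(Buffer *b)
{
    b->ownerKind = OWNER_NONE;
    b->owner.container = NULL;
}

/* Returns the container only if the buffer is really owned by one. */
static BufferContainer *buffer_get_container(const Buffer *b)
{
    return b->ownerKind == OWNER_CONTAINER ? b->owner.container : NULL;
}

/* Returns the uniform buffer only if the buffer is really owned by one. */
static UniformBuffer *buffer_get_uniform(const Buffer *b)
{
    return b->ownerKind == OWNER_UNIFORM ? b->owner.uniform : NULL;
}
```

A defragmentation pass written against this shape would call `buffer_get_container` and skip the buffer when it returns NULL, rather than reading through a pointer that may belong to a different type or to an allocation that was released during a resize.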
Okay, I've pushed @kg's fix, which seems to resolve the memory access issues, and then one on top of it which fixes swapchain texture acquisition over here for me. Please retest the latest in main asap if you were having problems with this, it's our last bug before shipping 3.2.0! :)
@kg had some extra difficulties she resolved--we think they were exposed by running in a virtual machine--plus some other good fixes, in that last commit.
I can confirm, the spinning cube no longer crashes or hangs when resizing on latest master 🥳 .
I double checked with asan enabled.
I also bisected and 6d5815d was the one that fixed it for me.