SDL Renderer GPU(vulkan): stops rendering after window resize

With the following error log:

ERROR: Failed to acquire swapchain texture:
ERROR: Failed to acquire swapchain texture:
ERROR: Failed to acquire swapchain texture:

The rest of the application continues, but SDL events seem to stop.

on https://github.com/libsdl-org/SDL/releases/tag/preview-3.1.3 9dd8859240703d886941733ad32c1dc6f50d64f0 still worked fine

edit: this seems to have gradually degraded

9dd8859240703d886941733ad32c1dc6f50d64f0 still worked fine
sometime after that it started to print "ERROR: Failed to acquire swapchain texture:" while resizing and to look funny, but it still kept working afterwards
afdf325fb4090e93a124519d1a3bc1fbe0ba9025 breaks it totally

edit2: this is on a linux x11 NVIDIA device (555.58.02)

Oct 05 '24 12:10 Green-Sky

Force quitting hung the whole x11 session for 1sec.

Oct 05 '24 12:10 Green-Sky

$ git bisect good
afdf325fb4090e93a124519d1a3bc1fbe0ba9025 is the first bad commit
commit afdf325fb4090e93a124519d1a3bc1fbe0ba9025
Author: Evan Hemsley <[email protected]>
Date:   Mon Sep 30 10:23:19 2024 -0700

    GPU: Add swapchain dimension out params (#11003)

 include/SDL3/SDL_gpu.h          |  22 ++-
 src/dynapi/SDL_dynapi_procs.h   |   2 +-
 src/gpu/SDL_gpu.c               |  12 +-
 src/gpu/SDL_sysgpu.h            |   4 +-
 src/gpu/d3d11/SDL_gpu_d3d11.c   |  23 ++-
 src/gpu/d3d12/SDL_gpu_d3d12.c   |  26 ++-
 src/gpu/metal/SDL_gpu_metal.m   |  16 +-
 src/gpu/vulkan/SDL_gpu_vulkan.c | 492 +++++++++++++++++++++++------------------------
 src/render/gpu/SDL_render_gpu.c |  19 +-
 test/testgpu_simple_clear.c     |   2 +-
 test/testgpu_spinning_cube.c    |   6 +-
 11 files changed, 343 insertions(+), 281 deletions(-)

#11003

Oct 05 '24 13:10 Green-Sky

A few things that'll help us diagnose:

Is there a specific test app that exhibits this behavior?
Does this also happen via Xwayland?
Do the Vulkan validation layers point to anything in particular? The SDL examples should enable them in debug mode, provided the system has them installed.

Oct 05 '24 15:10 flibitijibibo

A few things that'll help us diagnose:

Is there a specific test app that exhibits this behavior?

The test/testgpu_spinning_cube simply exits as soon as first

Failed to acquire swapchain texture:

is encountered. I checked and be401dd1e35c08baaf44000f031b81951698fc10 introduced this behavior. This seems to be intended, but I am not sure that what is reported is actually an error.

The test/testnative executable however exhibits my issue perfectly. Just resize it until it hangs the screen or stops rendering (but keeps running).

https://github.com/user-attachments/assets/4fe02b0f-dda4-40f0-a620-41b24e9da039

(includes a lack of frames at the x11(?) freeze)

Does this also happen via Xwayland?

Do the Vulkan validation layers point to anything in particular? The SDL examples should enable them in debug mode, provided the system has them installed.

Not sure how to enable the validation layers, but I will keep trying. On my x11-nvidia NixOS setup I am not comfortable switching to wayland yet, however that is on my long-term todo list :)

Oct 05 '24 19:10 Green-Sky

Curious if this is possibly related to #9698

Oct 05 '24 19:10 thatcosmonaut

On an AMD card on X11 I get errors and sometimes a validation layer message when resizing, then application continues fine:

ERROR: vkQueuePresentKHR VK_SUBOPTIMAL_KHR
ERROR: vkQueuePresentKHR VK_SUBOPTIMAL_KHR
ERROR: vkQueuePresentKHR VK_SUBOPTIMAL_KHR
VUID-VkSwapchainCreateInfoKHR-pNext-07781(ERROR / SPEC): msgNum: 1284057537 - Validation Error: [ VUID-VkSwapchainCreateInfoKHR-pNext-07781 ] | MessageID = 0x4c8929c1 | vkCreateSwapchainKHR(): pCreateInfo->imageExtent (width = 545, height = 462), which is outside the bounds returned by vkGetPhysicalDeviceSurfaceCapabilitiesKHR(): currentExtent = (width = 553, height = 468), minImageExtent = (width = 553, height = 468), maxImageExtent = (width = 553, height = 468). The Vulkan spec states: If a VkSwapchainPresentScalingCreateInfoEXT structure was not included in the pNext chain, or it is included and VkSwapchainPresentScalingCreateInfoEXT::scalingBehavior is zero then imageExtent must be between minImageExtent and maxImageExtent, inclusive, where minImageExtent and maxImageExtent are members of the VkSurfaceCapabilitiesKHR structure returned by vkGetPhysicalDeviceSurfaceCapabilitiesKHR for the surface (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-VkSwapchainCreateInfoKHR-pNext-07781)
    Objects: 0
ERROR: vkQueuePresentKHR VK_SUBOPTIMAL_KHR
ERROR: vkQueuePresentKHR VK_SUBOPTIMAL_KHR

Oct 06 '24 11:10 meyraud705

I think we ended up removing the extent checks because we thought the window events covered it, but it seems X11 has other ideas - I think all we need to revert from the bad commits is the removal of min/max size checks and this will work again.
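
For illustration, the kind of min/max extent check discussed here could look like the sketch below. It is not the code that was removed from or restored to SDL_gpu_vulkan.c. The Extent2D and SurfaceCaps types are hypothetical stand-ins for VkExtent2D and the extent fields of VkSurfaceCapabilitiesKHR, and clamp_swapchain_extent is a made-up helper name, so that the example compiles without the Vulkan headers.

```c
#include <stdint.h>

/* Hypothetical stand-ins for VkExtent2D and the extent fields of
 * VkSurfaceCapabilitiesKHR (assumption: only these fields matter here). */
typedef struct Extent2D {
    uint32_t width;
    uint32_t height;
} Extent2D;

typedef struct SurfaceCaps {
    Extent2D minImageExtent;
    Extent2D maxImageExtent;
} SurfaceCaps;

/* Clamp a single dimension into the inclusive range [lo, hi]. */
static uint32_t clamp_u32(uint32_t value, uint32_t lo, uint32_t hi)
{
    if (value < lo) {
        return lo;
    }
    if (value > hi) {
        return hi;
    }
    return value;
}

/* Clamp the size reported by the window system to the range the surface
 * currently accepts, so the swapchain create info stays within bounds. */
static Extent2D clamp_swapchain_extent(Extent2D requested, const SurfaceCaps *caps)
{
    Extent2D result;
    result.width = clamp_u32(requested.width,
                             caps->minImageExtent.width,
                             caps->maxImageExtent.width);
    result.height = clamp_u32(requested.height,
                              caps->minImageExtent.height,
                              caps->maxImageExtent.height);
    return result;
}
```

With the values from the validation message above, a requested extent of 545x462 against min/max bounds of 553x468 would be clamped to 553x468, which satisfies VUID-VkSwapchainCreateInfoKHR-pNext-07781.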

Oct 06 '24 15:10 flibitijibibo

This may have been fixed by https://github.com/libsdl-org/SDL/commit/6ae5666acf911d924e8deb6d5dba87c27a71f46c. Someone who can repro will have to confirm.

Oct 09 '24 22:10 thatcosmonaut

@thatcosmonaut I did check yesterday, but no change.

Oct 09 '24 23:10 Green-Sky

@Green-Sky Could you try testing this PR: https://github.com/libsdl-org/SDL/pull/11139

Oct 09 '24 23:10 thatcosmonaut

@thatcosmonaut the pr does not change the behavior.

Oct 10 '24 09:10 Green-Sky

@thatcosmonaut the pr does not change the behavior.

Can confirm, issue is persisting for me on PopOS 22.04/Kernel 6.9.3-76060903-generic/X11/NVIDIA 560.35.03

Dec 19 '24 10:12 KitsuneAlex

We may need additional help with this one as I'm pretty sure all of us are on Wayland systems at this point, and I haven't seen this with Xwayland or Wayland in my own testing of FNA's swapchains. If any X-perts want to volunteer we'd really like to reassign this so cosmonaut can focus on threading and fragment storage writes.

Jan 09 '25 19:01 flibitijibibo

I can reproduce with spinning cube in my debian VM (which i don't think is using wayland). After a few resizes it segfaults and my compositor seems to restart (screen goes black and journalctl log shows a bunch of XCB errors + a bunch of hardware info dumps from kwin_x11).

testnative is fine though no matter how many times I resize it.

Jan 21 '25 00:01 kg

valgrind shows some errors:

==127318== Invalid read of size 8
==127318==    at 0x4A10D7D: VULKAN_INTERNAL_DefragmentMemory (SDL_gpu_vulkan.c:10641)
==127318==    by 0x4A10D7D: VULKAN_Submit (SDL_gpu_vulkan.c:10569)
==127318==    by 0x10EEB1: Render (testgpu_spinning_cube.c:457)
==127318==    by 0x10EEB1: loop (testgpu_spinning_cube.c:677)
==127318==    by 0x10EEB1: loop (testgpu_spinning_cube.c:666)
==127318==    by 0x10DDC4: main (testgpu_spinning_cube.c:745)
==127318==  Address 0x5cf1480 is 16 bytes after a block of size 16 alloc'd
==127318==    at 0x48407B4: malloc (vg_replace_malloc.c:381)
==127318==    by 0x4929F92: SDL_malloc_REAL (SDL_malloc.c:6452)
==127318==    by 0x4A0BDDB: VULKAN_INTERNAL_CreateUniformBuffer (SDL_gpu_vulkan.c:6802)
==127318==    by 0x4A0BDDB: VULKAN_CreateDevice (SDL_gpu_vulkan.c:11681)
==127318==    by 0x48CAAE4: SDL_CreateGPUDeviceWithProperties_REAL (SDL_gpu.c:529)
==127318==    by 0x48CAB50: SDL_CreateGPUDevice_REAL (SDL_gpu.c:507)
==127318==    by 0x10E5BF: init_render_state (testgpu_spinning_cube.c:514)
==127318==    by 0x10DDA0: main (testgpu_spinning_cube.c:734)
==127318== 
==127318== Invalid read of size 1
==127318==    at 0x4846782: strlen (vg_replace_strmem.c:494)
==127318==    by 0x121B61BE: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan_lvp.so)
==127318==    by 0x579587B: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan.so.1.3.239)
==127318==    by 0x577F577: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan.so.1.3.239)
==127318==    by 0x4A072DE: VULKAN_INTERNAL_CreateBuffer (SDL_gpu_vulkan.c:4166)
==127318==    by 0x4A10D95: VULKAN_INTERNAL_DefragmentMemory (SDL_gpu_vulkan.c:10641)
==127318==    by 0x4A10D95: VULKAN_Submit (SDL_gpu_vulkan.c:10569)
==127318==    by 0x10EEB1: Render (testgpu_spinning_cube.c:457)
==127318==    by 0x10EEB1: loop (testgpu_spinning_cube.c:677)
==127318==    by 0x10EEB1: loop (testgpu_spinning_cube.c:666)
==127318==    by 0x10DDC4: main (testgpu_spinning_cube.c:745)
==127318==  Address 0x81 is not stack'd, malloc'd or (recently) free'd
==127318== 
==127318== 
==127318== Process terminating with default action of signal 11 (SIGSEGV)
==127318==  Access not within mapped region at address 0x81
==127318==    at 0x4846782: strlen (vg_replace_strmem.c:494)
==127318==    by 0x121B61BE: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan_lvp.so)
==127318==    by 0x579587B: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan.so.1.3.239)
==127318==    by 0x577F577: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan.so.1.3.239)
==127318==    by 0x4A072DE: VULKAN_INTERNAL_CreateBuffer (SDL_gpu_vulkan.c:4166)
==127318==    by 0x4A10D95: VULKAN_INTERNAL_DefragmentMemory (SDL_gpu_vulkan.c:10641)
==127318==    by 0x4A10D95: VULKAN_Submit (SDL_gpu_vulkan.c:10569)
==127318==    by 0x10EEB1: Render (testgpu_spinning_cube.c:457)
==127318==    by 0x10EEB1: loop (testgpu_spinning_cube.c:677)
==127318==    by 0x10EEB1: loop (testgpu_spinning_cube.c:666)
==127318==    by 0x10DDC4: main (testgpu_spinning_cube.c:745)
==127318==  If you believe this happened as a result of a stack
==127318==  overflow in your program's main thread (unlikely but
==127318==  possible), you can try to increase the size of the
==127318==  main thread stack using the --main-stacksize= flag.
==127318==  The main thread stack size used in this run was 8388608.
==127318== 
==127318== HEAP SUMMARY:
==127318==     in use at exit: 108,576,693 bytes in 9,778 blocks
==127318==   total heap usage: 47,601 allocs, 37,823 frees, 189,414,856 bytes allocated
==127318== 
==127318== LEAK SUMMARY:
==127318==    definitely lost: 48 bytes in 1 blocks
==127318==    indirectly lost: 0 bytes in 0 blocks
==127318==      possibly lost: 814,704 bytes in 2,347 blocks
==127318==    still reachable: 107,761,941 bytes in 7,430 blocks
==127318==         suppressed: 0 bytes in 0 blocks
==127318== Rerun with --leak-check=full to see details of leaked memory
==127318== 
==127318== For lists of detected and suppressed errors, rerun with: -s
==127318== ERROR SUMMARY: 6 errors from 4 contexts (suppressed: 0 from 0)

Jan 21 '25 00:01 kg

testnative is fine though no matter how many times I resize it.

Yes, SDL_Render no longer defaults to the SDL_GPU backend.

Jan 21 '25 01:01 Green-Sky

I worked with some of the devs on discord to dig in a little further. A few discoveries:

There's a // little hack for defrag which uses ->container to smuggle a VulkanUniformBuffer *, which is technically maybe almost sort of safe except not really. Replacing that with a proper implementation gets to a segfault inside of defragmentation.
The segfault inside of defragmentation is inside of DefragmentMemory, at the VULKAN_INTERNAL_CreateBuffer call shown in the gdb session below. Based on prodding it with gdb, it seems like the container is a bad pointer while other fields, e.g. the buffer, are valid. I suspect this is related to how the window's containers are managed specially vs other containers; perhaps it's a dangling pointer to containers from the window before the resize that are no longer a live allocation. I've handed this off to the others to dig in further.

Thread 1 "testgpu_spinnin" received signal SIGSEGV, Segmentation fault.
VULKAN_INTERNAL_DefragmentMemory (renderer=0x55555568f8a0) at /home/kate/Projects/SDL/src/gpu/vulkan/SDL_gpu_vulkan.c:10648
10648               newBuffer = VULKAN_INTERNAL_CreateBuffer(
(gdb) info locals
allocation = 0x555555847e80
currentRegion = 0x555555a95280
newBuffer = 0x7ffff7ee8d7d <VULKAN_INTERNAL_PerformPendingDestroys+1527>
newTexture = 0x555cdd84
bufferCopy = {srcOffset = 140737488346280, dstOffset = 140737352865038, size = 140737488344816}
imageCopy = {srcSubresource = {aspectMask = 42893, mipLevel = 0, baseArrayLayer = 4113016360, layerCount = 0}, srcOffset = {x = -10656, y = 32767, z = -135368401}, dstSubresource = {
    aspectMask = 32767, mipLevel = 1434737952, baseArrayLayer = 21845, layerCount = 1432942752}, dstOffset = {x = 21845, y = -9048, z = 32767}, extent = {width = 4159477006, height = 32767, 
    depth = 1434689552}}
commandBuffer = 0x555555b889e0
srcSubresource = 0x0
dstSubresource = 0x555555a46840
i = 0
subresourceIndex = 4294967295
__func__ = "VULKAN_INTERNAL_DefragmentMemory"

(gdb) p $_siginfo._sifields._sigfault.si_addr
$4 = (void *) 0x555000fc09dc
(gdb) print *currentRegion->vulkanBuffer->container
Cannot access memory at address 0x555000fc09bc
(gdb) print currentRegion->vulkanBuffer->buffer
$5 = (VkBuffer) 0x5555558375f0
(gdb) print *currentRegion->vulkanBuffer->buffer
$6 = <incomplete type>
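
To make the suspected failure mode concrete, here is a minimal sketch of a buffer that keeps a back-pointer to its owning container. It is an illustration only, not SDL's data structures and not the fix that was eventually pushed. Every name in it (Buffer, Container, destroy_container, count_relocatable) is hypothetical. The idea is that if a container is freed while one of its buffers is still listed for defragmentation, the back-pointer dangles unless it is cleared first.

```c
#include <stddef.h>
#include <stdlib.h>

/* All types and names here are hypothetical and only loosely mirror the
 * buffer/container relationship described in the comment above. */
typedef struct Container Container;

typedef struct Buffer {
    Container *container; /* back-pointer to the owner, NULL once orphaned */
} Buffer;

struct Container {
    Buffer *activeBuffer;
};

/* Free a heap-allocated container, clearing the buffer's back-pointer
 * first so that no later pass can dereference freed memory through it. */
static void destroy_container(Container *c)
{
    if (c->activeBuffer != NULL) {
        c->activeBuffer->container = NULL;
    }
    free(c);
}

/* Count the buffers a defragmentation pass could safely relocate:
 * only those whose owning container is still alive. */
static size_t count_relocatable(Buffer *const *buffers, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (buffers[i]->container != NULL) {
            count++;
        }
    }
    return count;
}
```

Without the reset in destroy_container, a pass like count_relocatable would read a stale container pointer, which is consistent with the "Cannot access memory" result gdb reports for currentRegion->vulkanBuffer->container above.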

Jan 21 '25 01:01 kg

Okay, I've pushed @kg's fix, which seems to resolve the memory access issues, and then one on top of it which fixes swapchain texture acquisition over here for me. Please retest the latest in main asap if you were having problems with this, it's our last bug before shipping 3.2.0! :)

Jan 21 '25 03:01 icculus

@kg had some extra difficulties she resolved--we think they were exposed by running in a virtual machine--plus some other good fixes, in that last commit.

Jan 21 '25 04:01 icculus

I can confirm, the spinning cube no longer crashes or hangs when resizing on latest master 🥳.

I double checked with asan enabled.

I also bisected and 6d5815d was the one that fixed it for me.

Jan 21 '25 11:01 Green-Sky