Vulkan: Random crashes during Editor startup

Open galibzon opened this issue 1 year ago • 10 comments

Describe the bug Sometimes the Editor crashes during startup when Vulkan is the active RHI.
The crash appears to occur in roughly 1 out of 5 launches.

Assets required N/A

Steps to reproduce Use AutomatedTesting as game project in profile configuration. Start the Editor with command line -rhi=vulkan.
I also added -rhi-device-validation=enable, but it did not make a difference in the crash report.
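
To put a number on the crash rate, the launches can be scripted. The following is a minimal sketch, not an O3DE tool: the `-rhi=vulkan` and `-rhi-device-validation=enable` arguments come from the steps above, while the helper names, the 60-second startup window, and the idea of treating a non-zero exit code as a crash are assumptions. Whether an assert terminates the process or opens a dialog depends on the local setup.

```python
import subprocess


def build_editor_command(editor_path, rhi="vulkan", device_validation=False):
    """Return the argument list for launching the Editor with a given RHI."""
    cmd = [editor_path, f"-rhi={rhi}"]
    if device_validation:
        cmd.append("-rhi-device-validation=enable")
    return cmd


def launch_once(cmd, startup_seconds=60):
    """Launch the Editor; return True if it survived the startup window.

    Assumption: a crash during startup ends the process with a non-zero
    exit code before `startup_seconds` have elapsed.
    """
    proc = subprocess.Popen(cmd)
    try:
        exit_code = proc.wait(timeout=startup_seconds)
    except subprocess.TimeoutExpired:
        # Still running after the window: count it as a clean start, then stop it.
        proc.kill()
        proc.wait()
        return True
    return exit_code == 0


def count_crashes(cmd, runs, launcher=launch_once):
    """Call the launcher `runs` times and return how many launches failed."""
    return sum(0 if launcher(cmd) else 1 for _ in range(runs))


# Example (the Editor path is a placeholder for the local build output):
#   cmd = build_editor_command(r"C:\path\to\Editor.exe", device_validation=True)
#   print(count_crashes(cmd, runs=20), "crashes out of 20 launches")
```

The `launcher` parameter is injectable so the counting logic can be exercised without starting the Editor.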

Expected behavior The Editor starts properly and you can select any level to open

Actual behavior The Editor crashes with an assert during startup. We cannot even open a level from there.

Screenshots/Video N/A

Found in Branch commit https://github.com/o3de/o3de/commit/e7da962dffdd91f3b9367fc8552db8b4d968ee0c (HEAD -> development, origin/development) Author: Gene Walters Date: Tue May 7 13:58:15 2024 -0700

** Branch Used When this Issue was initially reported ** o3de development branch: commit 7cf30b95bd4fd7a67fab026eb34c039287a7bef9 (HEAD -> development, origin/development) Merge: d56effbde4 42a2cb17e2 Author: antonmic Date: Thu Dec 7 11:40:07 2023 -0800

Desktop/Device (please complete the following information):

  • Device: PC
  • OS: Windows
  • Version 11
  • CPU AMD Ryzen Threadripper 3970X 32-Core Processor 3.70 GHz
  • GPU NVIDIA RTX 4090 (24GB VRAM). Studio Driver Version 546.01
  • Memory 128GB
  • Storage: WD_BLACK 2TB SN850 NVMe M.2 SSD PCIe 4.0

Additional context

Callstack:

>	Editor.exe!AZ::Debug::Platform::DebugBreak() Line 90	C++
 	Editor.exe!AZ::Debug::Trace::Assert(const char * fileName, int line, const char * funcName, const char * format, ...) Line 436	C++
 	Atom_RHI_Vulkan.Private.dll!AZ::Vulkan::AssertSuccess(VkResult result) Line 102	C++
 	Atom_RHI_Vulkan.Private.dll!AZ::Vulkan::Queue::SubmitCommandBuffers(const AZStd::vector<AZStd::intrusive_ptr<AZ::Vulkan::CommandList>,AZStd::allocator> & commandBuffers, const AZStd::vector<AZStd::pair<unsigned int,AZStd::intrusive_ptr<AZ::Vulkan::Semaphore>>,AZStd::allocator> & waitSemaphoresInfo, const AZStd::vector<AZStd::intrusive_ptr<AZ::Vulkan::Semaphore>,AZStd::allocator> & semaphoresToSignal, AZ::Vulkan::Fence * fenceToSignal) Line 97	C++
 	Atom_RHI_Vulkan.Private.dll!AZ::Vulkan::CommandQueue::ExecuteWork::__l2::<lambda_1>::operator()(void * queue) Line 75	C++
 	[Inline Frame] Atom_RHI_Vulkan.Private.dll!AZStd::function_intermediate<void,void *>::operator()(void * &&) Line 604	C++
 	[Inline Frame] Atom_RHI_Vulkan.Private.dll!AZStd::function<void __cdecl(void *)>::operator()(void * <args_0>) Line 684	C++
 	Atom_RHI_Vulkan.Private.dll!AZ::RHI::CommandQueue::ProcessQueue() Line 142	C++
 	[External Code]	

VS Console Log Output:

RHI: ****************************************************************
<09:00:20> (RHI) - ****************************************************************

RHI:                     Registering vulkan RHI                          
<09:00:20> (RHI) -                     Registering vulkan RHI                          

RHI: ****************************************************************
<09:00:20> (RHI) - ****************************************************************

The thread 0x7cb4 has exited with code 0 (0x0).
'Editor.exe' (Win32): Loaded 'C:\Windows\System32\XInput1_4.dll'. 
'Editor.exe' (Win32): Loaded 'C:\Windows\System32\InputHost.dll'. 
'Editor.exe' (Win32): Loaded 'C:\Windows\System32\psapi.dll'. 
System: 
==================================================================
<09:00:22> 
==================================================================

System: Trace::Assert
 C:\GIT\o3de\Gems\Atom\RHI\Vulkan\Code\Source\RHI/Vulkan.h(102): (57060) 'void __cdecl AZ::Vulkan::AssertSuccess(enum VkResult)'
<09:00:22> Trace::Assert
 C:\GIT\o3de\Gems\Atom\RHI\Vulkan\Code\Source\RHI/Vulkan.h(102): (57060) 'void __cdecl AZ::Vulkan::AssertSuccess(enum VkResult)'

System: ASSERT: Vulkan API method failed: Device lost
<09:00:22> ASSERT: Vulkan API method failed: Device lost

System: ------------------------------------------------
<09:00:22> ------------------------------------------------

'Editor.exe' (Win32): Loaded 'C:\Windows\System32\dbghelp.dll'. 
System: C:\GIT\o3de\Gems\Atom\RHI\Vulkan\Code\Source\RHI\Vulkan.h (102) : AZ::Vulkan::AssertSuccess
<09:00:22> C:\GIT\o3de\Gems\Atom\RHI\Vulkan\Code\Source\RHI\Vulkan.h (102) : AZ::Vulkan::AssertSuccess

System: C:\GIT\o3de\Gems\Atom\RHI\Vulkan\Code\Source\RHI\Queue.cpp (97) : AZ::Vulkan::Queue::SubmitCommandBuffers
<09:00:22> C:\GIT\o3de\Gems\Atom\RHI\Vulkan\Code\Source\RHI\Queue.cpp (97) : AZ::Vulkan::Queue::SubmitCommandBuffers

System: C:\GIT\o3de\Gems\Atom\RHI\Vulkan\Code\Source\RHI\CommandQueue.cpp (75) : `AZ::Vulkan::CommandQueue::ExecuteWork'::`2'::<lambda_1>::operator()
<09:00:22> C:\GIT\o3de\Gems\Atom\RHI\Vulkan\Code\Source\RHI\CommandQueue.cpp (75) : `AZ::Vulkan::CommandQueue::ExecuteWork'::`2'::<lambda_1>::operator()

System: C:\GIT\o3de\Gems\Atom\RHI\Code\Source\RHI\CommandQueue.cpp (142) : AZ::RHI::CommandQueue::ProcessQueue
<09:00:22> C:\GIT\o3de\Gems\Atom\RHI\Code\Source\RHI\CommandQueue.cpp (142) : AZ::RHI::CommandQueue::ProcessQueue

System: C:\GIT\o3de\Code\Framework\AzCore\Platform\Common\WinAPI\AzCore\std\parallel\internal\thread_WinAPI.cpp (38) : AZStd::Internal::thread_run_function
<09:00:22> C:\GIT\o3de\Code\Framework\AzCore\Platform\Common\WinAPI\AzCore\std\parallel\internal\thread_WinAPI.cpp (38) : AZStd::Internal::thread_run_function

System: 00007FFFBB329363 (ucrtbase) : recalloc
<09:00:22> 00007FFFBB329363 (ucrtbase) : recalloc

System: 00007FFFBD7A26AD (KERNEL32) : BaseThreadInitThunk
<09:00:22> 00007FFFBD7A26AD (KERNEL32) : BaseThreadInitThunk

System: 00007FFFBDAEA9F8 (ntdll) : RtlUserThreadStart
<09:00:22> 00007FFFBDAEA9F8 (ntdll) : RtlUserThreadStart

System: ==================================================================
<09:00:22> ==================================================================

A breakpoint instruction (__debugbreak() statement or a similar call) was executed in Editor.exe.
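
The assert message and the engine frames can be pulled out of output like the above with a short script. This is a minimal sketch that assumes the line layout shown in this log; the regular expressions and the `summarize_assert` name are illustrative and not part of O3DE.

```python
import re

# Matches engine frames such as:
#   System: C:\...\RHI\Queue.cpp (97) : AZ::Vulkan::Queue::SubmitCommandBuffers
# The optional prefix is either "System: " or a "<hh:mm:ss> " timestamp.
FRAME_RE = re.compile(
    r"^(?:System: |<[\d:]+> )?(?P<file>\S.*?) \((?P<line>\d+)\) : (?P<func>.+)$"
)
ASSERT_RE = re.compile(r"ASSERT: (?P<message>.+)$")


def summarize_assert(log_text):
    """Return (assert_message, frames) found in the given log text.

    Each frame is a (file, line, function) tuple. Frames that only carry an
    address and a module name, such as "(KERNEL32)", are skipped because they
    have no numeric line. Repeated frames are reported once, since the
    console output prints every line twice.
    """
    message = None
    frames = []
    for raw in log_text.splitlines():
        line = raw.strip()
        found = ASSERT_RE.search(line)
        if found and message is None:
            message = found.group("message")
            continue
        found = FRAME_RE.match(line)
        if found:
            frame = (found.group("file"), int(found.group("line")), found.group("func"))
            if frame not in frames:
                frames.append(frame)
    return message, frames
```

Fed the console output above, this should report the "Device lost" message once, followed by the frames from Vulkan.h down to thread_WinAPI.cpp.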

Here is the Editor.Log: Editor.log

galibzon Dec 08 '23 14:12

FYI @moudgils & @akioCL

galibzon Dec 08 '23 15:12

@galibzon out of curiosity... What is the full command one can use to run with Vulkan? Or what are the necessary steps to do it?

I have always run on the default DX, and I cannot find any mention of how to do it in the docs.

mythrz Dec 08 '23 16:12

@mythrz, the command line argument is: -rhi=vulkan
The alternative for DX12 is:
-rhi=dx12 (but this is already the default RHI on Windows).

galibzon Dec 08 '23 17:12

Also, lots of good tips can be found in the wiki (https://github.com/o3de/o3de/wiki) that are not available on o3de.org.

galibzon Dec 08 '23 17:12

Thank you @galibzon. It worked fine by simply running it, and I did not manage to crash it in 10 launches. I did not run it with automated tests/asserts on a server, though. Commit: https://github.com/o3de/o3de/commit/5aea1a28f7426333bf2c7dcc2c2516210ca9790c

mythrz Dec 09 '23 14:12

Try these steps and let's add the output afterwards to help us determine which GPU work caused the crash:

  • Uncomment AZ_FORCE_CPU_GPU_INSYNC in Gems/Atom/RHI/Code/Include/Atom/RHI.Reflect/Base.h.
  • Recompile and run.
  • When the GPU crashes, the CPU will also crash or hang, which lets you inspect the main thread; it should have called execute/commit on the work related to the pass that crashed the GPU.

moudgils Dec 13 '23 18:12
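
The first step above can also be scripted when switching back and forth between builds. This is a minimal sketch, assuming the macro appears in Base.h as a commented-out `#define` line; the exact formatting of that line is not confirmed here, and the helper name is hypothetical.

```python
import re


def uncomment_define(source_text, macro):
    """Uncomment `// #define <macro>` lines.

    Returns a (new_text, replacement_count) tuple. Lines where the define is
    already active are left untouched, so the count is 0 on a second run.
    """
    pattern = re.compile(
        r"^(?P<indent>[ \t]*)//[ \t]*(?P<define>#[ \t]*define[ \t]+"
        + re.escape(macro)
        + r"\b)",
        re.MULTILINE,
    )
    return pattern.subn(r"\g<indent>\g<define>", source_text)


# Example, run from the root of the o3de checkout:
#   from pathlib import Path
#   header = Path("Gems/Atom/RHI/Code/Include/Atom/RHI.Reflect/Base.h")
#   text, count = uncomment_define(header.read_text(), "AZ_FORCE_CPU_GPU_INSYNC")
#   if count:
#       header.write_text(text)
```

A count of 0 means the pattern did not match, in which case the header should be edited by hand.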

Today I updated my NVIDIA drivers to 546.33 (from 546.01) and this bug no longer happens. I started the Editor 10 times consecutively and did not experience the crash.

galibzon Dec 21 '23 17:12
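
A note on interpreting clean runs: if the crash really occurs in about 1 out of 5 launches, as originally reported, and launches are independent of each other (an assumption), then ten consecutive clean starts are only weak evidence of a fix. A quick check, with hypothetical helper names:

```python
import math


def prob_all_clean(crash_rate, runs):
    """Probability of seeing no crash in `runs` independent launches."""
    return (1.0 - crash_rate) ** runs


def runs_for_confidence(crash_rate, confidence=0.95):
    """Smallest number of consecutive clean launches after which a crash
    rate of `crash_rate` can be ruled out at the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - crash_rate))


print(round(prob_all_clean(0.2, 10), 3))  # 0.107
print(runs_for_confidence(0.2))           # 14
```

With a 20% crash rate, roughly one in nine sessions of ten launches would show no crash at all, and about 14 clean launches in a row are needed before the original rate becomes unlikely at the 95% level.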

Reopening the issue as it still happens.

galibzon May 09 '24 16:05

@hosea1008 If your team has any Vulkan-related GPU crash fixes, please consider bringing them over, as they may address random GPU crashes like these.

moudgils May 15 '24 17:05

Hi, I went through our work in the engine repo and unfortunately I didn't find any related Vulkan fixes for the engine.

But we did encounter some random crashes with the Vulkan RHI on some of our machines when we tried to pick #17022 and its corresponding fixes (#17226, #17500) into our 2305.0-based version. In our case, some machines with a lower driver version did not crash. And when we enabled Nsight Aftermath to take a look at the crash, it would not reproduce. We finally gave up on picking it.

I have been running the community version with the Vulkan RHI lately, but in my environment Nsight Aftermath is always enabled, so I haven't encountered any related crashes.

Sorry for not having actual fixes for that; I am not sure whether the above hints will help.

hosea1008 May 16 '24 01:05