terminal icon indicating copy to clipboard operation
terminal copied to clipboard

Updating GPU drivers closed all terminals

Open muchcharles opened this issue 2 years ago • 33 comments

Windows Terminal version

1.18.1462.0

Windows build number

Version 10.0.19045.2965

Other Software

No response

Steps to reproduce

Open a terminal and have some work in progress. Install new nvidia drivers.

Expected Behavior

Terminal windows don't close when updating GPU drivers.

Actual Behavior

During the GPU driver reset as part of the upgrade (when screen blanks), all terminal windows close and you lose all of your work.

muchcharles avatar Jun 15 '23 00:06 muchcharles

Windows Terminal probably crashed when you upgraded your driver. Is there any chance you have a crash report in the Event Viewer for Windows Terminal?

image

Or do you maybe have any dumps in %LOCALAPPDATA%\CrashDumps?

I just upgraded my Nvidia graphics driver and didn't have any crashes, so it probably at least doesn't happen all the time. What graphics driver did you upgrade?

Edit: Discord on the other hand did crash after the upgrade. 😄

lhecker avatar Jun 15 '23 15:06 lhecker

It does have an entry in event viewer:

image

And there is a dump as well.

It was upgrading to Nvidia driver 536.23

muchcharles avatar Jun 15 '23 16:06 muchcharles

nvwqf2umx.dll

That might be a bug in the Nvidia driver, I think? If you're willing to share the dump with us, you can share it either on GitHub or send it to us via mail. You can find my personal mail address on my GitHub profile, and I'll then forward it to the rest of the team. However, it might contain parts of your terminal's contents.

lhecker avatar Jun 15 '23 17:06 lhecker

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.

What's going on with the bot? ._. Edit: Commenting fixes it? Buggy bot...

lhecker avatar Jun 19 '23 20:06 lhecker

I'm sorry, I think the bot is having an episode.

DHowett avatar Jun 19 '23 20:06 DHowett

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.

Not stale, sent attachment via email

muchcharles avatar Jun 19 '23 20:06 muchcharles

I forgot to comment back... I tried to use the dump but unfortunately it has been corrupted during the crash. The stack trace contains 3 function calls where the second one is at address 0 and the usual WinDbg helpers failed to pull out any additional information as well.

I'll close the issue for now because of this, but I'd be happy to reopen it if it occurs again and we happen to get a valid crash dump. I suspect however that this is a bug in Nvidia's driver.

lhecker avatar Jul 12 '23 21:07 lhecker

It happened again today with the latest nvidia update.

On Wed, Jul 12, 2023 at 5:57 PM Leonard Hecker @.***> wrote:

Closed #15553 https://github.com/microsoft/terminal/issues/15553 as completed.

— Reply to this email directly, view it on GitHub https://github.com/microsoft/terminal/issues/15553#event-9804364942, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABVTFWJM7H6POL5D5HXK2D3XP4MUNANCNFSM6AAAAAAZHCGVRY . You are receiving this because you authored the thread.Message ID: @.***>

muchcharles avatar Jul 28 '23 01:07 muchcharles

Only two places in the code check for DXGI_ERROR_DEVICE_REMOVED or DXGI_ERROR_DEVICE_RESET

/mnt/d/git/terminal$ ack DXGI_ERROR_DEVICE_REMOVED'|'DXGI_ERROR_DEVICE_RESET src/renderer/atlas/AtlasEngine.r.cpp 58: if (hr == DXGI_ERROR_DEVICE_REMOVED || hr == DXGI_ERROR_DEVICE_RESET)

src/renderer/dx/DxRenderer.cpp 1538: recreate = hr == DXGI_ERROR_DEVICE_REMOVED || hr == DXGI_ERROR_DEVICE_RESET; 1558: recreate = hr == DXGI_ERROR_DEVICE_REMOVED || hr == DXGI_ERROR_DEVICE_RESET;

Does the normal run loop, when not typing in to the console, call any DX12 API calls other than present?

Also, in the handling of those errors it only seems to retry 3 times for a total of less than a second:

// Add a bit of backoff.
// Sleep 150ms, 300ms, 450ms before failing out and disabling the renderer.
Sleep(renderBackoffBaseTimeMilliseconds * (maxRetriesForRenderEngine - tries));

Display blanking during driver updates and stuff can go for considerably more than 900ms. Also consider something like detaching a Surface screen shortly after resume from hibernate when things are swappy.

Note if you change maxRetriesForRenderEngine you should also change that sleep line. Should probably have a static_assert in place or a min to make sure the sleep doesn't go negative when changing that value.

muchcharles avatar Jul 28 '23 02:07 muchcharles

WDDMConEndUpdateDisplayBatch that swaps the swap chain textures doesn't seem to handle errors assuming these are the calls to it via EndPaint:

src/interactivity/win32/windowproc.cpp
829:    // BeginPaint/EndPaint does a bunch of other magic in the system level
832:    // We've tried in the past to not call BeginPaint/EndPaint
847:    LOG_IF_WIN32_BOOL_FALSE(EndPaint(hwnd, &ps));

src/renderer/base/renderer.cpp
134:        LOG_IF_FAILED(pEngine->EndPaint());

muchcharles avatar Jul 28 '23 02:07 muchcharles

It happened again today with the latest nvidia update.

Ah, fuck... If it's a recurring issue, I'll leave it open.

Only two places in the code check for DXGI_ERROR_DEVICE_REMOVED or DXGI_ERROR_DEVICE_RESET

The crash is definitely not due to any of those error codes, nor due to the retrying. The hr == DXGI_ERROR_DEVICE_REMOVED || hr == DXGI_ERROR_DEVICE_RESET check is only there to decide whether an error is an expected or unexpected. If it's an expected one (either of the 2 error codes) it'll retry indefinitely. But if it's neither of the 2 then the retry loop you found will only try up to 3 times to rerender the contents. If that fails, the application won't simply exit/crash, however. Instead, you'll see this (last error code + a prompt to resume rendering):

It's also not the EndPaint(hwnd, &ps) either, because src/interactivity is code specific to conhost (the old console window) and is not used for Windows Terminal. WDDMConEndUpdateDisplayBatch is used for our WDDM renderer, because we also build and ship the conhost used for the Windows recovery image, etc., which don't use DirectX.

I suspect that this issue either happens due to a driver bug or due to us accessing a nullptr member when the renderer is gone. I think both of these are about equally likely and also equally hard to debug. 🫤 You could try sending me your latest crash dump again. Maybe it's uncorrupted this time?

lhecker avatar Jul 28 '23 10:07 lhecker

I had NDA'd stuff in the terminal and can't send.

Is it possible to switch off GPU rendering to WDDM or is that tied to the context the terminal is being used in? Stability of the terminal is the highest priority.

muchcharles avatar Jul 28 '23 21:07 muchcharles

I found the software rendering setting, I'll switch to that if it doesn't spin a core when not drawing anything or something, or slow down compiles which has been a problem with windows terminal in the past.

I can open the dump for you if there are instructions to get the project up easily at the right revision somewhere.

I'm on

Windows Terminal Preview Version: 1.18.1462.0

muchcharles avatar Jul 28 '23 21:07 muchcharles

In the Microsoft Store you can find "WinDbg" (and WinDbg Preview - either are fine). If you open the crash dump with WinDbg it'll automatically resolve symbols for you. In the "Command" window you can type !analyze -v for an initial triage of the issue. The "Stack" window (bottom right, in a tab, by default) shows the stack trace which is also important. You can also get a textual stack trace in the Command window by typing k. If you type kp you get a stack trace with function parameters. If you type ~* kp you get that but for all running threads. In short, the commands !analyze -v and ~* kp might contain the most helpful information.

lhecker avatar Jul 29 '23 00:07 lhecker

Looks like more crash info than last time and not in a GPU driver call:

image


FAULTING_SOURCE_FILE:  C:\a\_work\1\s\src\renderer\atlas\BackendD3D.cpp

FAULTING_SOURCE_LINE_NUMBER:  1481

FAULTING_SOURCE_CODE:  
No source found for 'C:\a\_work\1\s\src\renderer\atlas\BackendD3D.cpp'


SYMBOL_NAME:  Microsoft_Terminal_Control!Microsoft::Console::Render::Atlas::BackendD3D::_drawSoftFontGlyph+19e

muchcharles avatar Jul 29 '23 17:07 muchcharles

That's amazing! Thank you so much for spending time on this! I'm sure you're not actually using soft fonts (soft fonts are custom fonts you can specify via VT), right? This indicates that we have a bug where we either don't properly reset the BackendD3D::_glyphAtlasMap hashmap or the ShapedRow::mapping vectors.

If you go up in the callstack twice into _drawText you should be here: https://github.com/microsoft/terminal/blob/57b9549ff814b52cdcf2bf36c69488db22abcde2/src/renderer/atlas/BackendD3D.cpp#L1012

Can you see the contents of the row variable? This one: https://github.com/microsoft/terminal/blob/57b9549ff814b52cdcf2bf36c69488db22abcde2/src/renderer/atlas/BackendD3D.cpp#L970

A screenshot of its contents would be helpful, but in particular its mappings member would be interesting: Both, how many items the mappings vector has and which ones / how many of them have a fontFace value of nullptr. (Hopefully WinDbg loads the embedded NatVis data correctly - in that case you'll have a much easier time inspecting those containers.)

If you have trouble viewing the data, I'll try to investigate those BackendD3D::_glyphAtlasMap hashmap and ShapedRow::mapping anyways. I'm pretty sure it should be some code that uses them. You helped me a ton already! Thanks!

lhecker avatar Jul 29 '23 18:07 lhecker

I don't think I'm using soft fonts, just running default ubuntu through WSL without much console customization.

Here's the locals for _drawText:

image

muchcharles avatar Jul 29 '23 22:07 muchcharles

Aww... The dump doesn't contain the contents of the heap. I'll try option 2 then: Re-read all the code that touches those two member variables I mentioned before, until I found it. 😄 If you can, please keep the dump file for a few more weeks, just in case. 🙂

lhecker avatar Jul 30 '23 10:07 lhecker

Oh I only just now noticed: image

Please correct me if I'm wrong @j4james (primary author around soft fonts), but I believe that a non-zero cellSize in the UpdateSoftFont call can only occur when a soft font is actually used right?

@muchcharles Could you perhaps be using soft-fonts unknowingly? Are you using any more "fanciful" VT applications through WSL? Or maybe the WSL output gets corrupted and the output is accidentally interpreted as a soft font?

lhecker avatar Aug 28 '23 16:08 lhecker

I don't think so, mostly just vim with very few plugins and normal shell command stuff. What are some of the big terminal apps that use soft fonts?

muchcharles avatar Aug 28 '23 17:08 muchcharles

Please correct me if I'm wrong @j4james (primary author around soft fonts), but I believe that a non-zero cellSize in the UpdateSoftFont call can only occur when a soft font is actually used right?

Yeah, it's only called with a non-zero size when the DECDLD sequence is finished processing here: https://github.com/microsoft/terminal/blob/c4436157c116880c320b5df278d4d886a930c0f3/src/terminal/adapter/adaptDispatch.cpp#L3899-L3902

The only other time it's called is when in response to an RIS hard reset, but that's with a zero size. See here: https://github.com/microsoft/terminal/blob/c4436157c116880c320b5df278d4d886a930c0f3/src/terminal/adapter/adaptDispatch.cpp#L3062-L3064

j4james avatar Aug 29 '23 10:08 j4james

Is it possible this is the result of the cellSize in the FontSettings structure not being initialized? See here:

https://github.com/microsoft/terminal/blob/c4436157c116880c320b5df278d4d886a930c0f3/src/renderer/atlas/common.h#L335-L343

As far as I can see the u16x2 structure isn't zero-initialized.

j4james avatar Aug 29 '23 10:08 j4james

@j4james The cellSize for soft fonts is here though: https://github.com/microsoft/terminal/blob/c4436157c116880c320b5df278d4d886a930c0f3/src/renderer/atlas/common.h#L364

Since it uses til::size it's always properly initialized.

But thanks for pointing this out! I'll go through all the *x* vector members in the code to fix them up. I've been planning to turn most of the u16 and u16x2 stuff into regular til::CoordType and til::point/size code anyways. Only few parts actually benefit from the tight packing as far as I can tell. Or maybe I should use int/int32_t? til::CoordType was mostly meant as buffer coordinates after all and not for pixel coordinates... (The vector types aren't zero-initialized by default so that they're trivially copyable. BackendD3D processes through dozens of MB/s at >60 FPS after all.)

lhecker avatar Aug 29 '23 12:08 lhecker

@j4james The cellSize for soft fonts is here though:

Yeah, sorry, I misread that.

But if it's not that, I have no idea how we're ending in this state. It seems like something must be getting corrupted at some point, and it's highly unlikely that the OP is using soft fonts if they aren't doing so consciously, so how do we end up in the _drawSoftFontGlyph method? Could it be something like the fontFaceEntry.fontFace field getting reset as a result of the driver update? Although that doesn't explain where that height value came from. I'm stumped.

j4james avatar Aug 29 '23 16:08 j4james