sentry-native Minidump files get placed in the wrong directory and aren't uploaded to Sentry

Description

I'm using the Native SDK with the bundled crashpad_handler, and I noticed that crashes for some users wouldn't get uploaded to Sentry. After some digging I realized that for those users that didn't get their crashes uploaded, the minidump (dmp) files got placed in a different directory from the one configured with sentry_options_set_database_path. The directory they get placed in is [user]/Local/AppData/CrashDumps. It only happens for some users, but for those this is consistent.

Does anyone know what could be the cause of this? Or what I could do to debug it further? I've looked in the documentation and source code, but I haven't been able to figure out how I get the crashpad_handler to log stuff.

Thanks in advance!

When does the problem happen

[ ] During build
[ ] During run-time
[X] When capturing a hard crash

Environment

OS: Microsoft Windows 10 Business, 64 bit, Version 10.0.19045 Build 19045
Compiler: Visual Studio 2022, 17.3.6
CMake version and config: 3.25.2, -DBUILD_SHARED_LIBS=OFF -DSENTRY_BUILD_RUNTIMESTATIC=ON -DSENTRY_BACKEND=crashpad

Steps To Reproduce

It's only reproducible on some machines, but this is how I set up the crash handler in my app:

	sentry_options_t* options = sentry_options_new();
	sentry_options_set_dsn(options, "<a dsn value>");
	sentry_options_set_debug(options, true);
	sentry_options_set_environment(options, "Development");
	sentry_options_set_release(options, "0");
	sentry_options_set_handler_path(options, "<working directory>");
	sentry_options_set_database_path(options, "<working directory>/sentry-db");
	sentry_init(options);
	sentry_set_transaction("<app name>");
	sentry_set_level(SENTRY_LEVEL_WARNING);

Log output

I've compared the Sentry Native output between a computer where everything is working fine and one where it is not, and they are identical. I don't know how to get output from the crashpad_handler.

Apr 17 '23 15:04 mntt-SSE

Hi @mntt-SSE! I could not look into this on any of my Windows machines yet, but it seems you can access both types of machines to verify behavior and configuration.

Can you look in the registry to see whether the machines differ in the settings for HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps? I could imagine some unexpected interaction with the crashpad WER module.

Apr 17 '23 19:04 supervacuus

Thanks for your answer! On a machine where the problem exists, the LocalDumps entry looks like this:

And each entry has a DumpFolder key with the data %PROGRAMDATA%\NVIDIA Corporation\CrashDumps. All of them are the same, and none seem to point to the application I'm using, and the DumpFolder is also different from the one where I get the crash dumps (although that folder also exists, but is empty).

On a machine where Sentry works as expected, no LocalDumps entry exists at all under Windows Error Reporting. I verified with 2 machines that don't work and 1 that does work, and they fit the pattern.

Apr 18 '23 12:04 mntt-SSE

Thx for verifying, @mntt-SSE; that is what I expected. LocalDumps bypasses application-local handling in favor of a centralized dump approach. It should be possible to deactivate it for your application by creating a new key (similar to the other apps below the LocalDumpskey), but I haven't tested it yet. Here are the docs for the key: https://learn.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps

Apr 18 '23 12:04 supervacuus

And each entry has a DumpFolder key with the data %PROGRAMDATA%\NVIDIA Corporation\CrashDumps. All of them are the same, and none seem to point to the application I'm using, and the DumpFolder is also different from the one where I get the crash dumps (although that folder also exists, but is empty).

Maybe I should also say, that what is interesting are the values in LocalDumps, not the sub-keys which probably just provide a counter config because those affect all applications not listed below.

Apr 18 '23 13:04 supervacuus

For further root-causing, it would certainly help if you could also gather the debug output of the Native SDK on the systems where the dump happens in %LOCALAPPDATA%\CrashDumps; maybe something obvious is already visible there.

Apr 18 '23 14:04 supervacuus

The LocalDumps key itself does not contain any values other than (Default), which has no data.

I have so far tried to:

Create a key under LocalDumps with the name of the application, and set its DumpType value to 0.
Create a key under Windows Error Reporting called ExcludedApplications, and add a DWORD value with the name of the application as the key, and 1 as the value.
Under the ExcludedApplications key, create another key with the name of the application.
Call the Windows API function WerAddExcludedApplication. This did the same as 2.

I tried restarting the computer after each step, but nothing worked :/ I also checked the corresponding keys under HKEY_CURRENT_USER, but there is nothing there.

I tried attaching WinDbg to crashpad_handler.exe but I don't seem to get any logging from it in the Logs widget. Do I have to build it with a certain configuration or flag? I should also mention that I haven't used WinDbg before so I might be missing something.

Apr 19 '23 15:04 mntt-SSE

I tried attaching WinDbg to crashpad_handler.exe but I don't seem to get any logging from it in the Logs widget. Do I have to build it with a certain configuration or flag? I should also mention that I haven't used WinDbg before so I might be missing something.

What I meant was that even though you have compared the two logs, which were identical on both types of machines, it would be great to see them.

I am asking because the only other reason I can imagine this happening is if our backend didn't start correctly (and for some reason, sentry_init() was still successful). Can you check if sentry_init() returns successfully (i.e., doesn't return 1) on those machines? From what you described, I don't think the crashpad_handler is involved in the crash process. It looks like your crash escapes our local exception filter (or rather: that ours wasn't installed and instead the OS handles the crash).

The LocalDumps key itself does not contain any values other than (Default), which has no data.

That means that it is configured to use the default values.

Apr 24 '23 13:04 supervacuus

What I meant was that even though you have compared the two logs, which were identical on both types of machines, it would be great to see them.

Ah, sorry, here is what's being printed on startup from a computer that doesn't work (which is the same as when it works):

[2023-04-24 16:30:29] Info: Sentry: "received response:
HTTP/1.1 200 OK 
Connection: keep-alive 
Date: Mon, 24 Apr 2023 14:30:28 GMT 
Content-Length: 2 
Content-Type: application/json 
Server: nginx 
Vary: origin,access-control-request-method,access-control-request-headers 
access-control-allow-origin: * 
access-control-expose-headers: x-sentry-error,x-sentry-rate-limits,retry-after 
x-envoy-upstream-service-time: 0 
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload 
" (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "request handled in 520m" (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "executing task on worker threa" (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "sending request using winhttp to <address>:
x-sentry-auth:<auth> sentry_key=<key>, sentry_version=7, sentry_client=sentry.native/0.5.3 
content-type:application/x-sentry-envelope 
content-length:17594 " (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "received response:
HTTP/1.1 200 OK 
Connection: keep-alive 
Date: Mon, 24 Apr 2023 14:30:28 GMT 
Content-Length: 41 
Content-Type: application/json 
Server: nginx 
Vary: origin,access-control-request-method,access-control-request-headers 
access-control-allow-origin: * 
access-control-expose-headers: x-sentry-error,x-sentry-rate-limits,retry-after 
x-envoy-upstream-service-time: 0 
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload 
" (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "request handled in 243m" (level=-1)

I am asking because the only other reason I can imagine this happening is if our backend didn't start correctly (and for some reason, sentry_init() was still successful). Can you check if sentry_init() returns successfully (i.e., doesn't return 1) on those machines? From what you described, I don't think the crashpad_handler is involved in the crash process. It looks like your crash escapes our local exception filter (or rather: that ours wasn't installed and instead the OS handles the crash).

sentry_init() returns 0 on all machines I've tested it on. And I can also see that the crashpad_handler.exe process is starting when the application is started, and quits when the application crashes (or when it exits nicely).

Apr 24 '23 14:04 mntt-SSE

Ah, sorry, here is what's being printed on startup from a computer that doesn't work (which is the same as when it works):

Okay, the logs are wrapped, interleaved, and incomplete. They also show some other transmission (maybe a manual event) but not the crash (because with crashpad, you would not see a transmission from within our worker_thread).

Maybe it is easier to debug this issue with the plain sentry_example.exe. In any case, I would expect a program configured as you described it in your initial description to print all these things - at some point in its execution - until it crashes:

[sentry] INFO using database path "your_database_path"
[sentry] DEBUG starting transport
[sentry] DEBUG starting background worker thread
[sentry] DEBUG starting backend
[sentry] DEBUG background worker thread started
[sentry] DEBUG starting crashpad backend with handler "your_handler_path"
[sentry] DEBUG using minidump url "your_dsn_url"
[sentry] DEBUG registering crashpad WER handler "your_wer_module_dll"
[sentry] INFO started crashpad client handler
[sentry] DEBUG processing and pruning old runs
[sentry] INFO flushing session and queue before crashpad handler
[sentry] INFO handing control over to crashpad

Since you sentry_init() successfully (at least according to the return value), I am especially interested if you see the last two lines. If those never appear, then the client-side crash-handler is never invoked, which means that the crashpad_handler will not be triggered (even if it is running).

You can also verify this externally by deleting your database path, executing your program, provoking a crash, and looking into (the newly created) path; there must be a last_crash marker file in there.

Apr 24 '23 15:04 supervacuus

I have been plagued by the same issue, that some machines just refuse to send any reports. After reading this discussion and talking to @supervacuus on discord I spent a day trying a bunch of things:

To start by answering the last question, if the 2 last lines of the sentry output: Those lines does appear on working machines and does not on the machines that doesn't work. This indicates that the UnhandledExceptionFilter is not run, so I started trying to figure out why.

The first thing that proved that it was not a Windows issue (as a lot of blogs and discussions around the web indicates it could be) was that it worked to do sentry_reinstall_backend() just before triggering the crash.

So something in the startup process overwrites our exception handler, and I'm pretty sure it is not any of our dependencies since it should behave the same for all machines in that case, and I couldn't find any reference to SetUnhandledExceptionFilter. Not sure what else it could be, some driver or something? I tried inspecting loaded dlls but didn't see anything significant, but I assume it could be something that gets loaded and then released during startup.

I tried the sentry_example.exe which seems to work fine on all machines, so I would assume it has to do with it being a gui application.

I finally tried replacing the SetUnhandledExceptionFilter with a simple return by overwriting the function address and it works perfectly, but it feels very sad. Searching a bit on github though this seems to be the way people do it. Maybe it is something sentry-native should do to increase its reliability? I assume if you use sentry-native you want it to have full control of these things. But at the same time it is a bit harsh...

May 13 '23 10:05 Cyriuz

Thank you @Cyriuz and @supervacuus for looking into this!

I have also done some more investigation. Firstly, the reason I didn't get the full Sentry logs was because I initialized Sentry before the other log handling. When I switched the order I got output identical to yours @supervacuus, except for the registering crashpad WER handler "your_wer_module_dll", where it says no WER handler is registered.

Secondly, I have found a workaround. I'm building a Qt application (Qt 5.15.8), and I noticed that if I initialize Sentry after the first window is shown (using QWidget::show()), then everything works as expected, which strengthen your theory about it haviing to do with UI @Cyriuz . So now I'm flashing a tiny window before the main one, which is far from ideal, but at least I'm getting the events. Would still be good to have another solution though.

May 15 '23 07:05 mntt-SSE

If you run this after sentry_init it works for me:

void lockUnhandledExceptionFilter()
{
	HMODULE kernel = GetModuleHandle("kernel32.dll");
	if (!kernel)
	{
		printf("[Warning] LockUnhandledExceptionFilter: Failed locating kernel32 module.\n");
		return;
	}

	FARPROC setUnhandledExceptionFilter = GetProcAddress(kernel, "SetUnhandledExceptionFilter");
	if (!setUnhandledExceptionFilter)
	{
		printf("[Warning] LockUnhandledExceptionFilter: Failed locating SetUnhandledExceptionFilter.\n");
		return;
	}

	#ifdef Q_OS_WIN64
	static const BYTE newBody[] = {
		0x33, 0xC0, // xor eax,eax
		0xC3 // ret
	};
	#else
	static const BYTE newBody[] = {
		0x33, 0xC0, // xor eax,eax
		0xC2, 0x04, 0x00, // ret 4
	};
	#endif
	SIZE_T bytes = 0;
	if (WriteProcessMemory(GetCurrentProcess(), (LPVOID)setUnhandledExceptionFilter, (LPCVOID)newBody, sizeof(newBody), &bytes) == 0)
	{
		printf("[Warning] LockUnhandledExceptionFilter: Failed writing process memory.\n");
	}
}

I tried to simplify the other examples I found as much as possible but I'm no Windows programmer so take it with a grain of salt.

The best way to get to the bottom of what it is that clears the filter would be to make a proper hook, and print the callstack in there, but its somewhat involved and I'm not sure what I would do with the info anyway since I guess it theoretically could be multiple things.

To get the WER module you have to build sentry-native with CRASHPAD_WER_ENABLED

May 15 '23 08:05 Cyriuz

That is some very good feedback, thanks for the investigation!

We have indeed heard reports before that Qt / display drivers are constantly overwriting the UnhandledExceptionFilter, similar to managed language runtimes.

While it would be possible to hook / lock overwriting it, we decided against that, as it is a very deep "incision" that we don’t want to do by default.

May 15 '23 10:05 Swatinem

Fair enough. Although I wonder if it can still be considered somewhat standard as so many other projects seem to do the same? And considering that sentry-native is, in my understanding at least, supposed to take ownership of the entire crash reporting pipeline I feel like there might be an argument for including it, I guess it could even be optional? Or maybe it could just be a good topic for a page on the sentry docs as a workaround so people can find it easier? I assume it potentially affects most people that use sentry native on Windows.

May 15 '23 11:05 Cyriuz

IMO hooking the UnhandledExceptionFilter completely is a really big hammer. We might be breaking apps in surprising ways that we can’t predict yet.

There are legitimate uses for temporarily using it and then restoring the original handler afterwards. We don’t want to break those usecases. Although as the examples show, restoring the original handler is often not done correctly by other tools either.

We might want to offer this either as a compile time flag or an explicit runtime API though, wdyt @supervacuus ? Then the responsibility of (not) breaking any unpredictable usecases is up to the app author.

May 15 '23 14:05 Swatinem

Just going to throw in my 2 cents here. Had a very similar issue with regards to QT (5.15.8), the moment it was initialized the entire crash reporting seized to function. There we're no registry keys, or misplaced dumps, in fact there was absolutely nothing.

A quick test basically confirmed what I suspected, I would throw and exception right after sentry initialized, it worked. Did the same right after calling QApplication a(argc, argv); nope back to being broken.

Weirdly enough it worked on some Windows machines just fine (e.g: server, ...) but on others basically nothing worked. No dumps, no logs (only the one indicating that sentry had initialized), no indication that something went wrong whatsoever.

Jun 14 '23 10:06 W2Wizard

sentry-native sentry-native copied to clipboard

Minidump files get placed in the wrong directory and aren't uploaded to Sentry

sentry-native
sentry-native copied to clipboard