sentry-native
sentry-native copied to clipboard
Minidump files get placed in the wrong directory and aren't uploaded to Sentry
Description
I'm using the Native SDK with the bundled crashpad_handler, and I noticed that crashes for some users wouldn't get uploaded to Sentry. After some digging I realized that for those users that didn't get their crashes uploaded, the minidump (dmp) files got placed in a different directory from the one configured with sentry_options_set_database_path
. The directory they get placed in is [user]/Local/AppData/CrashDumps
. It only happens for some users, but for those this is consistent.
Does anyone know what could be the cause of this? Or what I could do to debug it further? I've looked in the documentation and source code, but I haven't been able to figure out how I get the crashpad_handler to log stuff.
Thanks in advance!
When does the problem happen
- [ ] During build
- [ ] During run-time
- [X] When capturing a hard crash
Environment
- OS: Microsoft Windows 10 Business, 64 bit, Version 10.0.19045 Build 19045
- Compiler: Visual Studio 2022, 17.3.6
- CMake version and config: 3.25.2, -DBUILD_SHARED_LIBS=OFF -DSENTRY_BUILD_RUNTIMESTATIC=ON -DSENTRY_BACKEND=crashpad
Steps To Reproduce
It's only reproducible on some machines, but this is how I set up the crash handler in my app:
sentry_options_t* options = sentry_options_new();
sentry_options_set_dsn(options, "<a dsn value>");
sentry_options_set_debug(options, true);
sentry_options_set_environment(options, "Development");
sentry_options_set_release(options, "0");
sentry_options_set_handler_path(options, "<working directory>");
sentry_options_set_database_path(options, "<working directory>/sentry-db");
sentry_init(options);
sentry_set_transaction("<app name>");
sentry_set_level(SENTRY_LEVEL_WARNING);
Log output
I've compared the Sentry Native output between a computer where everything is working fine and one where it is not, and they are identical. I don't know how to get output from the crashpad_handler.
Hi @mntt-SSE! I could not look into this on any of my Windows machines yet, but it seems you can access both types of machines to verify behavior and configuration.
Can you look in the registry to see whether the machines differ in the settings for HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps
? I could imagine some unexpected interaction with the crashpad WER module.
Thanks for your answer! On a machine where the problem exists, the LocalDumps entry looks like this:
And each entry has a DumpFolder
key with the data %PROGRAMDATA%\NVIDIA Corporation\CrashDumps
. All of them are the same, and none seem to point to the application I'm using, and the DumpFolder
is also different from the one where I get the crash dumps (although that folder also exists, but is empty).
On a machine where Sentry works as expected, no LocalDumps
entry exists at all under Windows Error Reporting
. I verified with 2 machines that don't work and 1 that does work, and they fit the pattern.
Thx for verifying, @mntt-SSE; that is what I expected. LocalDumps
bypasses application-local handling in favor of a centralized dump approach. It should be possible to deactivate it for your application by creating a new key (similar to the other apps below the LocalDumps
key), but I haven't tested it yet. Here are the docs for the key: https://learn.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps
And each entry has a
DumpFolder
key with the data%PROGRAMDATA%\NVIDIA Corporation\CrashDumps
. All of them are the same, and none seem to point to the application I'm using, and theDumpFolder
is also different from the one where I get the crash dumps (although that folder also exists, but is empty).
Maybe I should also say, that what is interesting are the values in LocalDumps
, not the sub-keys which probably just provide a counter config because those affect all applications not listed below.
For further root-causing, it would certainly help if you could also gather the debug output of the Native SDK on the systems where the dump happens in %LOCALAPPDATA%\CrashDumps
; maybe something obvious is already visible there.
The LocalDumps
key itself does not contain any values other than (Default), which has no data.
I have so far tried to:
- Create a key under
LocalDumps
with the name of the application, and set itsDumpType
value to 0. - Create a key under
Windows Error Reporting
calledExcludedApplications
, and add a DWORD value with the name of the application as the key, and 1 as the value. - Under the
ExcludedApplications
key, create another key with the name of the application. - Call the Windows API function
WerAddExcludedApplication
. This did the same as 2.
I tried restarting the computer after each step, but nothing worked :/ I also checked the corresponding keys under HKEY_CURRENT_USER
, but there is nothing there.
I tried attaching WinDbg to crashpad_handler.exe but I don't seem to get any logging from it in the Logs widget. Do I have to build it with a certain configuration or flag? I should also mention that I haven't used WinDbg before so I might be missing something.
I tried attaching WinDbg to crashpad_handler.exe but I don't seem to get any logging from it in the Logs widget. Do I have to build it with a certain configuration or flag? I should also mention that I haven't used WinDbg before so I might be missing something.
What I meant was that even though you have compared the two logs, which were identical on both types of machines, it would be great to see them.
I am asking because the only other reason I can imagine this happening is if our backend didn't start correctly (and for some reason, sentry_init()
was still successful). Can you check if sentry_init()
returns successfully (i.e., doesn't return 1
) on those machines? From what you described, I don't think the crashpad_handler
is involved in the crash process. It looks like your crash escapes our local exception filter (or rather: that ours wasn't installed and instead the OS handles the crash).
The LocalDumps key itself does not contain any values other than (Default), which has no data.
That means that it is configured to use the default values.
What I meant was that even though you have compared the two logs, which were identical on both types of machines, it would be great to see them.
Ah, sorry, here is what's being printed on startup from a computer that doesn't work (which is the same as when it works):
[2023-04-24 16:30:29] Info: Sentry: "received response:
HTTP/1.1 200 OK
Connection: keep-alive
Date: Mon, 24 Apr 2023 14:30:28 GMT
Content-Length: 2
Content-Type: application/json
Server: nginx
Vary: origin,access-control-request-method,access-control-request-headers
access-control-allow-origin: *
access-control-expose-headers: x-sentry-error,x-sentry-rate-limits,retry-after
x-envoy-upstream-service-time: 0
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
" (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "request handled in 520m" (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "executing task on worker threa" (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "sending request using winhttp to <address>:
x-sentry-auth:<auth> sentry_key=<key>, sentry_version=7, sentry_client=sentry.native/0.5.3
content-type:application/x-sentry-envelope
content-length:17594 " (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "received response:
HTTP/1.1 200 OK
Connection: keep-alive
Date: Mon, 24 Apr 2023 14:30:28 GMT
Content-Length: 41
Content-Type: application/json
Server: nginx
Vary: origin,access-control-request-method,access-control-request-headers
access-control-allow-origin: *
access-control-expose-headers: x-sentry-error,x-sentry-rate-limits,retry-after
x-envoy-upstream-service-time: 0
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
" (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "request handled in 243m" (level=-1)
I am asking because the only other reason I can imagine this happening is if our backend didn't start correctly (and for some reason, sentry_init() was still successful). Can you check if sentry_init() returns successfully (i.e., doesn't return 1) on those machines? From what you described, I don't think the crashpad_handler is involved in the crash process. It looks like your crash escapes our local exception filter (or rather: that ours wasn't installed and instead the OS handles the crash).
sentry_init()
returns 0
on all machines I've tested it on. And I can also see that the crashpad_handler.exe process is starting when the application is started, and quits when the application crashes (or when it exits nicely).
Ah, sorry, here is what's being printed on startup from a computer that doesn't work (which is the same as when it works):
Okay, the logs are wrapped, interleaved, and incomplete. They also show some other transmission (maybe a manual event) but not the crash (because with crashpad
, you would not see a transmission from within our worker_thread
).
Maybe it is easier to debug this issue with the plain sentry_example.exe
. In any case, I would expect a program configured as you described it in your initial description to print all these things - at some point in its execution - until it crashes:
[sentry] INFO using database path "your_database_path"
[sentry] DEBUG starting transport
[sentry] DEBUG starting background worker thread
[sentry] DEBUG starting backend
[sentry] DEBUG background worker thread started
[sentry] DEBUG starting crashpad backend with handler "your_handler_path"
[sentry] DEBUG using minidump url "your_dsn_url"
[sentry] DEBUG registering crashpad WER handler "your_wer_module_dll"
[sentry] INFO started crashpad client handler
[sentry] DEBUG processing and pruning old runs
[sentry] INFO flushing session and queue before crashpad handler
[sentry] INFO handing control over to crashpad
Since you sentry_init()
successfully (at least according to the return value), I am especially interested if you see the last two lines. If those never appear, then the client-side crash-handler is never invoked, which means that the crashpad_handler
will not be triggered (even if it is running).
You can also verify this externally by deleting your database path, executing your program, provoking a crash, and looking into (the newly created) path; there must be a last_crash
marker file in there.
I have been plagued by the same issue, that some machines just refuse to send any reports. After reading this discussion and talking to @supervacuus on discord I spent a day trying a bunch of things:
To start by answering the last question, if the 2 last lines of the sentry output: Those lines does appear on working machines and does not on the machines that doesn't work. This indicates that the UnhandledExceptionFilter
is not run, so I started trying to figure out why.
The first thing that proved that it was not a Windows issue (as a lot of blogs and discussions around the web indicates it could be) was that it worked to do sentry_reinstall_backend()
just before triggering the crash.
So something in the startup process overwrites our exception handler, and I'm pretty sure it is not any of our dependencies since it should behave the same for all machines in that case, and I couldn't find any reference to SetUnhandledExceptionFilter
. Not sure what else it could be, some driver or something? I tried inspecting loaded dlls but didn't see anything significant, but I assume it could be something that gets loaded and then released during startup.
I tried the sentry_example.exe
which seems to work fine on all machines, so I would assume it has to do with it being a gui application.
I finally tried replacing the SetUnhandledExceptionFilter
with a simple return by overwriting the function address and it works perfectly, but it feels very sad. Searching a bit on github though this seems to be the way people do it. Maybe it is something sentry-native should do to increase its reliability? I assume if you use sentry-native you want it to have full control of these things. But at the same time it is a bit harsh...
Thank you @Cyriuz and @supervacuus for looking into this!
I have also done some more investigation. Firstly, the reason I didn't get the full Sentry logs was because I initialized Sentry before the other log handling. When I switched the order I got output identical to yours @supervacuus, except for the registering crashpad WER handler "your_wer_module_dll"
, where it says no WER handler is registered.
Secondly, I have found a workaround. I'm building a Qt application (Qt 5.15.8), and I noticed that if I initialize Sentry after the first window is shown (using QWidget::show()
), then everything works as expected, which strengthen your theory about it haviing to do with UI @Cyriuz . So now I'm flashing a tiny window before the main one, which is far from ideal, but at least I'm getting the events. Would still be good to have another solution though.
If you run this after sentry_init
it works for me:
void lockUnhandledExceptionFilter()
{
HMODULE kernel = GetModuleHandle("kernel32.dll");
if (!kernel)
{
printf("[Warning] LockUnhandledExceptionFilter: Failed locating kernel32 module.\n");
return;
}
FARPROC setUnhandledExceptionFilter = GetProcAddress(kernel, "SetUnhandledExceptionFilter");
if (!setUnhandledExceptionFilter)
{
printf("[Warning] LockUnhandledExceptionFilter: Failed locating SetUnhandledExceptionFilter.\n");
return;
}
#ifdef Q_OS_WIN64
static const BYTE newBody[] = {
0x33, 0xC0, // xor eax,eax
0xC3 // ret
};
#else
static const BYTE newBody[] = {
0x33, 0xC0, // xor eax,eax
0xC2, 0x04, 0x00, // ret 4
};
#endif
SIZE_T bytes = 0;
if (WriteProcessMemory(GetCurrentProcess(), (LPVOID)setUnhandledExceptionFilter, (LPCVOID)newBody, sizeof(newBody), &bytes) == 0)
{
printf("[Warning] LockUnhandledExceptionFilter: Failed writing process memory.\n");
}
}
I tried to simplify the other examples I found as much as possible but I'm no Windows programmer so take it with a grain of salt.
The best way to get to the bottom of what it is that clears the filter would be to make a proper hook, and print the callstack in there, but its somewhat involved and I'm not sure what I would do with the info anyway since I guess it theoretically could be multiple things.
To get the WER module you have to build sentry-native with CRASHPAD_WER_ENABLED
That is some very good feedback, thanks for the investigation!
We have indeed heard reports before that Qt / display drivers are constantly overwriting the UnhandledExceptionFilter
, similar to managed language runtimes.
While it would be possible to hook / lock overwriting it, we decided against that, as it is a very deep "incision" that we don’t want to do by default.
Fair enough. Although I wonder if it can still be considered somewhat standard as so many other projects seem to do the same? And considering that sentry-native is, in my understanding at least, supposed to take ownership of the entire crash reporting pipeline I feel like there might be an argument for including it, I guess it could even be optional? Or maybe it could just be a good topic for a page on the sentry docs as a workaround so people can find it easier? I assume it potentially affects most people that use sentry native on Windows.
IMO hooking the UnhandledExceptionFilter
completely is a really big hammer. We might be breaking apps in surprising ways that we can’t predict yet.
There are legitimate uses for temporarily using it and then restoring the original handler afterwards. We don’t want to break those usecases. Although as the examples show, restoring the original handler is often not done correctly by other tools either.
We might want to offer this either as a compile time flag or an explicit runtime API though, wdyt @supervacuus ? Then the responsibility of (not) breaking any unpredictable usecases is up to the app author.
Just going to throw in my 2 cents here. Had a very similar issue with regards to QT (5.15.8), the moment it was initialized the entire crash reporting seized to function. There we're no registry keys, or misplaced dumps, in fact there was absolutely nothing.
A quick test basically confirmed what I suspected, I would throw and exception right after sentry initialized, it worked. Did the same right after calling QApplication a(argc, argv);
nope back to being broken.
Weirdly enough it worked on some Windows machines just fine (e.g: server, ...) but on others basically nothing worked. No dumps, no logs (only the one indicating that sentry had initialized), no indication that something went wrong whatsoever.