yappi
yappi copied to clipboard
Fatal access violation in clear_stats (not thread-safe/coroutine-safe?)
I have a FastAPI webapp that has been experiencing intermittent crashes (with no logs) for a fairly long time, and these just recently started to become frequent enough for me to start looking into the issue. This is the same webapp I was working on when I was prototyping asgi-server-timing-middleware, and as it turns out, the source of the crash (a segfault) originates from Yappi's code.
Windows fatal exception: access violation
Thread 0x00002cec (most recent call first):
File ".pyenv\lib\site-packages\yappi.py", line 1306 in clear_stats
File "app\middleware\ServerTimingMiddleware.py", line 77 in wrapped_send
...
File ".pyenv\lib\site-packages\uvicorn\server.py", line 60 in run
File ".pyenv\lib\site-packages\uvicorn\main.py", line 569 in run
File "launch.py", line 16 in <module>
Windows fatal exception: code 0xebadebad
(trimmed error log to only include the thread that raised a fatal error, though there are a few more)
Faulting application name: python.exe, version: 3.9.13150.1013, time stamp: 0x6283d009
Faulting module name: KERNELBASE.dll, version: 10.0.17763.6530, time stamp: 0x2853070e
Exception code: 0xc0000005
Fault offset: 0x0000000000041b39
Faulting process id: 0x2c64
Faulting application start time: 0x01dbc2f6d6742aed
Faulting application path: C:\Program Files\Python39\python.exe
Faulting module path: C:\WINDOWS\System32\KERNELBASE.dll
Report Id: b621ca45-11c6-4786-8b79-b82cd3dfe6d7
Faulting package full name:
Faulting package-relative application ID:
(Windows application error raised at the same time)
Since Yappi does not implement incremental stats clearing (see #42), the only way I have of freeing profiler memory is to clear it when a certain memory threshold is reached. The cause of that crash seems to be (I've not managed to reproduce it on purpose yet) that, when that threshold is reached and a request response enters the clear_stats section while the internal stats are in a bad state, possibly from having been cleared by a competing coroutine and/or some kind of ToC-ToU issue, I'm not entirely sure.
That error log was recorded with an older yappi version (1.3.6, I believe), though the crash is still occuring with the latest 1.6.10. It's just that getting logs for this, even with the fault handler turned on, is proving to be difficult.
0xebadebad is an odd error value; it's not a standard Windows one and it's not one of their heap debugging patterns either. I'm not sure where that would be coming from, and Google is returning very few hits when I look it up.
Hmm. Thanks for reporting this.
Any hints on the threshold? Is it like %90 of memory, I mean can we be hitting (somewhat) to any OOM related thing here?
And is there any way you reproduce this on prod? Maybe we can use some debug binaries?
Hmm. Thanks for reporting this.
Any hints on the threshold? Is it like %90 of memory, I mean can we be hitting (somewhat) to any OOM related thing here?
And is there any way you reproduce this on prod? Maybe we can use some debug binaries?
Oh, it's something small, like 50 MB, on a Windows server with 95 GB, I highly doubt it's the whole system going OoM. It's a very intermittent crash; I've only seen it in prod, but I've had it happen twice in one day, and I've had it not happen for weeks at a time (it has, in fact, not occured since May 15th as of right now), but no luck getting it to trigger reliably so far. If you have ideas on what I could try to better pinpoint the issue (or even see if Yappi being in the stack trace isn't actually a red herring, which I'm not currently ruling out), I'd be interested to hear them.
Well, clear_stats being the final function in the stack trace seems really like a red herring for me.
But, TBH: clear_stats is a simple, old function. See the only interesting bits are:
henum(contexts, _ctxenumdel, NULL);
...
delete_tls_key(tl_prev_ctx_key);
I would expect one of these fails somehow. From the crash dump, it seems like we are failing on KERNELBase.dll. IIUC, this means we are in the middle of some Win API call and we don't do too many OS APIs in yappi. That is why I am suspecting more on delete_tls_key(tl_prev_ctx_key); this tries to delete the thread local storage value. Maybe we should try commenting that line out to see it it fixes the problem?
BTW, did you see these crashes always on Windows?
BTW, did you see these crashes always on Windows?
The only application I have running at a scale where that seems to trigger is running on a Windows server, yeah. I don't think I currently have a suitable Linux server to deploy that same app to.
Well,
clear_statsbeing the final function in the stack trace seems really like a red herring for me.But, TBH:
clear_statsis a simple, old function. See the only interesting bits are:henum(contexts, _ctxenumdel, NULL); ... delete_tls_key(tl_prev_ctx_key);
Yeah, i was looking at the PY_Clear call earlier, thinking maybe that could be some kind of race condition, but that macro seems to have all the safety precautions needed and Yappi doesn't let go of the GIL during that operation, so there's really not a whole lot of reasons why this would break.
I would expect one of these fails somehow. From the crash dump, it seems like we are failing on KERNELBase.dll. IIUC, this means we are in the middle of some Win API call and we don't do too many OS APIs in yappi. That is why I am suspecting more on
delete_tls_key(tl_prev_ctx_key);this tries to delete the thread local storage value. Maybe we should try commenting that line out to see it it fixes the problem?
The application just crashed again this morning, this time with no logs on the Python side or in the Windows system logs, despite faulthandler being activated. I'm becoming less and less certain that this has anything to do with Yappi, though I would certainly like to know what the hell is causing these random silent crashes. I'll try going the "extreme" route and disable Yappi on that app for a week or two, to see if the crashes still occur then, which would at least rule it out entirely as the root cause.
At this point, I'm even starting to wonder if there might not be some kind of hardware fault at play.
I'll try going the "extreme" route and disable Yappi
Would be useful to know!
@sm-Fifteen , not sure if you remember or test this issue at all, but did any of the crashes happen?
@sm-Fifteen , not sure if you remember or test this issue at all, but did any of the crashes happen?
They've not reoccured since I've turned it off, but that also uncovered another issue where some watchdog was killing the FastAPI service if it was detected as having been running for more than 4 days (which explains why most "crashes" didn't cause anything to be logged). I've not had any unexpected downtime since, so I'll have to try reenabling Yappi to see if that causes the crashing issue to return.
Ok, so I re-enabled the profiler a few weeks ago, and the access violation errors have started coming back, albeit infrequently. It's the same issue as before: Windows fatal exception: code 0xebadebad and Windows fatal exception: access violation raised by faulthandler, and almost the same error report logged by Windows' system logs:
Faulting application name: python.exe, version: 3.12.12150.1013, time stamp: 0x68eebfe7
Faulting module name: KERNELBASE.dll, version: 10.0.17763.6530, time stamp: 0x2853070e
Exception code: 0xc0000005
Fault offset: 0x0000000000041b39
Faulting process id: 0x4100
Faulting application start time: 0x01dc52036b48d944
Faulting application path: AppData\Roaming\uv\python\cpython-3.12.12-windows-x86_64-none\python.exe
Faulting module path: C:\WINDOWS\System32\KERNELBASE.dll
Report Id: 369066f9-570f-4880-aa66-b63e96ab030e
Faulting package full name:
Faulting package-relative application ID:
Here's also the full faulthandler stack trace for a crash that occured this morning: Windows fatal exception access viol.txt. I notice there are two calls to _get_context_tag while the crashing thread is calling clear_stats. Could be a coincidence, but I figured I'd point it out.
I'm using yappi v1.6.10 on Python 3.12.12.
Thank you @sm-Fifteen for this precious info!
I will have a look and try to identify what might be causing this...
If it's of any help, I notice most of the few hits I get when searching for 0xebadebad (1, 2, 3) seem to relate to Oracle in some way (but not all, like 4). That could be a coincidence, although I do use Oracle in these queries. I don't get any such crashes when Yappi is disabled, though, so I'm really unsure.