hammerspoon Root causing our "weird" crash logs

So, this is not the first time I've decided I'm determined to get to the bottom of why we always have some "weird" crash reports, and it probably won't be the last time, but I'd like to discuss it again anyway.

Tagging @asmagill for his excellent thoughts.

Here are the "weird" crash reports received thus far for 0.9.84:

https://sentry.io/organizations/hammerspoon/issues/2264596621/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
https://sentry.io/organizations/hammerspoon/issues/2264749660/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
https://sentry.io/organizations/hammerspoon/issues/2264676679/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
https://sentry.io/organizations/hammerspoon/issues/2263890376/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
https://sentry.io/organizations/hammerspoon/issues/2264076080/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
https://sentry.io/organizations/hammerspoon/issues/2263909041/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
https://sentry.io/organizations/hammerspoon/issues/2252842063/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
https://sentry.io/organizations/hammerspoon/issues/2262301597/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
https://sentry.io/organizations/hammerspoon/issues/2260433876/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
https://sentry.io/organizations/hammerspoon/issues/2254695465/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d

Here are some evidence points I have so far:

They all start with an OS callback that leads to us calling protectedCallAndTraceback. There are a couple of others that happen during the first load of init.lua, but we'll ignore those for now.
Generally speaking they are all something to do with pointers not being what Lua is expecting (either NULL or garbage), which strongly suggests (imo) that this is some kind of lifecycle issue - ie they're storing a reference to something that's been subsequently deallocated.
We don't get very many of these - in the week since 0.9.84 was released these particular crashes only number around 20, out of thousands of user sessions. Sentry doesn't break down a ton of detail here (in part because we deliberately don't enable active session tracking), but they report well over 99% of sessions are crash free. Still, it annoys me to know that these crashes all hint at some kind of lifecycle bug we haven't found yet.
Looking at the backtraces, they mostly fall into either hs.timer or hs.eventtap callbacks, but there are others, which suggests to me that it's not a specific bug in either of those modules, but rather a systemic flaw.
We have protections against (at least):
- trying to call Lua off the main thread
- lua_State having been replaced since the callback was set up, at least in hs.timer (via the relatively new checkLuaSkinInstance in 0.9.83)
- Lua stack overgrowth via the _lua_stackguard macros
Looking at the breadcrumb logs from the crashes, it's not generally the case that the crash happens imminently after a LuaSkin reload
Because they all start from C, they call [LuaSkin sharedWithState:NULL] which guarantees them the main Lua State, they're not coming in on a co-routine state (this kind of crash report predates co-routines anyway AFAIK, but we don't have historical data to back that up). This would also be covered by the LuaSkin UUID check, at least in hs.timer.

So, it seems relatively safe to say that these aren't happening when Lua is being torn down, and their relative consistency makes me suspect they're not stack/heap corruption, although I can't rule that out. I have some hypotheses:

They are happening at some transition period between a co-routine lua_State and the main lua_State (I have no idea how this works under the hood, maybe it's not possible)
Somewhere we're running the Objective C event loop in one of our modules, causing it to advance and process incoming events while we're in the middle of a Lua->C situation and the Lua state gets confused
We just have one or more memory corruption bugs in some C code somewhere
We have a very subtle Lua API interaction bug
We're hitting genuine bugs in Lua (seems unlikely that they would persist for this long without other much fancier users of Lua, noticing)

My bet is on 2, 3 or 4, but I don't have any evidence to support that yet.

So....... thoughts? :)

Mar 11 '21 16:03 cmsj

I'm genuinely considering investigating what it would take to ship a single release with all of the malloc/zombies/sanitiser debugging features enabled, just to see if the nature of these reports changes (ideally in a way that gives us more information to go on).

Edit: Turns out our release builds have partial ASAN enabled, Main Thread Checker, and Zombie Checker, enabled already.

Mar 11 '21 16:03 cmsj

Since we're apparently already paying the (approx 2x) performance cost of the Address Sanitiser in release builds, #2760 enables some of the additional options. I think we should carry these all for a single release that lasts 1-2 weeks and see what, if any, difference it makes to the quantity/quality of the crash reports we get, then revert the change. Any objections?

Mar 11 '21 16:03 cmsj

No objections; It will take me a little longer to go through the items above and the list of crashes to see if I have any immediate insights, but as to capturing more detail in crashlogs, I'm all for seeing if this helps.

Mar 11 '21 16:03 asmagill

They are happening at some transition period between a co-routine lua_State and the main lua_State (I have no idea how this works under the hood, maybe it's not possible)

Coroutines that work with the Hammerspoon modules are a pretty new thing... and even so, I doubt more than a handful of our users are using them yet. And even coroutine lua code runs on the main thread, so until a coroutine yields, which is the proper lua way of transitioning out of a coroutine, a queued event callback can't run.

Somewhere we're running the Objective C event loop in one of our modules, causing it to advance and process incoming events while we're in the middle of a Lua->C situation and the Lua state gets confused

Unless they're using my hs._asm.extras module (which I don't hide, but I don't promote, either... it's usually where I try out oddball things to see if they are worth pursuing further in a more official place), I don't think so... unless it's a side effect of another Objective-C operation, but I'd expect that to be clearly noted in the API docs.

Once I got co-routines working, I abandoned my testing with advancing the main event loop -- it introduces too much uncertainty, in my opinion.

I'm in agreement that 3 or 4 seem most likely, though.

I've also noticed in some of my own debugging that occasionally if invalid data is passed into an Objective-C method, it might be as many as 10 levels deeper in the traceback before the actual crash -- for example passing in a NULL value to a method marked non-nullable -- it's not always the called method but something it calls (or that calls, or...) before the crash occurs. It might be worth biting the bullet and addressing all of the warnings generated when building Hammerspoon... I know most are already set to cause an error during build, but not all are.

Would it be worth renaming all of the internal.m files so that even in a minimalist crashlog it's more clear which specific module contains the crashing code? It'd be an annoying change requiring modifications to almost all module files, lua and Objective-C, but maybe its time to bite that bullet as well?

Mar 11 '21 17:03 asmagill

Couple of random thoughts while doing the dishes:

Lua does a lot of work before it calls a function, we essentially just call _pcall() - are we missing a trick there?
If we wrap debug.traceback we could capture protectedCallAndTraceback tracebacks as breadcrumbs. Not sure that this would be useful, but an extra datapoint might reveal an error pattern preceding these crashes

Mar 11 '21 19:03 cmsj

lua_pcall is supposed to do that work for us, including guaranteeing a return (rather than a long-jump) on error.

Wrapping debug.traceback would at least make it clear if the crash occurs before the callback invokes lua or after... if it's before (or during, though during would put at least one of the lua_?call functions on the stack) then there would be no breadcrumb. But does that potentially leak private info?

Mar 11 '21 20:03 asmagill

potentially it does, yes, in the sense that we'd see function names and filenames

Mar 11 '21 20:03 cmsj

So, I've checked through the codebase and there's only one place we call CFRunLoopRun() - for enumerating mice in hs.mouse and it looks like we're doing it properly (ie with a custom event loop, so it shouldn't be processing any other events).

I did notice that when we do CFRunLoopAddSource() we're not very consistent about which runloop we ask to run on - sometimes it's CFRunLoopGetMain() and sometimes it's CFRunLoopGetCurrent(). In most cases those are probably the same, since GetCurrent() returns the runloop for the current thread, but we might want to review these just to make sure.

Mar 11 '21 20:03 cmsj

Similarly we have some inconsistencies between adding source with either kCFRunLoopDefaultMode or kCFRunLoopCommonModes

Mar 11 '21 21:03 cmsj

regarding the modes, I vaguely remember uncovering an issue with the mode used for timers when I was working on a menubar replacement (development on it stalled, because we fixed some errors in ours, but I eventually want to get back to it)... my replacement allowed for changes to the menu while it was open but timer's didn't fire while the menu was showing -- eventtaps did, though, which is why I looked at the modes and noticed the difference. I'll try to dig up the specific issue number when I get back home tonight, but if I recall my finding correctly, there were no issues when I changed the timer modes during my initial tests...

A long winded way of agreeing we should standardized, but I think we want to standardize on what eventtap uses in this case -- I'll confirm tonight.

Mar 11 '21 21:03 asmagill

Food for thought... as CommandPost has a consistent Lua codebase (as in, very few, if any users modify the Lua code), it might be worth having a look at the CommandPost Sentry to see if there's any similar crashes between CommandPost and Hammerspoon?

CommandPost is pretty much always in-sync with the master Hammerspoon branch too, with the note that I generally update pods more regularly in CommandPost.

Mar 11 '21 21:03 latenitefilms

@latenitefilms interestingly, you don't seem to have the same sorts of crashes in CommandPost

Mar 11 '21 22:03 cmsj

I'm going to put out a release now, to get these updated sanitiser/assert things in the wild, and we can see what reports come in over the next week (assuming it doesn't all explode immediately!)

Mar 11 '21 22:03 cmsj

@latenitefilms interestingly, you don't seem to have the same sorts of crashes in CommandPost

Well, that is... odd, given it's the same codebase, and I'm almost using every Hammerspoon extension.

Mar 11 '21 23:03 latenitefilms

I take it as a positive sign - it likely means that people are using our API in ways we didn't think of and are triggering weirdness.

The challenge now, is to figure out what it is. If one/several of them would file GH issues, this would go a lot quicker, I suspect!

Mar 11 '21 23:03 cmsj

So there's an interesting crash report now: https://sentry.io/organizations/hammerspoon/issues/2271630958/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d

It's only happening for one user so far, but the symptoms are super weird. It looks likeeverything goes wrong at MJLua.m:881 which is trying to save references to the completion word table and evalfn, but the check in luaRef for lua_isnil() (which is actually just lua_type() fails with an invalid index. The index we ask for is -1, which suggests that somehow setup.lua is returning nothing.

@asmagill does that look right to you? I'm struggling to see how a malformed init.lua could cause this, but it certainly looks like it's possible somehow

Mar 14 '21 01:03 cmsj

@cmsj, can't fault your logic... I've come to the same conclusion. The only time I've seen something similar is when making updates to hs._coresetup or setup.lua directly, though usually that results in errors, not a shortened stack.

You could check the stack size before MJLua.m line 881 and throw an alert like we do for other setup issues, if you wanted to guard against this happening again and have a more "friendly" response... I'd also probably verify that they're the appropriate return types as well, just to catch all show-stopper possibilities.

In Skin.m, if you're concerned about the -1 index, you could wrap it with lua_absindex, but I suspect all that would do is shift where the error is triggered.

I also noted in Skin.m line 508 that if you do return early because the item to be stored is nil, it isn't removed from the stack while a successful call does... is this intentional?

Mar 14 '21 03:03 asmagill

Agreed on checking the size and types of the stack. I'll get going on that.

There are some more crash reports rolling in today, which I'll be diving into later, but I'm also starting to think about how we can encourage the (very few) people who are getting these crashes, to talk to us. I believe Sentry offers some kind of UI for users to send a message with the crash report, so I'll look into that, but if anyone else has suggestions for ways to improve the communication here, I'm all ears :)

Mar 14 '21 14:03 cmsj

Some random thoughts/questions:

Is it worth keeping a counter of how many Lua instances are created (i.e. how many reloads), or help narrow down if things are caused by an initial load of Hammerspoon, versus many reloads later?
I'm assuming this hasn't been implemented due to privacy concerns, but I'm assuming it's possible to include the last Lua command triggered (from either the Console or loaded .lua file) in the crash logs to help with fault-finding?
Is it worth removing setup.lua all together, and triggering that same code in Objective-C land? That way we can know for certain that the setup process is complete, and that _coresetup/init.lua is the first thing loaded. In theory, it would be great if Hammerspoon could still "load" even if coresetup fails for whatever reason.

Mar 15 '21 00:03 latenitefilms

Also, FYI - I can no longer access Hammerspoon's Sentry account - it just says:

Authentication error: You do not have access to the required GitHub organization.

I assume this is because I'm not a member of Hammerspoon on GitHub.

I probably don't need access, so all good - but just wanted to let you know.

Mar 15 '21 00:03 latenitefilms

Since we're apparently already paying the (approx 2x) performance cost of the Address Sanitiser in release builds

Ah, interesting... does this mean I can double the performance of Hammerspoon/CommandPost, simply by disabling the Address Sanitiser?

Mar 15 '21 00:03 latenitefilms

Is it worth keeping a counter of how many Lua instances are created (i.e. how many reloads), or help narrow down if things are caused by an initial load of Hammerspoon, versus many reloads later?

It couldn't hurt!

I'm assuming this hasn't been implemented due to privacy concerns, but I'm assuming it's possible to include the last Lua command triggered (from either the Console or loaded .lua file) in the crash logs to help with fault-finding?

It's actually non-trivial - we'd essentially need to maintain a version of luaL_traceback that returns the traceback as a string rather than pushing it onto the Lua stack.

Is it worth removing setup.lua all together, and triggering that same code in Objective-C land? That way we can know for certain that the setup process is complete, and that _coresetup/init.lua is the first thing loaded. In theory, it would be great if Hammerspoon could still "load" even if coresetup fails for whatever reason.

Certainly our startup process has gotten very complex and fragile. I can definitely see an argument for separating out coresetup from the user's init.lua.

Also, FYI - I can no longer access Hammerspoon's Sentry account - it just says:

I just asked it to send you a link, but let me know if it still doesn't work. I also added you to the Hammerspoon org on github. I'm pretty sure you're well past the point where that is deserved :)

Ah, interesting... does this mean I can double the performance of Hammerspoon/CommandPost, simply by disabling the Address Sanitiser?

Turns out I was wrong, the Release scheme only has all those things enabled for Test builds. I'm going to experiment with whether I can produce a full release build with all those things enabled though - even if it's just for a week or two, I want to catch more crashers. The current release has shaken out a few more, from having Lua's assertions fully enabled.

Mar 15 '21 00:03 cmsj

@cmsj, forget my comment above re timer and modes... the issue I was referencing was #1598 and it's already been addressed. And from the discussion there, my vote is standardizing on common modes.

Mar 21 '21 11:03 asmagill

Ok, so now that 0.9.88 has been out for a few days, with the explicit Sentry events when checkLuaSkinInstance detects a UUID mismatch, I'm starting to see those explicit events: https://sentry.io/share/issue/33c1e2697b6548adb19633195ff181ca/

The log also shows that the watchers' luaSkinUUID is empty, which suggests (although not conclusively) that its __gc() was called, but somehow the OS listener wasn't removed. I now realise that I made a huge mistake by initialising luaSkinUUID to all zeros at initialisation and then setting luaSkinUUID[0] to \0 in __gc(), so I'll rework that so it's never empty (ie at startup I'll set it to one string, and at __gc() I'll set it to another, so we can tell exactly which lifecycle events have/haven't happened).

The reported events thus far have only been for hs.audiodevice and I would note that we're not checking the OSStatus return codes of AudioObjectRemovePropertyListener() so maybe somehow the OS is failing to remove all of them (in which case I guess we would have no choice but to intentionally crash).

The only other thing I have in mind is how we could run some tests that create a lot of watchers, cause a lot of events to happen, and reload the config a bunch of times, to try and provoke these crashes more directly, for debugging purposes.

Thoughts anyone? :)

Apr 29 '21 10:04 cmsj

FWIW, I personally very rarely see CommandPost crash during reloads. The main time I see it crash is after waking my laptop from sleep. It'll be running when I put the Mac to sleep, but when I wake up, it's no longer running. Sentry tells you when most recent sleep event happened, right?

Did you end up already implementing a reload counter in the Sentry logs? If not, that might be interesting to see how many times Hammerspoon is reloaded before a crash occurs.

Apr 29 '21 12:04 latenitefilms

I don't have a counter yet, and curiously, a lot of the crashes seem to be happening long after a reload (as much as a couple of hours later, in one instance).

Apr 29 '21 12:04 cmsj

Good people of the Internet, I believe I have finally root-caused at least one variant of these crashes.

I was looking through the 8 hs.audiodevice related instances of checkLuaSkinInstance failing and I noticed that in one of them, while some time had passed between starting HS and the crash, at no point had Lua been reloaded.

That ruled out the idea that these objects were leftovers from a previous Lua instance. However, as I was looking around the hs.audiodevice code, I looked at audiodevice_callback() and started thinking about what our use of dispatch_async() actually means is going to happen.

The code will run on the next iteration of the Objective C runloop, but that won't be until Lua has finished doing whatever it was already doing, and that led me to the realisation that if Lua is working on stuff that happens to trigger one of these C callbacks, it's also possible that while Lua is then still doing other things, it might decide to garbage collect the object.

This leaves you in a situation where Lua has discarded an object, but Objective C is going to do something with that object on its next runloop iteration. Since Lua doesn't seem to defensively wipe the memory used by a userdata object, it's still likely to be present enough for the callback to be able to do something with the pointer, but it quickly goes off the rails.

I'm not 100% sure if this hs.audiodevice instance is identical to the ones that had been plaguing hs.timer and hs.eventtap, but I can well believe it is, and I guess we'll find out as 0.9.88 rolls out more widely and the Sentry data starts coming in.

For reference, the reproducer is:

foo = hs.audiodevice.defaultOutputDevice()
foo:watcherCallback(function(uid, event, selector, scope, element) print(uid, " :: ", event) end)
foo:watcherStart()
foo:setVolume(80)
foo = nil
collectgarbage()
collectgarbage()

(it only works if your audio volume is something other than 80% to start with)

The :setVolume() triggers the C portion of the watcher callback, which schedules part of itself to run on the main thread via dispatch_async(), but before that happens, foo (ie the hs.audiodevice object) is forcibly garbage collected.

So, now the hard part comes:

Do we switch to dispatch_sync() or do we stick with async and rely on checkLuaInstance to protect us? @asmagill this is definitely one I'd appreciate your help with!

Apr 29 '21 23:04 cmsj

Awesome detective work!

Will leave @asmagill to discuss with you best plan of attack, as this is all very above my pay grade.

Curious... What's the disadvantage of using dispatch_sync()? Will it potentially slow things down or degrade performance? As Lua is single threaded, all the callbacks will be waiting on Lua to finish doing whatever it's doing anyway, right?

Apr 29 '21 23:04 latenitefilms

@latenitefilms it's coming up on half-midnight here, I am not sufficiently brained to reason about that right now, but I believe the original thought behind using dispatch_async() was to let the C portion of the callback complete ASAP and do the Lua portion of the callback later. For things like hs.eventtap this actually matters because the OS will start dropping callbacks that take too long to execute, and there's potentially no upper limit on how long the Lua portion might run for.

Apr 29 '21 23:04 cmsj

That make complete sense, in which case, relying on checkLuaInstance seems like a good solution?

Apr 29 '21 23:04 latenitefilms