magiclantern_simplified
DIGIC8: Crashes related to STATE_OBJECT_HOOKS and EVF_STATE
Both R and RP (untested on M50) have random crashes related to the EvfCap task. Recently @coon42 got a nice trace [1] that sent us into `stateobj_lv_spy()` from `state-object.c`. Disabling the use of state objects (implemented in a89d71f97c620dd93a2a24098b7c0de58da59445) mitigates the issue, but this needs further investigation.
`stateobj_lv_spy()` is the replacement state transition function that we install in `EVF_STATE`. Quick static analysis yielded nothing, except that we might be oversimplifying. The real state transition function has a couple of checks that, if I'm not mistaken, we have not implemented.
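For readers unfamiliar with the hooking pattern, here is a minimal host-side sketch of what "installing a replacement state transition function" means. It is a simplified model, not the code from `state-object.c`: the struct layout, `firmware_transition`, `spy_transition`, `hook_state_object` and `spy_calls` are all hypothetical names, and a real state object is more involved than a single function pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical model of a state object: one function pointer
 * decides how the object reacts to an input event. */
struct state_object;
typedef int (*transition_fn)(struct state_object *self, int input);

struct state_object {
    const char *name;
    int current_state;
    transition_fn transition;
};

/* Stand-in for the firmware's own transition function (assumption). */
int firmware_transition(struct state_object *self, int input)
{
    self->current_state = input;
    return 0;
}

/* Saved original, so the spy can forward every call to it. */
static transition_fn original_transition = NULL;
int spy_calls = 0;

/* Replacement: forward to the original first, then observe.
 * Anything slow done after the forward call delays the caller. */
int spy_transition(struct state_object *self, int input)
{
    assert(original_transition != NULL);
    int ret = original_transition(self, input);
    spy_calls++;            /* placeholder for per-event callback work */
    return ret;
}

/* Install the spy once; returns 1 if installed, 0 if already hooked. */
int hook_state_object(struct state_object *obj)
{
    if (obj->transition == spy_transition)
        return 0;
    original_transition = obj->transition;
    obj->transition = spy_transition;
    return 1;
}
```

The point of the model is that the spy runs in the caller's context on every event, so any checks the original performs before or after the transition, and any timing it depends on, are affected by what the wrapper does.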
[1]
7968.524 in menu_open
8090.038 [LVEVFC] ERROR SendEventEvfDev : [12][e005edfb]
8090.081 [STARTUP] ERROR ASSERT : LiveView::EvfCapState.c
8090.106 [STARTUP] ASSERT : Task = EvfCap
8090.109 [STARTUP] ASSERT : Core 0
8090.114 [STARTUP] ASSERT : Line 370
8090.120 [STARTUP] < StackDump >
8090.123 [STARTUP] SP: 0x00213D44
(...)
[DM] FROM Write Complete!!!
3667: 38111.302 SHUTDOWN REASON 1
Looks like we still see some similar crashes on RP. Requires more in-depth testing.
Possibly useful for diagnosis, a library for producing much more detailed stack information: https://www.mcternan.me.uk/ArmStackUnwinding/
Might want a separate ticket, depending on how hard it is to integrate.
Confirmed on other Digic 8 models. Not tested on Digic X, as it doesn't run LV overlays yet.
Disabling the use of state objects (mentioned in the first post) did not fix the issue; the apparent fix was a fluke, due to randomness or some unknown condition that triggers the crash.
In general, all cases are related to some vsync callback timeout.
Hmm, improved stack traces might not help much with a callback timeout. Still worth a try.
Perhaps a better debugging approach would be a minimal ML with only logging facilities (dm_set_store more stuff at level 3 or 1?), then compare the logs with and without it to see what is different.
Alternatively, it may simply be that we hold a lock for too long, or otherwise do too much processing in an important window for DryOS. That sounds boring but is easy to investigate: pare back ML until the crash disappears, so we can work out which area we're being too demanding in. Quite plausibly our RGB / YUV buffer code.
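A cheap complement to paring things back would be to measure how long our code runs inside the hook and count overruns against a budget. The sketch below is only an illustration under assumptions: `hook_timing` and `hook_timing_record` are hypothetical names, the timestamps are assumed to come from some free-running 32-bit microsecond counter, and the budget value is something we would have to guess at, since we don't know the real vsync deadline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for time spent in one hooked callback. */
struct hook_timing {
    uint32_t budget_us;   /* guessed upper bound for one call */
    uint32_t worst_us;    /* longest call seen so far */
    uint32_t overruns;    /* number of calls over budget */
    uint32_t calls;       /* total number of calls recorded */
};

/* Record one call, given timestamps taken just before and just after it.
 * Unsigned subtraction keeps the result correct across counter wrap. */
void hook_timing_record(struct hook_timing *t,
                        uint32_t start_us, uint32_t end_us)
{
    assert(t != NULL);
    uint32_t elapsed = end_us - start_us;
    t->calls++;
    if (elapsed > t->worst_us)
        t->worst_us = elapsed;
    if (elapsed > t->budget_us)
        t->overruns++;
}
```

Only counters are updated inside the hot path; printing them would be deferred to a low-priority task, so the measurement itself adds as little as possible to the window being measured.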