
DIGIC8: Crashes related to STATE_OBJECT_HOOKS and EVF_STATE

Open kitor opened this issue 3 years ago • 4 comments

Both R and RP (untested on M50) have random crashes related to the EvfCap task. Recently @coon42 got a nice trace [1] that led us to stateobj_lv_spy() in state-object.c. Disabling the use of state objects (implemented in a89d71f97c620dd93a2a24098b7c0de58da59445) mitigates the issue, but this needs further investigation.

stateobj_lv_spy() is the replacement state transition function that we install in EVF_STATE. Quick static analysis yielded nothing, except that we might be oversimplifying: the real state transition function has a couple of checks that, if I'm not mistaken, we have not implemented.
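
For readers unfamiliar with the hooking pattern, here is a minimal host-side sketch of it. Everything in it is hypothetical: the struct layout, the single-argument transition signature, and the names mock_state_object, mock_spy and install_spy are simplified stand-ins, not the real definitions from state-object.c or Canon firmware. It only shows the general shape: save the original function pointer, install a wrapper, and have the wrapper call the original before observing the state change.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a state object.
 * The real firmware structure has more fields and a different signature. */
struct mock_state_object;
typedef int (*transition_fn)(struct mock_state_object *self, int input);

struct mock_state_object {
    const char *name;
    int current_state;
    transition_fn transition;   /* the function pointer that gets replaced */
};

static transition_fn original_transition = NULL;
static int spy_calls = 0;
static int last_old_state = -1;
static int last_new_state = -1;

/* Stand-in for the firmware's own transition function (assumption). */
static int mock_original_transition(struct mock_state_object *self, int input)
{
    self->current_state = input;
    return 0;
}

/* Replacement transition function: run the original first, then observe.
 * It runs in the task that drives the state machine, so it should stay short. */
static int mock_spy(struct mock_state_object *self, int input)
{
    int old_state = self->current_state;
    int ret = original_transition(self, input);

    last_old_state = old_state;
    last_new_state = self->current_state;
    spy_calls++;
    return ret;
}

/* Install the wrapper once. Installing twice would make the wrapper
 * call itself, so a repeated call is treated as a no-op. */
static int install_spy(struct mock_state_object *obj)
{
    if (obj == NULL || obj->transition == NULL)
        return -1;
    if (obj->transition == mock_spy)
        return 0;

    original_transition = obj->transition;
    obj->transition = mock_spy;
    return 0;
}
```

Any checks the original function performs still run here because the wrapper forwards to it unchanged; the risk discussed above is in whatever the wrapper does in addition, and in anything the real installation path skips.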

[1]

7968.524 in menu_open
8090.038 [LVEVFC] ERROR SendEventEvfDev : [12][e005edfb]
8090.081 [STARTUP] ERROR ASSERT : LiveView::EvfCapState.c
8090.106 [STARTUP] ASSERT : Task = EvfCap
8090.109 [STARTUP] ASSERT : Core 0
8090.114 [STARTUP] ASSERT : Line 370
8090.120 [STARTUP] < StackDump >
8090.123 [STARTUP] SP: 0x00213D44
(...)

[DM] FROM Write Complete!!!
     3667:  38111.302 SHUTDOWN REASON 1

kitor, Jan 05 '22 20:01

Looks like we still see some similar crashes on RP. Requires more in-depth testing.

kitor, Jan 08 '22 08:01

Possibly useful for diagnosis, a library for producing much more detailed stack information: https://www.mcternan.me.uk/ArmStackUnwinding/

Might want a separate ticket, depending on how hard it is to integrate.

reticulatedpines, Aug 01 '22 11:08

Confirmed on other Digic 8 models. Not tested on Digic X, as it doesn't run LV overlays yet.

Disabling the use of state objects (mentioned in the first post) did not fix the issue. The apparent improvement was a fluke, caused by randomness or by some unknown condition that triggers the crash.

In general, all cases are related to some vsync callback timeout.

kitor, Aug 01 '22 11:08

Hmm, improved stack traces might not help much with a callback timeout. Still worth a try.

Perhaps a better debugging approach would be a minimal ML build with only logging facilities (dm_set_store more stuff at level 3 or 1?), then compare logs with and without it to see what differs.

Alternatively, it may simply be that we hold a lock for too long, or otherwise do too much processing in a window that is important to DryOS. That sounds boring but is easy to investigate: pare back ML until the crash disappears, so we can work out which area we're being too demanding in. Quite plausibly our RGB / YUV buffer code.
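
One cheap way to test the "too much processing" theory before paring things back is to time each suspect hook against a budget. The following is only a sketch under assumptions: hook_timing and hook_timing_record are hypothetical names, the budget value is arbitrary, and the timestamps are assumed to come from a free-running 32-bit microsecond counter, which may not match the timers actually available on these cameras.

```c
#include <stdint.h>

/* Hypothetical per-hook timing record; not an existing ML structure. */
struct hook_timing {
    uint32_t budget_us;   /* time we think the hook may safely take */
    uint32_t worst_us;    /* longest run seen so far */
    uint32_t overruns;    /* number of runs that exceeded the budget */
    uint32_t calls;       /* total runs recorded */
};

/* Record one run of a hook, given timestamps taken at entry and exit.
 * Assumes a free-running 32-bit microsecond counter: unsigned subtraction
 * then gives the right elapsed time across a single wraparound. */
static void hook_timing_record(struct hook_timing *t,
                               uint32_t start_us, uint32_t end_us)
{
    uint32_t elapsed = end_us - start_us;

    t->calls++;
    if (elapsed > t->worst_us)
        t->worst_us = elapsed;
    if (elapsed > t->budget_us)
        t->overruns++;
}
```

Only counters are updated inside the hook; printing or logging the results is better done later from a less time-critical task, so the measurement itself doesn't add to the problem.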

reticulatedpines, Aug 01 '22 12:08