culvert icon indicating copy to clipboard operation
culvert copied to clipboard

Bridge disabling, wdt fixes, etc.

Open zevweiss opened this issue 1 year ago • 1 comments

(This PR supersedes both #43 and #44, since I ended up adding some stuff to the wdt fixes that had dependencies on the changes in #43. If you'd prefer to merge things piecemeal that's of course fine.)

These changes stem from a pair of problems I encountered:

  • On one particular board (ASRock e3c256d4i, specifically), any attempt to use the iLPC bridge would trigger an immediate host crash -- not a culvert crash or even a kernel panic, just a complete system lockup. This motivated the first three patches (what had been in #43), which ultimately provide a way to explicitly disable certain bridges via a new command-line flag.

  • When doing a live BMC firmware reflash from the host, I've occasionally ended up with a corrupted firmware image that refused to boot. I'm pretty certain this was caused by a delay in the wdt-based reset arriving after completing the flash write sequence, which allowed the in-memory firmware to continue running for a time, during which it issued flash erases and/or writes before the reset kicked in. In debugging this I encountered a few problems:

    • When starting the wdt to initiate a reset, we weren't using the correct register-write sequence, leading to the wdt counting down from whatever value had previously been left in it, which on my systems was usually quite a bit longer than the desired duration.

    • We weren't setting the boot-source bit and instead just reused whatever it had previously been set to, meaning we weren't consistently booting from the default boot source after the reset.

    • In the wdt reset sequence we were un-gating the ARM clock immediately after enabling the watchdog (well before it fired), allowing execution to continue for approximately the duration of its countdown period.

In order to address these in a reasonably structured way I've made some infrastructural changes:

  • enum ahb_bridge goes away in favor of pointers to struct bridge_driver (which now also contains the bridge name instead of those living in a separate array of their own)

  • struct bridge driver gains a .release() operation mirroring the existing .reinit(), the combination of which are used to bracket lengthy blocking waits; the p2a bridge driver (which was the sole existing implementer of .reinit()) gains a .release() implementation, and the debug-UART and l2a bridge drivers gain both.

zevweiss avatar Nov 16 '23 10:11 zevweiss

Just leaving a comment here that I've tested this out on one system (AST2500) and it does resolve some issues I was having with resetting the BMC. Thanks!

iwoloschin avatar Dec 04 '23 16:12 iwoloschin

...alright, dusting this off and getting back to it -- the recent update to my branch addresses the minor stuff while totally punting on the bigger, tougher questions. I haven't had any great flashes of inspiration on ways of dealing with the release/reinit problems that don't balloon into much larger tasks than I think I'm ready to grapple with right now, unfortunately.

In terms of likely real-world practical effects of it, if I'm understanding correctly it would only cause misbehavior in cases where we do a wdt reset and then also continue on to do other operations that assume that the pre-reset state is still as it was -- does that seem right? AFAICT we don't currently have any such paths (only the reset and write firmware commands call wdt_perform_reset(), and both just exit right after doing so), so while it's "wrong" and certainly a bit of a potential trap we could fall into in the future it might not cause any immediate problems, so exchanging a definite bricked-BMC risk for it might be a worthwhile trade?

zevweiss avatar Feb 13 '24 03:02 zevweiss