culvert
culvert copied to clipboard
Bridge disabling, wdt fixes, etc.
(This PR supersedes both #43 and #44, since I ended up adding some stuff to the wdt fixes that had dependencies on the changes in #43. If you'd prefer to merge things piecemeal that's of course fine.)
These changes stem from a pair of problems I encountered:
-
On one particular board (ASRock e3c256d4i, specifically), any attempt to use the iLPC bridge would trigger an immediate host crash -- not a culvert crash or even a kernel panic, just a complete system lockup. This motivated the first three patches (what had been in #43), which ultimately provide a way to explicitly disable certain bridges via a new command-line flag.
-
When doing a live BMC firmware reflash from the host, I've occasionally ended up with a corrupted firmware image that refused to boot. I'm pretty certain this was caused by a delay in the wdt-based reset arriving after completing the flash write sequence, which allowed the in-memory firmware to continue running for a time, during which it issued flash erases and/or writes before the reset kicked in. In debugging this I encountered a few problems:
-
When starting the wdt to initiate a reset, we weren't using the correct register-write sequence, leading to the wdt counting down from whatever value had previously been left in it, which on my systems was usually quite a bit longer than the desired duration.
-
We weren't setting the boot-source bit and instead just reused whatever it had previously been set to, meaning we weren't consistently booting from the default boot source after the reset.
-
In the wdt reset sequence we were un-gating the ARM clock immediately after enabling the watchdog (well before it fired), allowing execution to continue for approximately the duration of its countdown period.
-
In order to address these in a reasonably structured way I've made some infrastructural changes:
-
enum ahb_bridge
goes away in favor of pointers tostruct bridge_driver
(which now also contains the bridge name instead of those living in a separate array of their own) -
struct bridge driver
gains a.release()
operation mirroring the existing.reinit()
, the combination of which are used to bracket lengthy blocking waits; the p2a bridge driver (which was the sole existing implementer of.reinit()
) gains a.release()
implementation, and the debug-UART and l2a bridge drivers gain both.
Just leaving a comment here that I've tested this out on one system (AST2500) and it does resolve some issues I was having with resetting the BMC. Thanks!
...alright, dusting this off and getting back to it -- the recent update to my branch addresses the minor stuff while totally punting on the bigger, tougher questions. I haven't had any great flashes of inspiration on ways of dealing with the release/reinit problems that don't balloon into much larger tasks than I think I'm ready to grapple with right now, unfortunately.
In terms of likely real-world practical effects of it, if I'm understanding correctly it would only cause misbehavior in cases where we do a wdt reset and then also continue on to do other operations that assume that the pre-reset state is still as it was -- does that seem right? AFAICT we don't currently have any such paths (only the reset
and write firmware
commands call wdt_perform_reset()
, and both just exit right after doing so), so while it's "wrong" and certainly a bit of a potential trap we could fall into in the future it might not cause any immediate problems, so exchanging a definite bricked-BMC risk for it might be a worthwhile trade?