[ci] HyperDebug firmware update flakiness

Open nbdd0121 opened this issue 1 year ago • 2 comments

Description

One source of FPGA CI flakiness turned out to be the HyperDebug firmware update. Example: https://dev.azure.com/lowrisc/opentitan/_build/results?buildId=159507&view=logs&j=6d7ef521-d3a1-575b-de1b-a7342dcf1a8e&t=5fbf7d75-9dee-5f42-7084-95893df8eb52

The failure pattern is:

  • HyperDebug firmware is updated
  • USB disconnects (seen in the kernel log)
  • The job fails with "no device" found
  • USB reconnects (seen in the kernel log)

The HyperDebug USB device takes ~3 seconds to be rediscovered, which is enough for jobs to fail.

The HyperDebug firmware update code checks whether the correct version is already running, but the HyperDebug versions have diverged between branches, so updates happen frequently.

I think the following actions should be taken:

  • [ ] Ensure HyperDebug firmware versions are identical across branches. This addresses the symptom by reducing the number of HyperDebug updates needed.
  • [ ] Maybe by default only upgrade but not downgrade HyperDebug versions? I am not sure there are compatibility guarantees @jesultra
  • [ ] In OT-lib, after a firmware update, confirm that the HyperDebug is alive. It appears that we have some provision (1 second wait + USB enumeration?), but it does not cover all code paths and is likely not sufficient.
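
The "only upgrade, never downgrade" idea from the second item could be sketched roughly as below. This is a minimal illustration, not the opentitantool implementation: the dotted numeric version format, `parse_version`, `should_flash`, and the `allow_downgrade` flag are all hypothetical names chosen for the example.

```python
from typing import Tuple

# Hypothetical representation: the real HyperDebug version format may differ.
Version = Tuple[int, ...]


def parse_version(text: str) -> Version:
    """Parse a dotted numeric version such as "1.4.2" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def should_flash(running: Version, bundled: Version,
                 allow_downgrade: bool = False) -> bool:
    """Decide whether to flash the bundled firmware over the running one."""
    if running == bundled:
        return False        # already on the wanted version, nothing to do
    if bundled > running:
        return True         # bundled firmware is newer: upgrade
    return allow_downgrade  # bundled firmware is older: only on explicit request
```

With a policy like this, a branch that bundles older firmware leaves a newer HyperDebug untouched, so alternating CI runs between branches would no longer trigger a flash (and the USB disconnect that comes with it) on every job.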

nbdd0121 avatar Sep 26 '24 13:09 nbdd0121

When enhancing the HyperDebug firmware, I generally do so by adding new commands or new features, while keeping the existing ones. So I think it is reasonable to modify opentitantool so that by default it only upgrades and never downgrades the HyperDebug firmware. I will look into making that code change.

jesultra avatar Sep 26 '24 16:09 jesultra

The above PR should address the second and third item from the list at the top of this Issue.

jesultra avatar Sep 26 '24 21:09 jesultra

With the above PR, OpenTitanTool in the master branch will not attempt downgrading HyperDebug firmware, and will wait up to 5 seconds after upgrade for the "new" USB device to show up on the bus.
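
To illustrate why the longer wait matters, here is a rough sketch of a poll-until-deadline loop together with a simulated bus on which the device reappears about 3 seconds after the update, as reported at the top of this issue. It is not the opentitantool code: `wait_for_device`, `FakeBus`, `survives_update`, and the polling interval are assumptions made for the example.

```python
import time
from typing import Callable


def wait_for_device(
    probe: Callable[[], bool],
    timeout_s: float = 5.0,
    poll_interval_s: float = 0.25,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it reports the device or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True   # device is back on the bus
        if clock() >= deadline:
            return False  # gave up: the caller would report "no device"
        sleep(poll_interval_s)


class FakeBus:
    """Simulated USB bus: the device reappears a fixed delay after the update."""

    def __init__(self, reappear_after_s: float) -> None:
        self.now = 0.0
        self.reappear_after_s = reappear_after_s

    def clock(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds  # advance simulated time instead of really sleeping

    def probe(self) -> bool:
        return self.now >= self.reappear_after_s


def survives_update(timeout_s: float, reappear_after_s: float = 3.0) -> bool:
    """Return True if a wait of `timeout_s` outlasts the re-enumeration delay."""
    bus = FakeBus(reappear_after_s)
    return wait_for_device(bus.probe, timeout_s, clock=bus.clock, sleep=bus.sleep)
```

In this simulation `survives_update(1.0)` gives up before the device returns, while `survives_update(5.0)` sees it come back, which matches the difference between the old 1-second wait and the new 5-second one.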

I see the CherryPick:earlgrey_es_sival label on the PR, but I am not exactly sure how cherry-picking works on GitHub, and I do not see any separate PR to any branches. We will not get the intended effect if some branch still has older HyperDebug firmware and does not have this PR: alternating CI runs of that branch and master will still downgrade and upgrade, and on the branch, OpenTitanTool will use the old 1-second timeout and may fail.

Please let me know if I need to take action to cherry-pick the PR.

jesultra avatar Oct 01 '24 16:10 jesultra

The cherry-picking label was broken due to a repository setting misconfiguration; that is now fixed, and I've retriggered the label.

nbdd0121 avatar Oct 01 '24 21:10 nbdd0121

Closing as completed. Given that we don't auto-downgrade anymore, there's no need to ensure versions match across branches.

nbdd0121 avatar Nov 18 '24 16:11 nbdd0121

Reopening to track the effort to sync HyperDebug versions across branches.

nbdd0121 avatar Jan 18 '25 20:01 nbdd0121