xous-core
xous-core copied to clipboard
Precursor device bricked after update
Just ran an update on a precursor that hasn't been used for several months.
After update finalized, and hard reset, device boots into error from #214 and hangs.
Hard reset button doesn't work anymore. I let the device run out of power, replugged, but now it just shows a weird black screen.
Any recourse other than running failsafe?
Update log:
$ python3 tools/updater/precursorupdater/precursorupdater.py
Using SoC v0.9.7-0-ga5d09b9 registers
Phase 1: Download the update
Latest stable build selected
Kernel 100% (7926896 of 7926896) |############################################| Elapsed Time: 0:00:02 Time: 0:00:02
Loader 100% (1282152 of 1282152) |############################################| Elapsed Time: 0:00:01 Time: 0:00:01
Gateware 100% (2621440 of 2621440) |##########################################| Elapsed Time: 0:00:01 Time: 0:00:01
Embedded Controller 100% (192600 of 192600) |#################################| Elapsed Time: 0:00:00 Time: 0:00:00
WF200 100% (309328 of 309328) |###############################################| Elapsed Time: 0:00:00 Time: 0:00:00
Downloaded Xous version v0.9.12-0
Phase 2: Apply the update
Halting CPU for update.
Disabling boot by erasing loader...
ID code bytes 1-2: 0x8080c2c2
ID code bytes 2-3: 0x3b3b8080
Erasing 100% (262144 of 262144) |#############################################| Elapsed Time: 0:00:01 Time: 0:00:01
Erase finished
Uploading kernel
ID code bytes 1-2: 0x8080c2c2
ID code bytes 2-3: 0x3b3b8080
Erasing 100% (7926896 of 7926896) |###########################################| Elapsed Time: 0:00:32 Time: 0:00:32
Erase finished
Writing 100% (7926896 of 7926896) |###########################################| Elapsed Time: 0:00:43 Time: 0:00:43
Write finished
Skipped verification at user request
Uploading EC
ID code bytes 1-2: 0x8080c2c2
ID code bytes 2-3: 0x3b3b8080
Erasing 100% (192600 of 192600) |#############################################| Elapsed Time: 0:00:01 Time: 0:00:01
Erase finished
Writing 100% (192600 of 192600) |#############################################| Elapsed Time: 0:00:00 Time: 0:00:00
Write finished
Skipped verification at user request
Uploading wf200
ID code bytes 1-2: 0x8080c2c2
ID code bytes 2-3: 0x3b3b8080
Erasing 100% (309328 of 309328) |#############################################| Elapsed Time: 0:00:01 Time: 0:00:01
Erase finished
Writing 100% (309328 of 309328) |#############################################| Elapsed Time: 0:00:01 Time: 0:00:01
Write finished
Skipped verification at user request
Staging gateware
ID code bytes 1-2: 0x8080c2c2
ID code bytes 2-3: 0x3b3b8080
Erasing 100% (2621440 of 2621440) |###########################################| Elapsed Time: 0:00:10 Time: 0:00:10
Erase finished
Writing 100% (2621440 of 2621440) |###########################################| Elapsed Time: 0:00:13 Time: 0:00:13
Write finished
Skipped verification at user request
Restoring loader
ID code bytes 1-2: 0x8080c2c2
ID code bytes 2-3: 0x3b3b8080
Erasing 100% (1282152 of 1282152) |###########################################| Elapsed Time: 0:00:05 Time: 0:00:05
Erase finished
Writing 100% (1282152 of 1282152) |###########################################| Elapsed Time: 0:00:06 Time: 0:00:06
Write finished
Skipped verification at user request
Resuming CPU.
Resetting SOC...
Update finished!
Visit the QR code above to help locate the hole, or go to https://ci.betrusted.io/i/reset.jpg.
You *must* reboot by inserting a paperclip in the hole in the lower right hand side, then follow the on-device instructions.
Hmm...what version were you coming from? 0.9.7?
The SoC image should be intact -- it's not overwritten. If it's coming up with the starry night screen, that means the SoC isn't initializing.
When you said it got to the screen saying that the SoC is too old, did anything else happen to show up at the same time about running an autoupdate?
I have a feeling what might be going on is that actually in the background, an autoupdate of the SoC image and/or the EC is running, and it's covered up by the notice about the SoC being too old for the HID, and then when you reboot you're stopping right in the middle of an update step that would be dangerous to abort.
Couple of more questions, I guess...
if you plug the device in and hit the hard reset button, does it change anything?
If you press both F1+F4 while on the starry night does that change anything?
If you plug in USB, does the device enumerate as a Precursor, even if the display is borked?
I'm trying to pin down if the issue is the SoC image being corrupted, and if so, how...
Another question...did you burn any backup keys on your device? e.g. set the BBRAM key...if you don't know what I'm talking about, then you did not do that.
I just re-created a 0.9.7 device and stepped through the update process...let me know if any of these stages don't look familiar to you:
^^ version check prior to update
^^ after running precursorupdater
^^ after pressing any key to continue
^^ after waiting a few minutes
After this point I can reboot the device and it reboots normally...
Thanks for the quick response @bunnie ! I'll try to clarify whatever I can -
Not really sure what version I came from, but git reflog
says I last pulled xous-core on April 2022, so it might have been that long.
Right now device is stuck on starry night screen, thanks for letting me know that's a thing :) No dice on hard reset button. F1+F4 does nothing. Device does not enumerate on lsusb
.
When the update was running I did get the EC autoupdate screen, after which I just let it run for a while, but I don't recall seeing "Please unplug the charger cable" afterwards.
Never burned any keys on the device, so happy to run any recovery method that would be easiest.
If the system wasn't plugged in, it'll automatically reboot. So in your case it went through the update process and rebooted itself?
It might have rebooted like you said, I don't really remember if the cable was plugged in during EC autoupdate or not.
Ah ok. Do you recall what the screen said when you punched the hard reset?
The last time I got the device to reset after upgrade, I recall seeing something that resembled a proper boot sequence, but then stalled on the #214 error. I wish I had paid more attention so I could provide more details.
hmm seems the possibility is either the FPGA provisioning step hit a bug, or you might have rebooted in the middle of the update process.
If the system is just sticking on the starry night, you'll need to do a failsafe to recover it. that being said, if you get that setup I would be curious to get a snapshot of the FPGA section of the rom so I can dig through it and try to see if I can't find evidence of corruption or something.
If you're game I will prepare a command for you to run from the raspberry pi to snapshot that section in the next day or so. the resulting file will be a couple megabytes in size, you can probably attach it to GH. since you would be restoring from failsafe any keys in the fpga image would be forfeit; so I think there is no risk of disclosure publishing it, but if you do have sensitive secrets linked to the device lmk and we can exercise greater caution.
fwiw the provisioning process does make a backup copy of the fpga so in theory an additional step we could try if you have secrets you need back is restoring the backup copy. however this has never happened before so I'd have to conjure a script to do that.
I'll go ahead and prepare the failsafe environment - need to find a raspi somewhere, recently I've only been using odroids :joy:. Anyway, would love to run a snapshot if you can set me up with a script to run.
Again reiterating that I have no secrets to save, so any recovery/debug process you'd like me to run I'd be happy to.
My device was also just bricked after running the updater. After running the script, I get the screen in the first image. Then it updates the gateware. After that it resets back to the first screen. It just gets stuck in an infinite boot cycle.
Using SoC v0.9.5-0-gb6dd1f7 registers Phase 1: Download the update Latest stable build selected Kernel 100% (7926896 of 7926896) |#####################################################################################################################################################################################################################| Elapsed Time: 0:00:05 Time: 0:00:05 Loader 100% (1282152 of 1282152) |#####################################################################################################################################################################################################################| Elapsed Time: 0:00:01 Time: 0:00:01 Gateware 100% (2621440 of 2621440) |###################################################################################################################################################################################################################| Elapsed Time: 0:00:03 Time: 0:00:03 Embedded Controller 100% (192600 of 192600) |##########################################################################################################################################################################################################| Elapsed Time: 0:00:00 Time: 0:00:00 WF200 100% (309328 of 309328) |########################################################################################################################################################################################################################| Elapsed Time: 0:00:00 Time: 0:00:00 Downloaded Xous version v0.9.12-0
Phase 2: Apply the update Halting CPU for update. Disabling boot by erasing loader... ID code bytes 1-2: 0x8080c2c2 ID code bytes 2-3: 0x3b3b8080 Erasing 100% (262144 of 262144) |######################################################################################################################################################################################################################| Elapsed Time: 0:00:01 Time: 0:00:01 Erase finished Uploading kernel ID code bytes 1-2: 0x8080c2c2 ID code bytes 2-3: 0x3b3b8080 Erasing 100% (7926896 of 7926896) |####################################################################################################################################################################################################################| Elapsed Time: 0:00:38 Time: 0:00:38 Erase finished Writing 100% (7926896 of 7926896) |####################################################################################################################################################################################################################| Elapsed Time: 0:15:40 Time: 0:15:40 Write finished Skipped verification at user request Uploading EC ID code bytes 1-2: 0x8080c2c2 ID code bytes 2-3: 0x3b3b8080 Erasing 100% (192600 of 192600) |######################################################################################################################################################################################################################| Elapsed Time: 0:00:03 Time: 0:00:03 Erase finished Writing 100% (192600 of 192600) |######################################################################################################################################################################################################################| Elapsed Time: 0:00:22 Time: 0:00:22 Write finished Skipped verification at user request Uploading wf200 ID code bytes 1-2: 0x8080c2c2 ID code bytes 2-3: 0x3b3b8080 Erasing 100% (309328 of 309328) |######################################################################################################################################################################################################################| Elapsed Time: 0:00:02 Time: 0:00:02 Erase finished Writing 100% (309328 of 309328) |######################################################################################################################################################################################################################| Elapsed Time: 0:00:36 Time: 0:00:36 Write finished Skipped verification at user request Staging gateware ID code bytes 1-2: 0x8080c2c2 ID code bytes 2-3: 0x3b3b8080 Erasing 100% (2621440 of 2621440) |####################################################################################################################################################################################################################| Elapsed Time: 0:00:12 Time: 0:00:12 Erase finished Writing 100% (2621440 of 2621440) |####################################################################################################################################################################################################################| Elapsed Time: 0:05:11 Time: 0:05:11 Write finished Skipped verification at user request Restoring loader ID code bytes 1-2: 0x8080c2c2 ID code bytes 2-3: 0x3b3b8080 Erasing 100% (1282152 of 1282152) |####################################################################################################################################################################################################################| Elapsed Time: 0:00:06 Time: 0:00:06 Erase finished Writing 100% (1282152 of 1282152) |####################################################################################################################################################################################################################| Elapsed Time: 0:02:33 Time: 0:02:33 Write finished Skipped verification at user request Resuming CPU. Resetting SOC...
Update finished!
@monological is the current status that it is able to boot (e.g. you get the half black/half gray screen) and it loops back to that over and over again, or are you looking at a static screen that doesn't change even if you punch the hard reset while the device is plugged in?
Thanks for the images...this is starting to smell like maybe there is an edge case in the e-fuse burning code that might be interfering with non-efused devices re-encrypting their gateware correctly.
@monological @yuvadm did either of you have the root keys provisioned? were you prompted to type a password during the update to authorize the provisioning?
This smells like the update process itself is running according to plan, but the gateware is not being re-encrypted properly somehow. In order for me to reproduce this I need to understand what state the devices are coming in from -- I must not be capturing an important step in the tests.
In the meantime I'm going to do a temporary override of the release train to turn off efuse support, just in case that's what is causing the trouble and potentially save other users the trouble of a bricked device.
@yuvadm here's the command:
~/code/betrusted-scripts/jtag-tools $ ./jtag_gpio.py --read --raw-binary -s -r -n -f dump.bin
I think the resulting dump.bin file should be about 2.5MiB in size so you could toss it into this thread as a file?
The staged v0.9.12 release now omits the efuse burning feature. We'll see if we get any new reports of bricked devices.
In the meantime I will pore over the code to see if there's a missing case. The problem you're describing fits a scenario where I'm not applying the FPGA encryption key incorrectly somehow. The burn-from-0.9.7 test that I ran doesn't have root keys provisioned, I'm guessing yours do have that provisioned?
The efuse burning routines interleave directly onto that code path, so I could see there being maybe a cfg
statement or something like that at the wrong place breaking one of the paths.
Nope...that's not it. Now that I'm reviewing, the code path for dealing with the alternate encryption option was there since v0.9.11. Unless nobody ran an update since then, the efuse flag is probably not the issue. I've restored the original release files, but the question remains of how did this happen. Kind of scary when a release bricks devices -- I appreciate the prompt reporting of the issue, and your continued assistance in trying to root cause this.
Hmm.. @yuvadm @monological can you please tell me if you setup the root keys on your device?
And if so, were you prompted to type in your root passphrase as part of the update process?
Thanks @bunnie I'll update further once I setup my raspi and manage to dump the firmware.
I don't recall setting up root keys, or setting a root passphrase - unless I was required to?
Once again reiterating that I haven't touched the device since ~April 22, so the efuse flag might still be a candidate bug?
Managed to dump the firmware (zipped due to github limitations) - dump.zip
... and after running provision_xous
I've managed to unbrick the device! Thanks for all the assistance @bunnie , and hopefully the dump can shed some light on what happened.
thank you! Glad to hear you can unbrick it.
I have found one thing that seems to cause troubles. If you....fat-finger the keyboard...while the update happens, it can trigger a condition that causes the gateware update to fail mid-stream, causing a situation similar to what you're seeing.
Your dump.zip matches this pattern, possibly. It's definitely not random sector corruption at the very least.
I think I have some clues to trace this down...this may actually be a bug whose root cause isn't in the update routine itself, but I think it's actually in some graphics routine optimizations that were done which interact with the updater in a bad way.
This would also explain why none of the automated testing catches this...because automated testers don't fat finger the keyboard while the update is running.
(anyone else reading this thread, to set expectations it might be a couple days before this bug is fixed)
@bunnie
@monological is the current status that it is able to boot (e.g. you get the half black/half gray screen) and it loops back to that over and over again, or are you looking at a static screen that doesn't change even if you punch the hard reset while the device is plugged in?
Yes it's able to boot. I do get the half black/half gray screen.
However, I checked the device this morning and the screen was completely off. I plugged it in and I got, what I'm assuming, is the "starry night" screen. While plugged in, if I hard reset with the pin on the side, it does nothing.
@monological @yuvadm did either of you have the root keys provisioned? were you prompted to type a password during the update to authorize the provisioning?
Nope. The root keys were never set on the device. I never had to type in a password.
... and after running
provision_xous
I've managed to unbrick the device! Thanks for all the assistance @bunnie , and hopefully the dump can shed some light on what happened.
@yuvadm Out of curiosity what/where is provision_xous
? I don’t see a script by that name… cc @bunnie
it's here:
https://github.com/betrusted-io/betrusted-scripts/blob/main/provision_xous.sh
@bunnie
@monological is the current status that it is able to boot (e.g. you get the half black/half gray screen) and it loops back to that over and over again, or are you looking at a static screen that doesn't change even if you punch the hard reset while the device is plugged in?
Yes it's able to boot. I do get the half black/half gray screen.
However, I checked the device this morning and the screen was completely off. I plugged it in and I got, what I'm assuming, is the "starry night" screen. While plugged in, if I hard reset with the pin on the side, it does nothing.
@monological @yuvadm did either of you have the root keys provisioned? were you prompted to type a password during the update to authorize the provisioning?
Nope. The root keys were never set on the device. I never had to type in a password.
thanks.
I am currently running under the theory that the issue is triggered by a bug in the UX layer that causes dynamic dialog boxes to not return correctly when faced with unexpected user input. this seems particularly likely on devices that don't have keys set up because it goes straight to updating the FPGA without prompting for user input. I was able to at least reproduce a
(speaking of fat fingering... close with comment ends up where the space bar is on a mobile phone if the virtual keyboard gets palmed)
Got a chance to poke at it more tonight...i was able to discount the fat-finger hypothesis, as i could not correlate a fat finger event definitively. The good news is I could get it to fail with the originally staged image. The even better news is the image currently available on stable does not fail.
The difference is I added a bit more margin to the cache flush pad after a FLASH write. My current best guess is this is somehow a FLASH coherence issue triggered by, somehow, the CPU responding faster than before after the kernel patches. More digging is necessary but hopefully this hotfix reduces or entirely eliminates instances of bricking in the field.
I just got a note from dolu1990 (maintainer of the Vex core) that there is a bug in the MMU that can cause D$ consistency issues.
https://github.com/SpinalHDL/VexRiscv/commit/220b599c9a820444941225c05cfa36b3b84e2bea
This is now pulled into the bleeding edge -- the problems described here are consistent with this bug. In theory I could retire the patch that adds some cache flush margin and speed up writes, but for now I'm kia si enough that I'm going to leave it in.
I did try to do some testing against the original failing version of the code to see if this definitively resolves the issue. Unfortunately, the problem became not reproduceable today for some reason. It may have to do with...temperature, or aging of the FLASH; mainly when the writes complete very fast, the problem manifests, so it's a bit of a heisenbug to nail down.
I'll still keep this open in case other people experience the issue.
Nope. The root keys were never set on the device. I never had to type in a password.
I updated to latest (non-bleeding) version last night, from whichever version shipped with the device in April 2022 and am seeing similar behavior.
I had set my root password previously, and used it successfully minutes before the update. After updating, it doesn't recognize the password, so I am unable to sign the update and it goes back into the boot loop described by @monological.
If I run the update a second time I'm able to get back into the main UI, but the password is never recognized.
Interesting. so the system is in a state where it claims there is a root password...and it can boot...but the password itself is not recognized?
To be in this state, the key box has to show "not empty", but some of the keys would need to be corrupted. However, any corruption should be detected because the whole image is hashed with an HMAC (SHA-256, so pretty strong; unlikely to be able to change something in the key box and have it result in a matching HMAC).
I'm scratching my head on this one...I can't offer a good hypothesis for why it'd happen. That being said, the bug in this thread is fixed in the bleeding-edge version, not in the latest stable. :-/