Boot control for A/B slot bootloaders
Seeing that @ashkitten was having an issue with always booting into recovery (https://github.com/NixOS/mobile-nixos/pull/104), I was reminded of this diagram from the following document: https://source.android.com/devices/tech/ota/ab/ab_implement

This might suggest that @ashkitten's partitions had been marked invalid, and as a result, it would always boot into recovery. This could also explain why I had not encountered this problem yet with a freshly flashed phone.
I've only spent a few minutes looking into how this system operates, but I think to make this system operate properly for us, we would also need to mark the partition as having successfully booted at some point.
There is an existing bootctl utility available in android here: https://android.googlesource.com/platform/system/extras/+/master/bootctl/
It, among other things, has the ability to run bootctl mark-boot-successful.
Since this operates through a HAL, the underlying implementation (for Qualcomm) appears to be here:
https://android.googlesource.com/platform/hardware/qcom/bootctrl/+/refs/heads/master
From a brief skim, it appears to hold the BOOT_SUCCESSFUL status in the attributes field of the GPT partition entry.
I can take a stab at working on this (maybe this weekend), but figured I should post this first in case there is any feedback. Is anyone aware if other projects implement this?
I guess to be certain this is even the issue, @ashkitten could you post the output of:
fastboot getvar all 2>&1 | grep slot
on the device which is having the issue?
(Though that output will not be as useful if you have flashed something else e.g. LineageOS since then.)
Let me add some notes to this issue.
A boot should be considered successful only if it gets past stage-2.
The process for marking A/B slots successful should also work for depthcharge devices, so it should have enough abstraction to not be android-specific.
Our u-boot based platforms (like the WIP pinephone) will gain a boot/recovery A/B-like scheme where booting successfully is something that needs to be tracked. It would fall back to the recovery kernel/initramfs if the main one fails to boot to stage-2.
sorry, i've flashed lineageos on this phone since then so i can't help with that. i can point out however that i used fastboot erase to wipe the system partition so it'd boot properly from userdata. maybe that affects it?
No worries. I was able to test this myself and I'm fairly certain this was the problem you encountered. I was able to reproduce the issue of always booting to recovery. Here are some details of what I've found:
These variables seem to represent the state of the "boot control" system: (from fastboot getvar all)
(bootloader) slot-count:2
(bootloader) current-slot:b
(bootloader) slot-retry-count:b:3
(bootloader) slot-unbootable:b:no
(bootloader) slot-successful:b:no
(bootloader) slot-retry-count:a:3
(bootloader) slot-unbootable:a:no
(bootloader) slot-successful:a:no
This was the state my phone had immediately after flashing both boot_a and boot_b with mobile-nixos. Next, I repeatedly completed a full boot into the mobile-nixos GUI and then restarted into fastboot. Each time the slot-retry-count:b variable would decrement by 1, until it reached 0. At that point, every boot would enter recovery. Also, slot-unbootable:b changed to yes at that point:
(bootloader) slot-count:2
(bootloader) current-slot:b
(bootloader) slot-retry-count:b:0
(bootloader) slot-unbootable:b:yes
(bootloader) slot-successful:b:no
(bootloader) slot-retry-count:a:3
(bootloader) slot-unbootable:a:no
(bootloader) slot-successful:a:no
The phone also never automatically switched to the a slot for me, maybe because its slot-successful:a state was no, and so it was treated as invalid when considering switching.
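To compare dumps like the two above between reboots, the per-slot lines can be pulled out with a small script. This is only a sketch: the function name parse_slot_state and the regular expression are made up for illustration, and it assumes the output keeps the exact "(bootloader) slot-<variable>:<slot>:<value>" shape shown here, which may differ on other bootloaders.

```python
import re
from collections import defaultdict

# Matches lines such as "(bootloader) slot-retry-count:b:3".
# Lines like "slot-count:2" or "current-slot:b" are deliberately not matched.
SLOT_LINE = re.compile(r"^\(bootloader\)\s+slot-([a-z-]+):([a-z_]+):(\S+)$")


def parse_slot_state(getvar_output: str) -> dict:
    """Group the per-slot variables from `fastboot getvar all` by slot name."""
    slots = defaultdict(dict)
    for line in getvar_output.splitlines():
        match = SLOT_LINE.match(line.strip())
        if match:
            variable, slot, value = match.groups()
            slots[slot][variable] = value
    return dict(slots)


# The second dump from this comment, used as sample input.
SAMPLE = """\
(bootloader) slot-count:2
(bootloader) current-slot:b
(bootloader) slot-retry-count:b:0
(bootloader) slot-unbootable:b:yes
(bootloader) slot-successful:b:no
(bootloader) slot-retry-count:a:3
(bootloader) slot-unbootable:a:no
(bootloader) slot-successful:a:no
"""

state = parse_slot_state(SAMPLE)
for slot, variables in sorted(state.items()):
    print(slot, variables)
```

On a real device the input would come from `fastboot getvar all 2>&1`, since fastboot prints these variables on stderr.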
Here are a couple more notes for this issue:
systemd-boot also has "boot counters" to do something similar: (This would be nice to have in nixos)
https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT/
There is a bit of a mismatch between the systemd approach and the android a/b "boot control" approach. Systemd has counters for each boot entry, while android has counters for each slot.
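To illustrate the per-entry side of that mismatch: as described in the linked systemd document, the counters live in the boot entry's file name, as "+<tries left>" optionally followed by "-<tries done>" right before ".conf". The sketch below mimics the rename done on a boot attempt. The function name is hypothetical, and leaving an entry with zero tries left untouched is an assumption of this sketch, not something taken from systemd.

```python
import re

# "<name>+<tries left>[-<tries done>].conf"
COUNTER = re.compile(r"^(?P<name>.+)\+(?P<left>\d+)(?:-(?P<done>\d+))?\.conf$")


def after_boot_attempt(filename: str) -> str:
    """Return the entry file name as it would look after one boot attempt."""
    match = COUNTER.match(filename)
    if match is None:
        return filename  # no counter in the name: the entry is not being counted
    left = int(match["left"])
    done = int(match["done"] or 0)  # a missing "tries done" part means zero
    if left == 0:
        return filename  # ASSUMPTION: an exhausted entry is left as-is here
    return f"{match['name']}+{left - 1}-{done + 1}.conf"


name = "4.14.11-300.fc27.x86_64+3.conf"
for _ in range(3):
    name = after_boot_attempt(name)
    print(name)
```

The Android scheme has no equivalent per-entry file: the counter belongs to the slot as a whole, which is why the two approaches do not map onto each other directly.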
Ah! This looks reproducible on walleye. 7 retry counts initially, simply rebooting takes one off the wall, 6 retry counts left on the wall.
Though with walleye this leaves it in "Slot Unbootable: Load Error" state at 0.
(Funny enough, the androidboot.mode=charger mode still relies on that possibly failing slot, and boots it successfully!)
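A rough model of the behaviour observed so far may help when reasoning about it. This is a sketch under assumptions, not the bootloader's actual logic: the Slot class and both functions are made up for illustration, the exact moment the unbootable flag gets set is a guess that happens to match the dumps above, and the idea that a slot marked successful stops losing retries comes from how the Android A/B documentation describes the scheme, not from testing here.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    retry_count: int
    successful: bool = False
    unbootable: bool = False


def boot_attempt(slot: Slot) -> str:
    """Model one power-on of the given slot and return where it ends up."""
    if slot.unbootable:
        return "recovery"
    if not slot.successful:
        # Nothing ever marked this slot as good, so the attempt costs a retry.
        slot.retry_count -= 1
        if slot.retry_count == 0:
            # ASSUMPTION: the flag flips as soon as the counter reaches zero.
            slot.unbootable = True
    return "normal"


def mark_boot_successful(slot: Slot) -> None:
    """What `bootctl mark-boot-successful` is assumed to achieve."""
    slot.successful = True


never_marked = Slot(retry_count=3)
print([boot_attempt(never_marked) for _ in range(5)])

marked = Slot(retry_count=3)
boot_attempt(marked)
mark_boot_successful(marked)
print([boot_attempt(marked) for _ in range(5)], marked.retry_count)
```

With a starting count of 7 in place of 3, the same model would describe the walleye observation, apart from the different wording of its unbootable state.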
So I've been digging a bit because it's going to become quite a thorny issue with the newer devices that require that.
I don't think we can end up relying on the Android HAL... Though maybe we can.
But if we can't, here's some notes:
Qualcomm (GPT)
It works a bit like ChromeOS' A/B scheme.
- https://www.chromium.org/chromium-os/chromiumos-design-docs/disk-format#TOC-Selecting-the-kernel
It uses some of the bits left available on the partition entry in the GPT to store some data.
- https://github.com/LineageOS/android_hardware_qcom_bootctrl/blob/69f2d8d08699fdec49605c6b95fc06163952b6fa/boot_control.cpp#L130-L248
- https://github.com/LineageOS/android_device_google_bonito/blob/4f1b691694a1788941cad03dba95102d93437654/gpt-utils/gpt-utils.h#L65-L88
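As a sketch of what "storing data in the partition entry" means in practice: a GPT partition entry has a 64-bit attributes field at byte offset 48, and its upper 16 bits are left to the partition type to define. The flag positions below are my reading of the linked gpt-utils.h and should be treated as assumptions to check against that header. The function names are hypothetical. The example only patches an in-memory copy of one entry and does not touch a disk.

```python
import struct

GPT_ENTRY_SIZE = 128    # common entry size; the authoritative value is in the GPT header
ATTRIBUTES_OFFSET = 48  # offset of the 64-bit attributes field within an entry

# ASSUMPTION: the A/B flags sit in byte 6 of the attributes field, with these
# bit positions, as read from the linked gpt-utils.h. Verify before relying on it.
AB_FLAG_BYTE = 6
SLOT_ACTIVE = 1 << 2
BOOT_SUCCESSFUL = 1 << 6
UNBOOTABLE = 1 << 7


def read_attributes(entry: bytes) -> int:
    """Return the little-endian 64-bit attributes field of a partition entry."""
    (attributes,) = struct.unpack_from("<Q", entry, ATTRIBUTES_OFFSET)
    return attributes


def ab_flags(entry: bytes) -> int:
    """Return only the byte assumed to carry the A/B flags."""
    return (read_attributes(entry) >> (8 * AB_FLAG_BYTE)) & 0xFF


def with_boot_successful(entry: bytes) -> bytes:
    """Return a copy of the entry with the (assumed) BOOT_SUCCESSFUL bit set."""
    attributes = read_attributes(entry) | (BOOT_SUCCESSFUL << (8 * AB_FLAG_BYTE))
    patched = bytearray(entry)
    struct.pack_into("<Q", patched, ATTRIBUTES_OFFSET, attributes)
    return bytes(patched)


blank_entry = bytes(GPT_ENTRY_SIZE)  # stand-in for an entry read from disk
patched_entry = with_boot_successful(blank_entry)
print(hex(ab_flags(patched_entry)))
print("marked successful:", bool(ab_flags(patched_entry) & BOOT_SUCCESSFUL))
```

Writing such a change back for real also means refreshing the CRC32 of the partition entry array stored in the GPT header and keeping the backup GPT in sync, which this sketch leaves out.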
I have not yet verified that my devices use that method, but I strongly think at least my Razer Phone 2 (SDM845) does.
$ strings bootctrl.sdm845.so | grep 'Failed to get pentry/pentry_bak'
%s: Failed to get pentry/pentry_bak for %s
This strongly points towards the linked implementation. (With multiple other strings.)
And here's a first draft:
- https://gist.github.com/e9b2e917340f5085c119692a3bb04525
To mark a boot as a success:
./boot_control.rb --success
Though note that there is almost no validation made.
There shouldn't be any danger if this is run without marking on a device which does not use that scheme. Marking, however, flips a bit on the current boot$SUFFIX partition, so using it on a device that implements another scheme is undefined behaviour.
To implement this at a broader scale, there should be detection in place, though I don't yet know what should or could be detected.
The plan will be to eventually:
- test on walleye
- make generic enough to support depthcharge
- add slot switching
- integrate all the data into recovery
We probably can assume that when stage-1 works and runs up to switch_root, the boot is successful. I know that for Android it has to also boot the system successfully, but doing so for us is problematic due to different system types requiring different methods to mark the boot as successful.
So I guess that what will happen is for devices with ab_partitions = true, we'll use an internal config option that describes the scheme used for A/B support. Then we'll implement them per SoC, like a quirk. Hopefully we don't get different implementations on the same SoC, though if we do we can probably manage using mkDefault and such.
In #511 this was added for the qualcomm scheme on SDM845 systems, with only the intent of marking the current slot valid.
There is no support for updating to another slot, marking untested, prioritizing it, etc, yet.
But this should solve the core problem.
Keeping open since it is still an issue to resolve (and in a generic fashion, see depthcharge).
Is there any way to do this on depthcharge? Or even somehow manually make depthcharge boot from the b_boot so I can recover a_boot?
Not at this time. With depthcharge it's easier to cheat, so there hasn't been the need yet.