Rare "unable to mount data partition" on boot
Not super easy to see what's going on in this state since SSH isn't enabled. Fairly easy to repro in the new testing closet though. Touch also doesn't work in this state.
These would all help debug this: https://github.com/commaai/agnos-builder/issues/14, https://github.com/commaai/agnos-builder/issues/158, https://github.com/commaai/agnos-builder/issues/156
In this state I can login via serial.
Serial works, right, via jungle v2? It might be difficult to find which of the 6 devices is the right one, but I guess it can be taken one by one.
Some simple commands to check what happened:
dmesg | grep "mount"
journalctl | grep "mount"
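To cast a slightly wider net than a plain "mount" match, a small filter along these lines could be piped from either command. This is only a sketch: the function name and the extra keywords (userdata, ext4, fsck) are my own guesses at useful search terms, not anything AGNOS defines.

```shell
# Hypothetical helper: keep only log lines that look mount-related.
# Reads log text on stdin, matches case-insensitively.
mount_related() {
  grep -iE 'mount|userdata|ext4|fsck'
}

# Possible usage over serial:
#   dmesg | mount_related
#   journalctl -b | mount_related   # -b limits output to the current boot
```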
My first (naive) assumption would be that the filesystem is corrupt and fsck is disabled in fstab (the last field, i.e. the second 0, is the fsck pass number):
/dev/disk/by-partlabel/userdata /data auto discard,noatime,nodiratime,nosuid,nodev,nofail 0 0
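For reference, the sixth fstab field is the fsck pass number: 0 (or a missing field) means the filesystem is never checked at boot, and non-root filesystems normally use 2. A quick way to read that field for a given mount point might look like the sketch below; the function name is made up for illustration.

```shell
# Hypothetical helper: print the fsck pass number (6th fstab field) for a
# mount point. Reads fstab-formatted text on stdin, skips comment lines,
# and treats a missing 6th field as 0.
fsck_passno() {
  awk -v mp="$1" '$1 !~ /^#/ && $2 == mp { print ($6 == "" ? 0 : $6) }'
}

# Applied to the line above:
echo '/dev/disk/by-partlabel/userdata /data auto discard,noatime,nodiratime,nosuid,nodev,nofail 0 0' | fsck_passno /data
# prints 0, i.e. fsck is skipped for /data

# On the device: fsck_passno /data < /etc/fstab
```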
Very interesting though why touch doesn't work.
Got one connected over serial. No console and it spams this:
(openpilot) macbookair:tests adeebshihadeh$ ./som_debug.sh
Failed to locate modem.mdt(rc:-11)
[FAILED] Failed to start Remote Storage Service.
[ 1633.090419] pil-q6v5-mss 4080000.qcom,mss: modem: Failed to locate modem.mdt(rc:-11)
[FAILED] Failed to start Remote Storage Service.
[ 1723.584538] pil-q6v5-ms080000.qcom,mss: modem: Failed to locate modem.mdt(rc:-11)
[FAILED] Failed to start Remote Storage Service.
[ 1814.082480] pil-q6v5-mss 4080000.qcom,mss: modem: Failed to locate modem.mdt(rc:-11)
[FAILED] Failed to start Remote Storage Service.
[ 1904.583378] pil-q6v5-mss 4080000.qcom,mss: modem: Failed to locate modem.mdt(rc:-11)
[FAILED] Failed to start Remote Storage Service.
[ 1995.081874] pil-q6v5-mss 4080000.qcom,mss: modem: Failed to locate modem.mdt(rc:-11)
[FAILED] Failed to start Remote Storage Service.
[ 2085.585541] pil-q6v5-mss 4080000.qcom,mss: modem: Failed to locate modem.mdt(rc:-11)
What happened after you restarted it?
Can I reproduce this somehow? When does it usually happen?
I tested activating fsck in fstab for userdata, as explained here.
After “resetting” userdata by writing COMMA_RESET, fsck fixed userdata successfully.
If you agree on this change (activating fsck on boot for userdata), I can:
- switch comma/flash to erase userdata by flashing a very small valid ext4 partition with just "__system_reset__" on it (RESET_TRIGGER)
- undo this (again) https://github.com/commaai/agnos-builder/commit/052d991f9d48bc174b60eedd98188a6a35fd9e7e
A second possibility is a race condition: for some reason, the partition may have been mounted only after the mountpoint check ran. I can rule this out if the device rebooted fine.
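If the race theory holds, a check that tolerates a late mount would behave differently from a one-shot check. The following is only a sketch of that idea, not how AGNOS does its check: the function name, the timeout argument, and the optional mounts-file argument (there so it can be tried against a sample file) are all assumptions.

```shell
# Hypothetical helper: wait_for_mount MOUNTPOINT TIMEOUT_SECONDS [MOUNTS_FILE]
# Polls a /proc/mounts-style file once per second until MOUNTPOINT shows up
# as the 2nd field, or the timeout runs out. Returns 0 if mounted, 1 if not.
wait_for_mount() {
  wfm_mp=$1
  wfm_remaining=$2
  wfm_mounts=${3:-/proc/mounts}
  while :; do
    if awk -v mp="$wfm_mp" '$2 == mp { found = 1 } END { exit !found }' "$wfm_mounts"; then
      return 0
    fi
    [ "$wfm_remaining" -le 0 ] && return 1
    sleep 1
    wfm_remaining=$((wfm_remaining - 1))
  done
}

# Possible usage: wait_for_mount /data 10 || echo "/data still not mounted"
```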
No, we don't want to erase and format without user action. That's an extremely risky bug to be open to.
This is super rare, so I'm planning on fixing this myself. It's hard to repro without our rack of 50+ devices.
> No, we don't want to erase and format without user action. That's an extremely risky bug to be open to.
This is only related to how comma flash does userdata erase. Maybe I’m missing something.
> This is super rare, so I'm planning on fixing this myself. It's hard to repro without our rack of 50+ devices.
Sure, I totally understand. I’ll try at least getting https://github.com/commaai/agnos-builder/issues/14 done.
> Got one connected over serial. No console and it spams this:
“Failed to locate modem.mdt(rc:-11)” seems like /firmware didn’t get mounted, since modem.mdt is in /firmware/image/modem.mdt. So what I’m saying above might be completely unrelated to the real issue.
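Over serial, the /firmware theory could be checked directly against /proc/mounts. A minimal sketch, assuming /firmware is its own mount point on these devices (the function name is made up):

```shell
# Hypothetical helper: exit 0 if the given mount point appears as the 2nd
# field of /proc/mounts-style input on stdin, exit 1 otherwise.
is_mounted() {
  awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }'
}

# Possible usage on the device:
#   is_mounted /firmware < /proc/mounts && echo "/firmware mounted" || echo "/firmware NOT mounted"
#   is_mounted /data < /proc/mounts && echo "/data mounted" || echo "/data NOT mounted"
```

If /firmware turns out to be mounted and modem.mdt is present, the modem error is probably a separate symptom.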
I just encountered this within the first hour of using a new Comma 3X w/ stock openpilot.
Steps to repro:
- First boot.
- Install stock openpilot.
- Take 2 drives.
- Software -> Uninstall
Now stuck at this screen and touch doesn't work. How should I proceed?
Simply powering off and rebooting made touch work again. Then I pressed Confirm and everything worked as expected. But it doesn't inspire confidence...