balenahup
upgrade-2.x: Allow to upgrade OS with custom image
Allow updating an OS with a custom image located on a Docker Hub private registry by providing parameters starting with --private.
Updating back to the previous version (via delta) restores the missing files. Updating again to the new, faulty version (via delta) then worked.
So it appears to be part of the update process, not the build process.
Confirmed it's not always the same files. From a different hub:
gateway.version gateway.zip nightlight.version nightlight.zip process_manager.version process_manager.zip processor.version wellness.version
This one got nightlight but no wellness.
Had another occurrence of this today.
In this case the VPN service was interrupted during the upgrade (it appears to have been an IP routing error: no route to AWS IPs).
We ended up missing a couple of files on disk (files that existed in both the previous version and the new one).
I assume the VPN interruption was the cause (and not an effect), as it continued for a while after.
Hey, you're saying the device (hub) pulls all files when not using deltas, but in rare occurrences the same release, downloaded via delta, was missing some data (zip files).
Is this correct? I have never heard of this happening and it seems extremely unlikely, but if you're saying it's happening I will be sure to look into it.
We have seen it enough times now that we can be certain it's a bug.
We have easily done upwards of 10k delta updates over our time (and I've only noticed / can recall this 3 times), so it's rare for sure.
This is the first time I've observed something like a VPN failure during the update (which may be the cause).
@splitice when this happens does it happen across your devices or just some of them?
Just one (at random)
If this happens again can you please reach out to support to investigate the specific device? We'd be interested in its usage patterns, update frequency, its environment, etc. It sounds less like a Supervisor/deltas issue and more like some kind of file system blip.
We had this happen again.
I was able to confirm it's on the client delta side, as a rollback to the previous version and then a re-upgrade to the target version resolved the issue.
Strangely, the files that are missing are again some of the largest on the disk.
Since the last time we saw this we have done probably 1K or so delta updates.
The balena-side logs are free of any errors:
Downloading delta for image 'registry2.balena-cloud.com/v2/65ef50a0acb7d255fa0c877896894756@sha256:dc10f11c8b900f30b1e4a30192d79917e0b53326fdcd8245ebb091b5f533934a'
Downloaded image 'registry2.balena-cloud.com/v2/65ef50a0acb7d255fa0c877896894756@sha256:dc10f11c8b900f30b1e4a30192d79917e0b53326fdcd8245ebb091b5f533934a'
Killing service 'main sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08'
Supervisor also looks good:
[debug] delta([main] registry2.balena-cloud.com/v2/65ef50a0acb7d255fa0c877896894756@sha256:dc10f11c8b900f30b1e4a30192d79917e0b53326fdcd8245ebb091b5f533934a): Delta applied successfully
[event] Event: Image downloaded {"image":{"name":"registry2.balena-cloud.com/v2/944d017e1c2f579f87bbce877dfde060@sha256:189a5926cc2c3020617e584e519663de6184bfebb2215c925d242f5d17cc0d10","appId":738415,"appUuid":"00fa17057f6c46ec987cabdf90fbe8ef","serviceId":2092,"serviceName":"main","imageId":5847659,"releaseId":2410051,"commit":"96ab71d67aa456b9a555cc0d8d1ff2dc","dependent":0}}
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn] Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug] Replacing container for service main because of config changes:
[debug] Non-array fields: {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[debug] Replacing container for service main because of config changes:
[debug] Non-array fields: {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[event] Event: Service kill {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"437832610aa0ae4a9904a0be595cdb11","releaseId":2400727}}
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn] Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug] Replacing container for service main because of config changes:
[debug] Non-array fields: {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn] Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug] Replacing container for service main because of config changes:
[debug] Non-array fields: {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[info] Reported current state to the cloud
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn] Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug] Replacing container for service main because of config changes:
[debug] Non-array fields: {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn] Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug] Replacing container for service main because of config changes:
[debug] Non-array fields: {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn] Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug] Replacing container for service main because of config changes:
[debug] Non-array fields: {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn] Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug] Replacing container for service main because of config changes:
[debug] Non-array fields: {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn] Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[event] Event: Service exit {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"437832610aa0ae4a9904a0be595cdb11","releaseId":2400727}}
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[event] Event: Service stop {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"437832610aa0ae4a9904a0be595cdb11","releaseId":2400727}}
[debug] Replacing container for service main because of config changes:
[debug] Non-array fields: {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn] Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[event] Event: Image removal {"image":{"name":"registry2.balena-cloud.com/v2/65ef50a0acb7d255fa0c877896894756@sha256:dc10f11c8b900f30b1e4a30192d79917e0b53326fdcd8245ebb091b5f533934a","appId":738415,"serviceId":2092,"serviceName":"main","imageId":5817405,"releaseId":2400727,"dependent":0,"appUuid":"00fa17057f6c46ec987cabdf90fbe8ef","commit":"437832610aa0ae4a9904a0be595cdb11"}}
[event] Event: Image removed {"image":{"name":"registry2.balena-cloud.com/v2/65ef50a0acb7d255fa0c877896894756@sha256:dc10f11c8b900f30b1e4a30192d79917e0b53326fdcd8245ebb091b5f533934a","appId":738415,"serviceId":2092,"serviceName":"main","imageId":5817405,"releaseId":2400727,"dependent":0,"appUuid":"00fa17057f6c46ec987cabdf90fbe8ef","commit":"437832610aa0ae4a9904a0be595cdb11"}}
[info] Reported current state to the cloud
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn] Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn] Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[event] Event: Service install {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"96ab71d67aa456b9a555cc0d8d1ff2dc","releaseId":2410051}}
[event] Event: Service installed {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"96ab71d67aa456b9a555cc0d8d1ff2dc","releaseId":2410051}}
[event] Event: Service start {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"96ab71d67aa456b9a555cc0d8d1ff2dc","releaseId":2410051}}
[event] Event: Service started {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"96ab71d67aa456b9a555cc0d8d1ff2dc","releaseId":2410051}}
Nothing else of relevance that I can see in journalctl.
We can also confirm that the files are in the image, as we have an ls -las in the build script to confirm it (these specific files have gone missing from delta updates before).
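For context, the build-time check is roughly the following (a minimal sketch, not our exact script; the bundle directory path is a placeholder):
#!/bin/bash
# Sketch of the build-step verification (paths are placeholders for the real layout).
# ls exits non-zero if a file is missing, so with set -e the build fails loudly,
# and the listing also leaves the file sizes in the build log for later comparison.
set -euo pipefail
BUNDLE_DIR=/opt/app/bundles
for f in gateway.zip nightlight.zip process_manager.zip processor.zip wellness.zip; do
    ls -las "$BUNDLE_DIR/$f"
done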
@splitice Thanks for continuing to update this. Do you have a reproduction that we could use to debug?
@cywang117 this has been a long-term troubleshooting issue for us. It's so infrequent that it's quite difficult to test for.
Today we had a fresh installation (first install) that was missing a valid /etc/passwd.
The file contained what I expect is the base layer's default (missing all the users we add in our layer). This was after a device power cycle on the first install.
This got me thinking that the issue might be more related to balena-engine.
So what I did was take away the device's internet access (from this point I used serial), so nothing could be re-pulled from the balena container registry, and then I deleted the container with balena rm.
The balena supervisor re-created the container and it had all the correct files. So the image is correct! It's just the container that is wrong.
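For reference, this is roughly the sequence of checks (balena-engine mirrors the Docker CLI; the container and image names below are placeholders):
# Compare what the running container sees with what a fresh container from the
# same image sees (CONTAINER and IMAGE are placeholders).
balena exec "$CONTAINER" ls -las /etc/passwd     # view from the existing (suspect) container
balena run --rm "$IMAGE" ls -las /etc/passwd     # view from a brand-new container of the same image
# What I actually did: remove the container and let the supervisor recreate it,
# with networking disabled so nothing could be re-pulled from the registry.
balena rm -f "$CONTAINER"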
My thoughts for potential causes:
- Some sort of sync bug. Perhaps a filesystem sync is needed at the end of the image extraction?
- Perhaps there is an issue with the image extraction? Is this different for delta images than regular images?
If you agree with my finger pointing towards balena engine I'm happy to move this ticket there.
Still actively trying to catch this issue when it occurs (it happened again today). It's incredibly rare (between January and today we have done hundreds of container updates, mostly delta).
This time the file that was incorrect was /etc/passwd; specifically, the /etc/passwd from a parent layer was found in place of the /etc/passwd present in the image. This is a hint for us, as in the past the file was always simply missing (and it would also not have been present in parent layers).
I was able to verify the image was correct: I exported the image from the host OS using balena image save. When inspected, the layer tar (the only one in this image) contained the correct /etc/passwd. So where did this other /etc/passwd (which looks like the default for a Debian image) come from? It's not in the image produced by the balena builder (we use a build & scratch image to create a single-layer image).
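For the record, roughly how I inspected it (a sketch from memory, assuming the classic docker-save layout that balena-engine produces; the member path inside the layer tar may differ slightly):
# IMAGE_ID is the id shown by both the delta-* and latest tags.
balena image save "$IMAGE_ID" -o image.tar
mkdir -p image /tmp/layer && tar -xf image.tar -C image
# Classic docker-save layout: one layer.tar per layer (this image only has one).
# The member path may need a leading ./ depending on how the layer was built.
find image -name layer.tar -exec tar -xf {} -C /tmp/layer etc/passwd \;
cat /tmp/layer/etc/passwd    # contained the correct users, unlike the running container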
I'm starting to think that this is an issue with delta creation rather than on the device side due to the seeming impossibility of this file being introduced any other way.
Happy to do something to authorize balena to take a look.
Release: e2a4743cfcf45f32154d0a080746e457 (id: 2672092) and the balena fleet device was upgraded using the delta from 3b28561d13a260dd935ce13364a96637
What I think is that in very rare circumstances the build time delta is wrong. If I had more time today I would take a look at the balena-supervisor delta download API and see if I could get the delta file (and check its contents).
I'm starting to think that this is an issue with delta creation rather than on the device side due to the seeming impossibility of this file being introduced any other way.
If this does not happen consistently across all devices that update between the same releases, it cannot be an issue with delta generation.
What Supervisor version(s) did you hit this with most recently? Can you share the fleet ID and enable support access so I can take a look myself?
I was able to verify the image was correct, I exported the image from the hostos using balena image save. [...]
Which image did you export? The one from the new release after the delta has been applied?
@dfunckt I can't reliably attest to the state of other devices (I didn't check, and I'm not sure there are any; most within the team pick and choose their updates). In the past though it seemed to be only a single device failing.
The Supervisor version is a bit on the old side (14.0.8), however this issue has been fairly consistent for years. We try to keep our BSP within a year when possible; we are currently behind on this and intend to rectify it soon.
I exported the image the container was running. The device in question (and the output below) has since been rolled forward to a new image (which cleared the issue for the device), but I exported the image id as indicated by both the delta-* and latest tags (same id):
registry2.balena-cloud.com/v2/7dc307cd16e6f24ebfd08424056304df delta-1f059135aaa2521e 24399faeb9b2 28 minutes ago 354MB
registry2.balena-cloud.com/v2/7dc307cd16e6f24ebfd08424056304df latest 24399faeb9b2 28 minutes ago 354MB
Fleet ID is 738415 and support access is enabled.
Other than erroneous or incomplete delta generation, I'm struggling to think how an /etc/passwd without the users added within the Dockerfile (e.g. postgresql and halley) could make it through to a balena device. Not only does the image contain these users, there isn't likely any reason for the device itself to change the contents of this file (and there are limited services with permissions to modify it anyway).
I guess it could be corruption between different images (say balena supervisor, the only other image in our engine). Perhaps I should have saved the contents of the incorrect file for comparison (sorry, did not think of that).
Supervisor version is a bit on the old side (14.0.8) however this issue has been fairly consistent for years. [...]
I'm only interested in the Supervisor version to infer whether the device is using balena-engine deltas or the older rsync-based implementation. The Supervisor is doing nothing more than just pulling an image to apply deltas -- everything happens within the engine.
I exported the image id as indicated by both the delta-* and latest tags (same id)
You're saying that the same exact image, when exported, has all the files as expected, but when used to start a container it's missing some files. This suggests to me some kind of engine store weirdness. That's very strange. cc @lmbarros
Other than erroneous or incomplete delta generation, I'm struggling to think how an /etc/passwd without the users added within the Dockerfile (e.g. postgresql and halley) could make it through to a balena device. Not only does the image contain these users, there isn't likely any reason for the device itself to change the contents of this file (and there are limited services with permissions to modify it anyway).
If it's delta corruption then we should be able to reproduce it consistently just by pulling the release A image locally, applying the delta to get the release B image, and starting a container -- we should always get a container that can't see some of the files. But this isn't the case, as I understand it.
Fleet ID is 738415 and support access is enabled.
Thanks, I'm taking a quick look.
I'm exhausted after a long day, but I'll see if I can rig something up to do updates in a loop (say, back and forth between these two releases). Just my device, so as to not interrupt the team.
This issue has been observed for too long, and its difficulty has led to it evading effort.
@dfunckt It took me a bit longer than expected to rig it up due to some issues with the builders today. I wanted a clean build with limited changes to alternate between.
Got a device cycling between commit "0b3cdb96ab5642bdb1ae79c5561d3efe" with build ID "2673622" and commit "cffbb44d30062e7b56ab5f5c73fd20f1" with build ID "2674797" every 15m (30m for the complete cycle), with a quick basic health check in between.
If /etc/passwd is invalid like it was, the health check (HTTP check) will fail and the loop will exit. Other filesystem issues won't be detected though, as checking those would be more difficult.
If this passes after a couple of days I'll do a commit that changes /etc/passwd in a minor but valid way (perhaps the issue only occurs if the file is modified), say adding an extra user, and test alternating between those releases.
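The loop itself is nothing fancy, roughly the following (sketch only; it assumes the balena CLI is logged in and that a device-pin command is available, and the device UUID and health-check URL are placeholders):
#!/bin/bash
# Sketch of the soak loop. Assumes "balena device pin <uuid> <release-commit>" is
# available in the CLI (otherwise the same can be done via the API by setting the
# device's pinned release). UUID and HEALTH_URL are placeholders.
set -euo pipefail
UUID="aaaaaaaa"
RELEASES=("0b3cdb96ab5642bdb1ae79c5561d3efe" "cffbb44d30062e7b56ab5f5c73fd20f1")
HEALTH_URL="http://device.local:8080/health"   # hypothetical endpoint

i=0
while true; do
    commit="${RELEASES[$(( i % 2 ))]}"
    balena device pin "$UUID" "$commit"
    sleep 15m                                   # give the delta update time to apply
    if ! curl -fsS --max-time 10 "$HEALTH_URL" > /dev/null; then
        echo "Health check failed after pinning to $commit, stopping loop" >&2
        exit 1
    fi
    i=$(( i + 1 ))
done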
A quick look at the numbers involved.
We have done 4,327 builds since the start of our usage of balena; not every one has been successful, but still a lot (wow). Fleet size has averaged around 5 in development (the production fleet gets only limited stable updates) and we have likely done in excess of 20,000 delta updates. And we have only seen 3-5 filesystem failures (and only once has it been /etc/passwd), possibly more if it can go unnoticed. It could be quite difficult to replicate with my proposed method.
Thinking about the different (observed) instances of the issue that I can remember:
- Twice the missing files have been application bundles (zip files which are used to install content to /data if changed): these change regularly, and are opened on boot but not regularly touched beyond that.
- Once it was /etc/passwd: it rarely changes (and doesn't appear to have changed in md5sum in recent times), but is opened regularly.
From what I can see there are the following possibilities:
- Delta application error (say the container starts before files are completely written, or something)
- Delta generation error (seems less likely due to it not affecting all devices)
- Engine error (no idea)
- Overlay filesystem error (potentially?)
- ext4 filesystem error (not likely, would probably just give garbled content)
- mmc error (not likely, would probably just give garbled content)
I wonder if it's something like the file (within the image) being open and exclusively locked, preventing the application of the delta?
Or perhaps something related to the container's working layer and removed files.
I really don't know enough about the delta upgrade process to say.
At potentially 1/4,000 updates this is a very rare condition.
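If it happens again, one extra thing worth capturing is whether the container's writable layer is shadowing (or whiting out) the file, rather than the file being absent from the image layers. A rough way to check from the host OS, assuming the overlay2 storage driver (container name is a placeholder):
# Locate the container's writable (upper) overlay directory via the engine.
UPPER=$(balena inspect --format '{{ .GraphDriver.Data.UpperDir }}' "$CONTAINER")
ls -las "$UPPER/etc/passwd" 2>/dev/null   # present here => file written/overridden at container level
find "$UPPER" -type c | head              # overlayfs whiteouts (char devices) => files deleted at container level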