
upgrade-2.x: Allow upgrading the OS with a custom image

quentin-burg opened this issue 3 years ago

Allow updating an OS with a custom image located in a private Docker Hub registry by providing parameters starting with --private.
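
For example (a sketch of the proposed usage only; the --private-* flag names are the new parameters being requested here and do not exist yet, while --hostos-version is, as far as I know, an existing option):

./upgrade-2.x.sh --hostos-version 2.85.2 \
    --private-registry <registry-host> \
    --private-username <user> \
    --private-password <password>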

quentin-burg avatar Oct 26 '21 14:10 quentin-burg

Updating back to the previous version (delta) restores the missing files. Updating (again) to the new faulty version (delta) worked.

So it appears to be part of the update process, not the build process.

splitice avatar Nov 09 '20 01:11 splitice

Confirmed it's not always the same files. This one is from a different hub:

gateway.version  gateway.zip  nightlight.version  nightlight.zip  process_manager.version  process_manager.zip  processor.version  wellness.version

This one got nightlight but no wellness.

splitice avatar Nov 09 '20 02:11 splitice

Had another occurrence of this today.

In this case the VPN service was interrupted during the upgrade (appears to have been an IP routing error, no route to AWS IPs).

We ended up missing a couple of files on disk (files that existed in both the previous version and the new one).

I assume the VPN interruption was the cause (and not an effect) as it continued for a while after.

splitice avatar Mar 28 '22 05:03 splitice

Hey, you're saying the device (hub) pulls all files fine when not using deltas, but in rare occurrences the same release, when downloaded via delta, was missing some data (zip files).

Is this correct? I have never heard of this happening and it seems extremely unlikely, but if you're saying it's happening I will be sure to look into it.

20k-ultra avatar Mar 29 '22 03:03 20k-ultra

We have seen it enough times now that we can be certain of a bug.

We have easily done upwards of 10k delta updates over our time (and seen this only 3 times that I've noticed / can recall), so it's rare for sure.

This is the first time I've observed something like a VPN failure during the update (that may be the cause).

splitice avatar Mar 29 '22 04:03 splitice

@splitice when this happens does it happen across your devices or just some of them?

dfunckt avatar Mar 29 '22 06:03 dfunckt

Just one (at random)

splitice avatar Mar 29 '22 07:03 splitice

If this happens again can you please reach out to support to investigate the specific device? We'd be interested in its usage patterns, update frequency, its environment, etc. It sounds less like a Supervisor/deltas issue and more like some kind of file system blip.

dfunckt avatar Mar 29 '22 07:03 dfunckt

We had this happen again.

I was able to confirm it's on the client delta side, as a rollback to the previous version, then a re-upgrade to the target version, resolved the issue.

Strangely, the files that are missing are again some of the largest on the disk.

Since last time we saw this we have done probably 1K or so delta updates.

The balena-side logs are free of any anomalous messages:

Downloading delta for image 'registry2.balena-cloud.com/v2/65ef50a0acb7d255fa0c877896894756@sha256:dc10f11c8b900f30b1e4a30192d79917e0b53326fdcd8245ebb091b5f533934a'
Downloaded image 'registry2.balena-cloud.com/v2/65ef50a0acb7d255fa0c877896894756@sha256:dc10f11c8b900f30b1e4a30192d79917e0b53326fdcd8245ebb091b5f533934a'
Killing service 'main sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08'

Supervisor also looks good:

[debug]   delta([main] registry2.balena-cloud.com/v2/65ef50a0acb7d255fa0c877896894756@sha256:dc10f11c8b900f30b1e4a30192d79917e0b53326fdcd8245ebb091b5f533934a): Delta applied successfully
[event]   Event: Image downloaded {"image":{"name":"registry2.balena-cloud.com/v2/944d017e1c2f579f87bbce877dfde060@sha256:189a5926cc2c3020617e584e519663de6184bfebb2215c925d242f5d17cc0d10","appId":738415,"appUuid":"00fa17057f6c46ec987cabdf90fbe8ef","serviceId":2092,"serviceName":"main","imageId":5847659,"releaseId":2410051,"commit":"96ab71d67aa456b9a555cc0d8d1ff2dc","dependent":0}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[event]   Event: Service kill {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"437832610aa0ae4a9904a0be595cdb11","releaseId":2400727}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[info]    Reported current state to the cloud
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[event]   Event: Service exit {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"437832610aa0ae4a9904a0be595cdb11","releaseId":2400727}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[event]   Event: Service stop {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"437832610aa0ae4a9904a0be595cdb11","releaseId":2400727}}
[debug]   Replacing container for service main because of config changes:
[debug]     Non-array fields:  {"added":{},"deleted":{},"updated":{"image":"sha256:2e5ed8af695875631e9382805f00d0fa98bc6f6995c6e0d99c5ac537215bfd08","environment":{"CONTAINER_VERSION":"v0.1.0-build+c79074772f07e1a1cdecc1c1fd4630c7052298f0"}}}
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[event]   Event: Image removal {"image":{"name":"registry2.balena-cloud.com/v2/65ef50a0acb7d255fa0c877896894756@sha256:dc10f11c8b900f30b1e4a30192d79917e0b53326fdcd8245ebb091b5f533934a","appId":738415,"serviceId":2092,"serviceName":"main","imageId":5817405,"releaseId":2400727,"dependent":0,"appUuid":"00fa17057f6c46ec987cabdf90fbe8ef","commit":"437832610aa0ae4a9904a0be595cdb11"}}
[event]   Event: Image removed {"image":{"name":"registry2.balena-cloud.com/v2/65ef50a0acb7d255fa0c877896894756@sha256:dc10f11c8b900f30b1e4a30192d79917e0b53326fdcd8245ebb091b5f533934a","appId":738415,"serviceId":2092,"serviceName":"main","imageId":5817405,"releaseId":2400727,"dependent":0,"appUuid":"00fa17057f6c46ec987cabdf90fbe8ef","commit":"437832610aa0ae4a9904a0be595cdb11"}}
[info]    Reported current state to the cloud
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[warn]    Failed to read splash image: Error: ENOENT: no such file or directory, open '/mnt/root/mnt/data/splash/balena-logo-default.png'
[warn]    Could not initialise splash image backend Error: ENOENT: no such file or directory, scandir '/mnt/root/mnt/data/splash'
[event]   Event: Service install {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"96ab71d67aa456b9a555cc0d8d1ff2dc","releaseId":2410051}}
[event]   Event: Service installed {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"96ab71d67aa456b9a555cc0d8d1ff2dc","releaseId":2410051}}
[event]   Event: Service start {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"96ab71d67aa456b9a555cc0d8d1ff2dc","releaseId":2410051}}
[event]   Event: Service started {"service":{"appId":738415,"serviceId":2092,"serviceName":"main","commit":"96ab71d67aa456b9a555cc0d8d1ff2dc","releaseId":2410051}}

Nothing else of relevance that I can see in journalctl.

splitice avatar Dec 14 '22 03:12 splitice

We can also confirm that the files are in the image, as we have an ls -las in the build script to confirm them (these specific files have gone missing from delta updates before).
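
For reference, the check is nothing more than a listing at the end of the build, roughly like this (a sketch; the paths are illustrative, not our real ones):

# final step of the build; ls exits non-zero if any listed file is missing,
# which fails the build step
ls -las /app/gateway.zip /app/nightlight.zip /app/wellness.zip /etc/passwd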

splitice avatar Dec 14 '22 03:12 splitice

@splitice Thanks for continuing to update this. Do you have a reproduction that we could use to debug?

cywang117 avatar Dec 19 '22 20:12 cywang117

@cywang117 this has been a long-term troubleshooting issue for us. It's so infrequent that it's quite difficult to test for.

Today we had a fresh installation (first install) that was missing a valid /etc/passwd; the file contained what I expect is the base layer's default (missing all the users we add in our layer). This was after a device power cycle on the first install.

This got me thinking that the issue might be more related to balena-engine.

So what I did was take away the device's internet access. From this point on I used serial. No re-pulling of images or anything from the balena container registry. And I deleted the container with balena rm.

Balena supervisor re-created the container and it had all the correct files. So the image is correct! It's just the container that is wrong.
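
For anyone wanting to repeat the check, it boils down to comparing what the broken container sees against a fresh container created from the same image (a sketch; the container and image names are placeholders):

balena exec <broken-container> md5sum /etc/passwd   # what the live container sees
balena run --rm <image-id> md5sum /etc/passwd       # what a fresh container from the same image sees
# if these differ, the image store is fine and only the broken container's layers are bad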

My thoughts for potential causes:

  1. Some sort of sync bug. Perhaps a filesystem sync is needed at the end of the image extraction? (See the sketch after this list for one way to test this.)
  2. Perhaps there is an issue with the image extraction? Is this different for delta images than for regular images?
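
One way to test the sync hypothesis from the host OS (a sketch only; run right after a delta pull completes, before the service starts; the image id is a placeholder):

sync                                # flush all dirty pages to disk
echo 3 > /proc/sys/vm/drop_caches   # drop page/dentry/inode caches so reads hit the disk again
balena run --rm <new-image-id> md5sum /etc/passwd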

If you agree with my finger-pointing towards balena-engine, I'm happy to move this ticket there.

splitice avatar Jan 27 '23 02:01 splitice

Still actively trying to find this issue when it occurs (it happened again today). It's incredibly rare (between January and today we have done hundreds of container updates, mostly delta).

This time the file that was incorrect was /etc/passwd; specifically, the /etc/passwd from a parent layer was found in place of the /etc/passwd present in the image. This is a hint to us, as in the past the affected file was always missing entirely (but it would also not be present in parent layers).

I was able to verify the image was correct: I exported the image from the host OS using balena image save. When inspected, the layer tar (the only one in this image) contained the correct /etc/passwd. So where did this other /etc/passwd (which looks like the default for a Debian image) come from? It's not in the image produced by the balena builder (we use a build stage plus a scratch image to create a single-layer image).
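
The verification itself was straightforward (a sketch of the steps; this assumes the classic docker-save layout where each layer sits as <layer>/layer.tar inside the archive, and the image id is a placeholder):

balena image save <image-id> -o /tmp/exported.tar
mkdir /tmp/exported && tar -xf /tmp/exported.tar -C /tmp/exported
# single-layer image, so the glob matches exactly one layer.tar;
# -O extracts the member to stdout for inspection
tar -xOf /tmp/exported/*/layer.tar etc/passwd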

I'm starting to think that this is an issue with delta creation rather than on the device side due to the seeming impossibility of this file being introduced any other way.

Happy to do something to authorize balena to take a look.

Release: e2a4743cfcf45f32154d0a080746e457 (id: 2672092); the balena fleet device was upgraded using the delta from 3b28561d13a260dd935ce13364a96637.

What I think is that in very rare circumstances the build-time delta is wrong. If I had more time today I would take a look at the balena-supervisor delta download API and see if I could get the delta file (and check its contents).

splitice avatar Aug 01 '23 11:08 splitice

I'm starting to think that this is an issue with delta creation rather than on the device side due to the seeming impossibility of this file being introduced any other way.

If this does not happen consistently across all devices that update between the same releases, it cannot be an issue with delta generation.


What Supervisor version(s) did you hit this with most recently? Can you share the fleet ID and enable support access so I can take a look myself?

dfunckt avatar Aug 01 '23 11:08 dfunckt

I was able to verify the image was correct: I exported the image from the host OS using balena image save. [...]

Which image did you export? The one from the new release after the delta has been applied?

dfunckt avatar Aug 01 '23 11:08 dfunckt

@dfunckt I can't reliably attest to the state of other devices (I didn't check, and I'm not sure there are more; most within the team pick and choose their updates). In the past though it seemed to only be a single device failing.

Supervisor version is a bit on the old side (14.0.8) however this issue has been fairly consistent for years. We keep our BSP within a year when possible; we are currently behind on this and intend to rectify it soon.

I exported the image the container was running from. The device in question (and the output below) has since been rolled forward to a new image (which cleared the issue for the device), but I exported the image id as indicated by both the delta-* and latest tags (same id):

registry2.balena-cloud.com/v2/7dc307cd16e6f24ebfd08424056304df   delta-1f059135aaa2521e   24399faeb9b2        28 minutes ago      354MB
registry2.balena-cloud.com/v2/7dc307cd16e6f24ebfd08424056304df   latest                   24399faeb9b2        28 minutes ago      354MB

Fleet ID is 738415 and support access is enabled.

Other than erroneous or incomplete delta generation, I'm struggling to think of how an /etc/passwd without the users added within the Dockerfile (e.g. postgresql and halley) could make it through to a balena device. Not only does the image contain these users, there isn't likely any reason for the device itself to change the contents of this file (and only limited services have permission to modify it anyway).

I guess it could be corruption between different images (say the balena supervisor, the only other image in our engine). Perhaps I should have saved the contents of the incorrect file for comparison (sorry, I did not think of that).

splitice avatar Aug 01 '23 11:08 splitice

Supervisor version is a bit on the old side (14.0.8) however this issue has been fairly consistent for years. [...]

I'm only interested in the Supervisor version to infer whether the device is using balena-engine deltas or the older rsync-based implementation. The Supervisor is doing nothing more than just pulling an image to apply deltas -- everything happens within the engine.

I exported the image id as indicated by both the delta-* and latest tags (same id)

You're saying that the same exact image, when exported, has all the files as expected, but a container started from it is missing some files. This suggests to me some kind of engine store weirdness. That's very strange. cc @lmbarros

Other than erroneous or incomplete delta generation, I'm struggling to think of how an /etc/passwd without the users added within the Dockerfile (e.g. postgresql and halley) could make it through to a balena device. Not only does the image contain these users, there isn't likely any reason for the device itself to change the contents of this file (and only limited services have permission to modify it anyway).

If it's delta corruption then we should be able to consistently reproduce it just by pulling the release A image locally, applying the delta to get the release B image, and starting a container -- we should always get a container that can't see some of the files. But this isn't the case as I understand it.
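
Roughly like this (a sketch; balena-engine does ship a local delta command, though the exact flags and whether the result can be applied locally this way is an assumption on my part, and the image names are placeholders):

balena-engine pull <release-A-image>                            # base
balena-engine pull <release-B-image>                            # target
balena-engine image delta <release-A-image> <release-B-image>   # generate the delta locally
# on a machine that has only release A, applying this delta should always yield
# release B bit-for-bit; a container started from it should never be missing files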

Fleet ID is 738415 and support access is enabled.

Thanks, I'm taking a quick look.

dfunckt avatar Aug 01 '23 12:08 dfunckt

I'm exhausted after a long day, but I'll see if I can rig something up to do updates in a loop (say, back and forth between these two releases). Just my device, so as not to interrupt the team.

This issue has been observed for too long, and its difficulty has led to it evading effort.

splitice avatar Aug 01 '23 12:08 splitice

@dfunckt Took me a bit longer than expected to rig it up due to some issues with the builders today. I wanted a clean build with limited changes to alternate between.

Got a device cycling between commit "0b3cdb96ab5642bdb1ae79c5561d3efe" with buildID "2673622" and commit "cffbb44d30062e7b56ab5f5c73fd20f1" with buildID "2674797" every 15m (30m for the complete cycle), with a quick basic health check in between.

If /etc/passwd is invalid like it was, the health check (an HTTP check) will fail and the loop will exit. Other filesystem issues won't be detected, though, as checking for those would be more difficult.

If this passes after a couple of days, I'll do a commit that changes /etc/passwd in a minor but valid way (perhaps the issue only occurs if the file is modified), say adding an extra user, and test alternating between those releases.
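
Roughly what the loop looks like (a sketch; the device UUID and health URL are placeholders, and I'm using balena device pin from the balena CLI to move the device between the two releases):

while true; do
  balena device pin <device-uuid> 0b3cdb96ab5642bdb1ae79c5561d3efe
  sleep 900                                        # 15m for the update to apply
  curl -fsS "http://<device-ip>/health" || break   # stop the loop on a failed health check
  balena device pin <device-uuid> cffbb44d30062e7b56ab5f5c73fd20f1
  sleep 900
  curl -fsS "http://<device-ip>/health" || break
done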

splitice avatar Aug 02 '23 09:08 splitice

A quick look at the numbers involved.

We have done 4,327 builds since the start of our usage of balena; not every one has been successful, but that's still a lot (wow). Fleet size has averaged around 5 in development (the production fleet gets only limited stable updates), and we have likely done in excess of 20,000 delta updates. Yet we have only seen 3-5 filesystem failures (and only once has it been /etc/passwd), possibly more if it can go unnoticed. It could be quite difficult to replicate with my proposed method.

Thinking about the different (observed) instances of the issue that I can remember:

  • Twice the missing files have been application bundles (zip files which are used to install content to /data if changed): these change regularly, are opened on boot but not regularly touched beyond that.
  • Once /etc/passwd: rarely changes (and doesn't appear to have changed in md5sum in recent times), opened regularly.

From what I can see, there are the following possibilities:

  • Delta application error (say the container starts before files are completely written, or something)
  • Delta generation error (seems less likely due to it not affecting all devices)
  • Engine error (no idea)
  • Overlay filesystem error (potentially? see the sketch after this list)
  • ext4 filesystem error (not likely, would probably just give garbled content)
  • mmc error (not likely, would probably just give garbled content)
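
For the overlay possibility, this is roughly how I'd inspect an affected container next time (a sketch; overlay2 paths via the docker-style inspect, container name is a placeholder):

balena inspect --format '{{ .GraphDriver.Data.UpperDir }}' <container>   # container's writable layer
balena inspect --format '{{ .GraphDriver.Data.LowerDir }}' <container>   # image layers underneath
# overlay2 represents deleted files as 0:0 character-device whiteouts in the upper dir;
# a stray whiteout over /etc/passwd would make the container lose a file the image has
find "$(balena inspect --format '{{ .GraphDriver.Data.UpperDir }}' <container>)" -type c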

I wonder if it's something like the file (within the image) being open and exclusively locked, preventing the application of the delta?

Or perhaps something related to the container's working layer and removed files.

I really don't know enough about the delta upgrade process to say.

At potentially 1/4,000 updates this is a very rare condition.

splitice avatar Aug 03 '23 04:08 splitice