balena-engine icon indicating copy to clipboard operation
balena-engine copied to clipboard

Ignore missing config.v2.json files on migration

Open lmbarros opened this issue 2 years ago • 11 comments

TODO: One thing we should do in this PR: if every attempt of making the migration fails, simply remove all images and containers and let the device redownload everything. (This is actually how we designed the migration, but not the observed behavior.)

We have seen some cases in which the file /var/lib/docker/containers/<UUID>/config.v2.json was not found for certain UUIDs. This causes the whole migration to fail and, worse, causes our migration cleanup code to fail.

A missing config.v2.json indicates that directory does not contain a valid container, so we should not even try to migrate them. That's what this commit does: it makes the migration ignore directories with a missing config.v2.json.

I don't think we should have directories like this in the first place, this is probably the side effect of some other issue. That said, Docker itself ignores these directories (after logging a warning) during startup, so here are just bringing the Engine in line with the standard behavior.

Resolves #301

- What I did

- How I did it

- How to verify it

- Description for the changelog

lmbarros avatar Jul 14 '22 22:07 lmbarros

An error occurred whilst building your landr site preview:

{
  "name": "Error",
  "message": "Command failed with code undefined: /usr/src/app/node_modules/gatsby/cli.js build",
  "stack": "Error: Command failed with code undefined: /usr/src/app/node_modules/gatsby/cli.js build\n    at Object.exports.run (/usr/src/app/lib/build-runner.js:257:11)\n    at async build (/usr/src/app/bot/index.js:132:19)\n    at async /usr/src/app/bot/index.js:210:25\n    at async Promise.all (index 0)\n    at async middleware (/usr/src/app/node_modules/@octokit/webhooks/dist-node/index.js:355:5)"
}

ghost avatar Jul 14 '22 22:07 ghost

Totally untested as of now, therefore keeping PR as a draft.

lmbarros avatar Jul 14 '22 22:07 lmbarros

[alexgg] This has attached https://jel.ly.fish/9aaf5f96-265c-4c52-9e59-074e973b53f8

jellyfish-bot avatar Jul 19 '22 13:07 jellyfish-bot

I added a test and some error handling but not quite ready to merge this as my tests aren't working when I run individual tests.

The migrate_test.go tests pass when I run all the tests: make test-unit

But these all fail with package errors and GOROOT/GOPATH errors so I need to sort that out. make test-unit TESTDIRS=pkg make test-unit TESTDIRS=pkg/storagemigration make test-unit TESTDIRS=pkg/storagemigration/migrate_test.go

zoobot avatar Nov 12 '22 00:11 zoobot

Hey @zoobot ! I think you need to include the ./ prefix:

make test-unit TESTDIRS=./pkg/storagemigration

lmbarros avatar Nov 14 '22 14:11 lmbarros

That . worked! :) thanks @lmbarros

zoobot avatar Nov 14 '22 17:11 zoobot

Adding this here as reference for all the changes to make this happen: https://github.com/balena-os/balena-engine/pull/168/files

And the improvement: https://jel.ly.fish/product-improvement-migrate-device-storage-driver-from-aufs-to-overlayfs-22cf335f-b1f7-44f8-853d-064681c69fd8

zoobot avatar Nov 14 '22 23:11 zoobot

Update on this @lmbarros :

Was hoping to get a rpi3 to test on, Ross is sending one. Also trying to set up QEMU to test this on for quicker results, sorting out m1 QEMU issues.

Making changes today: Don't just delete single container layer that has missing config.v2.json

On migration failure due to missing config.v2.json: Remove all overlay2 layers, aufs layers remain

zoobot avatar Nov 21 '22 17:11 zoobot

@lmbarros I made the changes to test and cleanup so it removes all the overlay2 folders/containers on rollback, instead of just removing the single layer with the missing config.v2.json.

Should we wait until I have a RP3 device or QEMU to test this on or roll it out with the current go test? Do you always test on devices first?

Thanks for looking

zoobot avatar Nov 21 '22 23:11 zoobot

Also there's a typo in the changelog entry: "Rollback to ausf..."

lmbarros avatar Nov 22 '22 14:11 lmbarros

Should we wait until I have a RP3 device or QEMU to test this on or roll it out with the current go test? Do you always test on devices first?

I often do this because I am paranoid and like the ritual, but no, not necessarily. In fact, I think it's generally more important to have a good automated test case (either here or in meta-balena) capable of ensuring that we really fixed the issue.

lmbarros avatar Nov 22 '22 14:11 lmbarros