balena-supervisor

Container corruption can cause the supervisor to crash

Open pipex opened this issue 1 year ago • 1 comments

Seen on a device where balena inspect <container> returns:

Error response from daemon: readlink /var/lib/docker/overlay2/l/X4CB3KC5BWUW5JOBOFRZ2IQG24: no such file or directory

This causes the getAll call to throw here https://github.com/balena-os/balena-supervisor/blob/df027248fd9536107203da67b737b2e34f63eb84/src/compose/service-manager.ts#L82

Following the call chain upwards, the error surfaces during applyTarget (https://github.com/balena-os/balena-supervisor/blob/df027248fd9536107203da67b737b2e34f63eb84/src/device-state.ts#L869), which calls getCurrentState (https://github.com/balena-os/balena-supervisor/blob/df027248fd9536107203da67b737b2e34f63eb84/src/device-state.ts#L697), which eventually calls getAll.

None of these calls is wrapped in a try/catch, so the exception goes uncaught and the supervisor crashes, as shown in the log below.

Dec 13 19:36:37 1ed3b30 balena-supervisor[10470]: [info]    Previous engine snapshot was not stored. Skipping cleanup.
Dec 13 19:36:37 1ed3b30 balena-supervisor[10470]: [debug]   Handling of local mode switch is completed
Dec 13 19:36:37 1ed3b30 balena-supervisor[10470]: [error]   Uncaught exception: Error: (HTTP code 500) server error - readlink /var/lib/docker/overlay2/l/X4CB3KC5BWUW5JOBOFRZ2IQG24: no such file or directory
Dec 13 19:36:37 1ed3b30 balena-supervisor[10470]: [error]         at /usr/src/app/dist/app.js:2:643810
Dec 13 19:36:37 1ed3b30 balena-supervisor[10470]: [error]       at /usr/src/app/dist/app.js:2:643742
Dec 13 19:36:37 1ed3b30 balena-supervisor[10470]: [error]       at Modem.buildPayload (/usr/src/app/dist/app.js:2:643762)
Dec 13 19:36:37 1ed3b30 balena-supervisor[10470]: [error]       at IncomingMessage.<anonymous> (/usr/src/app/dist/app.js:2:643015)
Dec 13 19:36:37 1ed3b30 balena-supervisor[10470]: [error]       at IncomingMessage.emit (node:events:525:35)
Dec 13 19:36:37 1ed3b30 balena-supervisor[10470]: [error]       at endReadableNT (node:internal/streams/readable:1358:12)
Dec 13 19:36:37 1ed3b30 balena-supervisor[10470]: [error]       at processTicksAndRejections (node:internal/process/task_queues:83:21)
Dec 13 19:36:37 1ed3b30 systemd[1]: balena-supervisor.service: Main process exited, code=exited, status=1/FAILURE
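
To illustrate the missing guard, here is a minimal sketch of inspecting containers one at a time and collecting failures instead of letting the first rejection propagate. This is not the supervisor's actual code: `inspectAllTolerant`, `InspectResult`, and the `inspect` callback are hypothetical names standing in for the real engine call inside getAll.

```typescript
// Hypothetical result shape; the real supervisor has its own Service model.
interface InspectResult<T> {
  ok: T[];
  failed: { id: string; error: Error }[];
}

// Inspect every container, recording failures instead of throwing on the first one.
async function inspectAllTolerant<T>(
  ids: string[],
  inspect: (id: string) => Promise<T>, // stand-in for the engine inspect call
): Promise<InspectResult<T>> {
  const result: InspectResult<T> = { ok: [], failed: [] };
  for (const id of ids) {
    try {
      result.ok.push(await inspect(id));
    } catch (e) {
      // A corrupted container (e.g. a missing overlay2 link) lands here.
      const error = e instanceof Error ? e : new Error(String(e));
      result.failed.push({ id, error });
    }
  }
  return result;
}
```

With something like this, the caller would still have to decide what to do with the entries in `failed`, but the healthy containers would remain visible.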

The problem is that it is hard to know what to do in this situation, because the failure happens even before we can apply a target state. Even if the exception is caught, at best the supervisor (with the existing architecture) will have to keep looping until somebody can manually fix the device.

This is probably still better than crashing, as at least the supervisor can report the error and keep reporting the system state.
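
The "keep looping and report" behaviour could be sketched roughly as follows. The function name `retryUntilReadable`, the `report` callback, and the attempt and delay parameters are all assumptions made for illustration; the real supervisor would presumably hook this into its own state loop and reporting.

```typescript
// Retry a state read, reporting each failure instead of crashing the process.
// Returns undefined if the state is still unreadable after maxAttempts.
async function retryUntilReadable<S>(
  getCurrentState: () => Promise<S>, // stand-in for the real state read
  report: (message: string) => void, // stand-in for logging / API reporting
  maxAttempts: number,
  delayMs: number,
): Promise<S | undefined> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await getCurrentState();
    } catch (e) {
      const message = e instanceof Error ? e.message : String(e);
      report(`attempt ${attempt} failed: ${message}`);
      if (attempt < maxAttempts) {
        // Wait before the next attempt so the loop does not spin.
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  return undefined;
}
```

This does not repair the corrupted container; it only keeps the process alive and the error visible until the device is fixed manually.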

pipex, Dec 13 '23 20:12

[pipex] This has attached https://jel.ly.fish/2b9105be-8bf3-4fad-a75e-25cc17816baf

jellyfish-bot, Dec 13 '23 20:12