Engine socket on container becomes unusable after Engine crash

Open lmbarros opened this issue 2 years ago • 7 comments

If we start a container with the label io.balena.features.balena-socket: '1' set, the container gets access to the Engine socket. However, if the Engine crashes on the host OS, the container can no longer connect to the Engine, even after the Engine restarts on the host OS. Attempting to run Docker in the container then fails with:

Cannot connect to the Docker daemon at unix:///host/run/balena-engine.sock. Is the docker daemon running?

This can be easily reproduced by SIGKILLing balenad on the Host OS and then trying to run Docker or balenaEngine on a container where it was previously working.
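
The Docker CLI error above does not say why the connection failed. As a minimal sketch, assuming Python is available in the affected container, the following probe distinguishes a missing socket file from one that exists but refuses connections. The function name and the returned labels are illustrative and not part of any balena or Docker API; the thread does not state which of these outcomes the bug produces.

```python
import socket

# Path taken from the error message above; adjust if the socket is mounted elsewhere.
ENGINE_SOCKET = "/host/run/balena-engine.sock"


def probe_engine_socket(path: str = ENGINE_SOCKET, timeout: float = 2.0) -> str:
    """Try to connect to a Unix socket and report what happened.

    Returns "ok", "missing", "refused", or "error: <detail>".
    These labels are hypothetical names chosen for this sketch.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return "ok"
    except FileNotFoundError:
        # No file at this path inside the container.
        return "missing"
    except ConnectionRefusedError:
        # A socket file exists, but nothing is listening behind it.
        return "refused"
    except OSError as exc:
        # Anything else (permissions, timeout, not a socket, ...).
        return f"error: {exc}"
    finally:
        sock.close()
```

Calling `print(probe_engine_socket())` in the container before and after killing the Engine on the host would show whether the failure is a vanished path or a refused connection.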

This is arguably on the border between the Supervisor (that sets the mounts and shares up) and the Engine (that implements the mechanisms).

lmbarros, Dec 16 '21 19:12

[lmbarros] This issue has attached support thread https://jel.ly.fish/41b56e32-5fae-4a2e-b5bb-05f9f5af1f0f

jellyfish-bot, Dec 16 '21 19:12

I have an example of this issue here: https://github.com/machinemetrics/docker-socket

deanMike, Jan 11 '22 14:01

Another repro courtesy of @lmbarros: https://github.com/balena-io-playground/engine-on-container-socket-lost-test

cywang117, Jan 13 '22 20:01

Did a couple more quick tests:

  • SIGKILL leaves the socket unusable in the container, as we already knew.
  • SIGABRT gives the same result as SIGKILL. (This case might be of interest because that's what the watchdog sends on a timeout.)
  • SIGTERM is fine, however: after the Engine restarts on the host, the socket becomes usable again in the container.
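
One way to make these per-signal checks repeatable is to poll the socket from inside the container right after sending the signal on the host. The sketch below assumes Python is available in the container; `socket_usable` and `wait_until_usable` are hypothetical helper names, and the 30-second default deadline is an arbitrary choice, not a value from this thread.

```python
import socket
import time


def socket_usable(path: str, timeout: float = 1.0) -> bool:
    """Return True if a connection to the Unix socket at `path` succeeds."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        return False
    finally:
        sock.close()


def wait_until_usable(path: str, deadline_s: float = 30.0, interval_s: float = 1.0) -> bool:
    """Poll the socket until it accepts a connection or the deadline passes.

    Returns True if the socket became usable in time, False otherwise.
    """
    end = time.monotonic() + deadline_s
    while True:
        if socket_usable(path):
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(interval_s)
```

For example, running `wait_until_usable("/host/run/balena-engine.sock")` in the container after each signal would be expected to return True for SIGTERM and False for SIGKILL and SIGABRT, if the behaviour matches the results listed above.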

lmbarros, Jan 24 '22 19:01

I suspect this would be resolved by https://github.com/balena-os/balena-supervisor/pull/1780

klutchell, Feb 04 '22 18:02

klutchell wrote: "I suspect this would be resolved by balena-os/balena-supervisor#1780"

@klutchell Do you know if there's still a plan to get that fix in? If there's any way my team and I could help test this out, let us know. This issue has been a real thorn in our side.

deanMike, Feb 28 '22 14:02

Hey @deanMike, I have requested updates on the linked PR: https://github.com/balena-os/balena-supervisor/pull/1780

klutchell, Mar 02 '22 19:03