balena-engine
balena-engine copied to clipboard
Healthdog checks time out during image pulls
Description
Running:
- Device type: Raspberry Pi 4 (using 64bit OS) (BETA)
- OS version: balenaOS 2.41.0+rev4
- Supervisor version: 10.2.2
Trying to deploy the resin-electrons ( buster-rpi4-support branch) the update gets to ~90% on the device then just restarts, this has happened 4/5 times now.
There is a balena-engine crash captured in the logs attached in the support thread.
Output of balena-engine version
:
Docker version 18.09.8-dev, build 80d443d400dcc85e87322f72866593b46bafb157
Output of balena-engine info
:
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 3
Server Version: 18.09.8-dev
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: journald
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan null
Log: journald json-file local
Swarm:
NodeID:
Is Manager: false
Node Address:
Runtimes: bare runc
Default Runtime: runc
Init Binary: balena-engine-init
containerd version:
runc version: b3d5036b6520060db00c81352d533a5dd919aaaf
init version: 949e6fa-dirty (expected: fec3683b971d9)
Kernel Version: 4.19.66
Operating System: balenaOS 2.41.0+rev4
OSType: linux
Architecture: aarch64
CPUs: 4
Total Memory: 3.406GiB
Name: 06bdf8c
ID: E6H3:W2FD:A5TN:E5K2:LYTQ:KCLO:VOAK:XRRO:RYRG:W2HM:CA5D:6ZBQ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Diagnostics file is attached in the support thread.
[alexgg] This issue has attached support thread https://jel.ly.fish/fcee62fb-8f42-4cb3-af75-0056ee98fb96
related to balena-os/meta-balena#1676
we see the health-check taking longer when we simultaneously pulling the user image (during application updates), which leads to the engine being killed by healthdog, which in turn interrupts the update
@robertgzr From the log:
Dec 13 17:56:47 06bdf8c balenad[818]: goroutine 1 [chan receive, 12 minutes]:
Dec 13 17:56:47 06bdf8c balenad[818]: github.com/docker/docker/cmd/dockerd.(*DaemonCli).start(0x44207c68a0, 0x4420401440, 0x0, 0x0)
Dec 13 17:56:47 06bdf8c balenad[818]: /yocto/resin-board/build/tmp/work/aarch64-poky-linux/balena/18.09.8-dev+git80d443d400dcc85e87322f72866593b46bafb157-r0/git/src/import/.gopath/src/github.com/docker/docker/cmd/dockerd/daemon.go:222 +0x8ac
Dec 13 17:56:47 06bdf8c balenad[818]: github.com/docker/docker/cmd/dockerd.runDaemon(0x4420401440, 0x442043aa00, 0x442063bbc8)
Dec 13 17:56:47 06bdf8c balenad[818]: /yocto/resin-board/build/tmp/work/aarch64-poky-linux/balena/18.09.8-dev+git80d443d400dcc85e87322f72866593b46bafb157-r0/git/src/import/.gopath/src/github.com/docker/docker/cmd/dockerd/docker_unix.go:7 +0x38
Dec 13 17:56:47 06bdf8c balenad[818]: github.com/docker/docker/cmd/dockerd.newDaemonCommand.func1(0x4420439180, 0x442043ad00, 0x0, 0x10, 0x0, 0x0)
Dec 13 17:56:47 06bdf8c balenad[818]: /yocto/resin-board/build/tmp/work/aarch64-poky-linux/balena/18.09.8-dev+git80d443d400dcc85e87322f72866593b46bafb157-r0/git/src/import/.gopath/src/github.com/docker/docker/cmd/dockerd/docker.go:29 +0x54
Does it mean that balena start
command hangs for 12 minutes?
Looks like this is a goroutine from the daemon. To be specific the part that blocks until the server exits: https://github.com/balena-os/balena-engine/blob/master/cmd/dockerd/daemon.go#L222
[roman-mazur] This issue has attached support thread https://jel.ly.fish/#/0465ec1c-6bfd-41de-bcef-944150d7a71a
I have a similar issue whereby building images in local mode from balena cli uses so much resource/time that the engine is killed by supervisor
Incase it adds any value, I watched balena failing to build an image and timed the healthcheck whilst it was going.
I ran the first command just after it started building the final layer (the big one), the second one a little after the first command finished (it failed because the docker engine was killed). The final layer build started at about 17:22.
root@5cae8be:~# time /usr/lib/balena/balena-healthcheck && echo 'ok'
real 1m23.806s
user 0m0.565s
sys 0m0.248s
ok
root@5cae8be:~# time /usr/lib/balena/balena-healthcheck && echo 'ok'
balena: error during connect: Post http://%2Fvar%2Frun%2Fbalena-engine.sock/v1.39/containers/create: EOF.
See 'balena run --help'.
real 3m43.380s
user 0m0.586s
sys 0m0.183s
journal logs:
Jan 27 17:18:19 5cae8be 530149f91382[760]: [api] GET /v2/version 200 - 23.318 ms
Jan 27 17:18:21 5cae8be 530149f91382[760]: [api] GET /v2/local/device-info 200 - 24.049 ms
Jan 27 17:18:26 5cae8be balenad[760]: time="2021-01-27T17:18:26.881821637Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/75e4e08ef041dbad176cb12660bf67f85eb7c8704981ed7186ba606ebfb0aca1/shim.sock debug=false pid=5979
Jan 27 17:19:06 5cae8be balenad[760]: time="2021-01-27T17:19:06.200042498Z" level=info msg="shim reaped" id=75e4e08ef041dbad176cb12660bf67f85eb7c8704981ed7186ba606ebfb0aca1
Jan 27 17:19:06 5cae8be balenad[760]: time="2021-01-27T17:19:06.204545913Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
^[[DJan 27 17:19:58 5cae8be balenad[760]: time="2021-01-27T17:19:58.546626153Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/5a34fc269ae4332b67f2c0993c14972e443760408d426989c1da5c9a2eb28d91/shim.sock debug=false pid=6138
Jan 27 17:20:12 5cae8be balenad[760]: time="2021-01-27T17:20:12.732171921Z" level=info msg="shim reaped" id=5a34fc269ae4332b67f2c0993c14972e443760408d426989c1da5c9a2eb28d91
Jan 27 17:20:12 5cae8be balenad[760]: time="2021-01-27T17:20:12.736371535Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 27 17:21:06 5cae8be balenad[760]: time="2021-01-27T17:21:06.792391659Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/fc6e95269d87f479b26802c5935d6c8f32462563ba0b0cf25ad30b5c74e91018/shim.sock debug=false pid=6310
Jan 27 17:21:13 5cae8be balenad[760]: time="2021-01-27T17:21:13.117077178Z" level=info msg="shim reaped" id=fc6e95269d87f479b26802c5935d6c8f32462563ba0b0cf25ad30b5c74e91018
Jan 27 17:21:13 5cae8be balenad[760]: time="2021-01-27T17:21:13.121459552Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 27 17:21:14 5cae8be balenad[760]: time="2021-01-27T17:21:14.401231342Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/23efa581c61686255a7d5a5fde03679520d54bc7d406717fa25a3ce68169b39a/shim.sock debug=false pid=6395
Jan 27 17:21:18 5cae8be balenad[760]: time="2021-01-27T17:21:18.786080188Z" level=info msg="shim reaped" id=23efa581c61686255a7d5a5fde03679520d54bc7d406717fa25a3ce68169b39a
Jan 27 17:21:18 5cae8be balenad[760]: time="2021-01-27T17:21:18.789211370Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 27 17:21:49 5cae8be balenad[760]: time="2021-01-27T17:21:49.389093294Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/18c16b4988b1371279c4255ab8f7940ff7334b95400861965ea3e825ba833cc9/shim.sock debug=false pid=6538
Jan 27 17:21:55 5cae8be balenad[760]: time="2021-01-27T17:21:55.773499713Z" level=info msg="shim reaped" id=18c16b4988b1371279c4255ab8f7940ff7334b95400861965ea3e825ba833cc9
Jan 27 17:21:55 5cae8be balenad[760]: time="2021-01-27T17:21:55.777730004Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 27 17:22:01 5cae8be 530149f91382[760]: [api] GET /v1/healthy 200 - 9.276 ms
Jan 27 17:22:04 5cae8be balenad[760]: time="2021-01-27T17:22:04.298568314Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/b3c0e3c166b87c0743adf483393e4d225dde8ed6fcb7583a10b19e4a1f5124af/shim.sock debug=false pid=6727
Jan 27 17:22:49 5cae8be balenad[760]: time="2021-01-27T17:22:49.621108012Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/deec5dfb47be1ece7f905b0643a386eea6ddb5e1d5f2f116040301e1f09fbe31/shim.sock debug=false pid=6838
Jan 27 17:22:54 5cae8be balenad[760]: time="2021-01-27T17:22:54.195652422Z" level=info msg="shim reaped" id=deec5dfb47be1ece7f905b0643a386eea6ddb5e1d5f2f116040301e1f09fbe31
Jan 27 17:22:54 5cae8be balenad[760]: time="2021-01-27T17:22:54.198851729Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 27 17:26:39 5cae8be 530149f91382[760]: [debug] Attempting container log timestamp flush...
Jan 27 17:26:39 5cae8be 530149f91382[760]: [debug] Container log timestamp flush complete
Jan 27 17:27:03 5cae8be 530149f91382[760]: [api] GET /v1/healthy 200 - 13.050 ms
Jan 27 17:27:18 5cae8be systemd[1]: balena.service: Watchdog timeout (limit 6min)!
Jan 27 17:27:18 5cae8be systemd[1]: balena.service: Killing process 760 (balenad) with signal SIGABRT.
[zwhitchcox] This issue has attached support thread https://jel.ly.fish/944b21dd-44f3-485b-8044-c8d52f4ece74
[robertgzr] This issue has attached support thread https://jel.ly.fish/991e81a6-8f6c-4c07-8822-6583e3d6e459
[phil-d-wilson] This issue has attached support thread https://jel.ly.fish/dace1da4-8685-41cc-9f37-7b25f7336248
[cmfcruz] This issue has attached support thread https://jel.ly.fish/90f5b394-c92a-4aae-9f4a-51cd20c16534
[hades32] This issue has attached support thread https://jel.ly.fish/1d09e57a-0218-4a2d-a2f5-d51aab377dd7