balena-engine icon indicating copy to clipboard operation
balena-engine copied to clipboard

Healthdog checks time out during image pulls

Open alexgg opened this issue 5 years ago • 12 comments

Description

Running:

  • Device type: Raspberry Pi 4 (using 64bit OS) (BETA)
  • OS version: balenaOS 2.41.0+rev4
  • Supervisor version: 10.2.2

Trying to deploy the resin-electrons ( buster-rpi4-support branch) the update gets to ~90% on the device then just restarts, this has happened 4/5 times now.

There is a balena-engine crash captured in the logs attached in the support thread.

Output of balena-engine version:

Docker version 18.09.8-dev, build 80d443d400dcc85e87322f72866593b46bafb157

Output of balena-engine info:

Containers: 2
 Running: 2
 Paused: 0
 Stopped: 0
Images: 3
Server Version: 18.09.8-dev
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: journald
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan null
 Log: journald json-file local
Swarm:
 NodeID:
 Is Manager: false
 Node Address:
Runtimes: bare runc
Default Runtime: runc
Init Binary: balena-engine-init
containerd version:
runc version: b3d5036b6520060db00c81352d533a5dd919aaaf
init version: 949e6fa-dirty (expected: fec3683b971d9)
Kernel Version: 4.19.66
Operating System: balenaOS 2.41.0+rev4
OSType: linux
Architecture: aarch64
CPUs: 4
Total Memory: 3.406GiB
Name: 06bdf8c
ID: E6H3:W2FD:A5TN:E5K2:LYTQ:KCLO:VOAK:XRRO:RYRG:W2HM:CA5D:6ZBQ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Diagnostics file is attached in the support thread.

alexgg avatar Dec 13 '19 18:12 alexgg

[alexgg] This issue has attached support thread https://jel.ly.fish/fcee62fb-8f42-4cb3-af75-0056ee98fb96

balena-ci avatar Dec 13 '19 18:12 balena-ci

related to balena-os/meta-balena#1676

we see the health-check taking longer when we simultaneously pulling the user image (during application updates), which leads to the engine being killed by healthdog, which in turn interrupts the update

robertgzr avatar Dec 16 '19 12:12 robertgzr

@robertgzr From the log:

Dec 13 17:56:47 06bdf8c balenad[818]: goroutine 1 [chan receive, 12 minutes]:
Dec 13 17:56:47 06bdf8c balenad[818]: github.com/docker/docker/cmd/dockerd.(*DaemonCli).start(0x44207c68a0, 0x4420401440, 0x0, 0x0)
Dec 13 17:56:47 06bdf8c balenad[818]:         /yocto/resin-board/build/tmp/work/aarch64-poky-linux/balena/18.09.8-dev+git80d443d400dcc85e87322f72866593b46bafb157-r0/git/src/import/.gopath/src/github.com/docker/docker/cmd/dockerd/daemon.go:222 +0x8ac
Dec 13 17:56:47 06bdf8c balenad[818]: github.com/docker/docker/cmd/dockerd.runDaemon(0x4420401440, 0x442043aa00, 0x442063bbc8)
Dec 13 17:56:47 06bdf8c balenad[818]:         /yocto/resin-board/build/tmp/work/aarch64-poky-linux/balena/18.09.8-dev+git80d443d400dcc85e87322f72866593b46bafb157-r0/git/src/import/.gopath/src/github.com/docker/docker/cmd/dockerd/docker_unix.go:7 +0x38
Dec 13 17:56:47 06bdf8c balenad[818]: github.com/docker/docker/cmd/dockerd.newDaemonCommand.func1(0x4420439180, 0x442043ad00, 0x0, 0x10, 0x0, 0x0)
Dec 13 17:56:47 06bdf8c balenad[818]:         /yocto/resin-board/build/tmp/work/aarch64-poky-linux/balena/18.09.8-dev+git80d443d400dcc85e87322f72866593b46bafb157-r0/git/src/import/.gopath/src/github.com/docker/docker/cmd/dockerd/docker.go:29 +0x54

Does it mean that balena start command hangs for 12 minutes?

roman-mazur avatar Dec 16 '19 19:12 roman-mazur

Looks like this is a goroutine from the daemon. To be specific the part that blocks until the server exits: https://github.com/balena-os/balena-engine/blob/master/cmd/dockerd/daemon.go#L222

robertgzr avatar Dec 17 '19 11:12 robertgzr

[roman-mazur] This issue has attached support thread https://jel.ly.fish/#/0465ec1c-6bfd-41de-bcef-944150d7a71a

jellyfish-bot avatar Apr 30 '20 15:04 jellyfish-bot

I have a similar issue whereby building images in local mode from balena cli uses so much resource/time that the engine is killed by supervisor

tom-boothman avatar Jan 27 '21 11:01 tom-boothman

Incase it adds any value, I watched balena failing to build an image and timed the healthcheck whilst it was going.

I ran the first command just after it started building the final layer (the big one), the second one a little after the first command finished (it failed because the docker engine was killed). The final layer build started at about 17:22.

root@5cae8be:~# time /usr/lib/balena/balena-healthcheck && echo 'ok'

real	1m23.806s
user	0m0.565s
sys	0m0.248s
ok
root@5cae8be:~# time /usr/lib/balena/balena-healthcheck && echo 'ok'
balena: error during connect: Post http://%2Fvar%2Frun%2Fbalena-engine.sock/v1.39/containers/create: EOF.
See 'balena run --help'.

real	3m43.380s
user	0m0.586s
sys	0m0.183s

journal logs:

Jan 27 17:18:19 5cae8be 530149f91382[760]: [api]     GET /v2/version 200 - 23.318 ms
Jan 27 17:18:21 5cae8be 530149f91382[760]: [api]     GET /v2/local/device-info 200 - 24.049 ms
Jan 27 17:18:26 5cae8be balenad[760]: time="2021-01-27T17:18:26.881821637Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/75e4e08ef041dbad176cb12660bf67f85eb7c8704981ed7186ba606ebfb0aca1/shim.sock debug=false pid=5979
Jan 27 17:19:06 5cae8be balenad[760]: time="2021-01-27T17:19:06.200042498Z" level=info msg="shim reaped" id=75e4e08ef041dbad176cb12660bf67f85eb7c8704981ed7186ba606ebfb0aca1
Jan 27 17:19:06 5cae8be balenad[760]: time="2021-01-27T17:19:06.204545913Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
^[[DJan 27 17:19:58 5cae8be balenad[760]: time="2021-01-27T17:19:58.546626153Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/5a34fc269ae4332b67f2c0993c14972e443760408d426989c1da5c9a2eb28d91/shim.sock debug=false pid=6138
Jan 27 17:20:12 5cae8be balenad[760]: time="2021-01-27T17:20:12.732171921Z" level=info msg="shim reaped" id=5a34fc269ae4332b67f2c0993c14972e443760408d426989c1da5c9a2eb28d91
Jan 27 17:20:12 5cae8be balenad[760]: time="2021-01-27T17:20:12.736371535Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 27 17:21:06 5cae8be balenad[760]: time="2021-01-27T17:21:06.792391659Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/fc6e95269d87f479b26802c5935d6c8f32462563ba0b0cf25ad30b5c74e91018/shim.sock debug=false pid=6310
Jan 27 17:21:13 5cae8be balenad[760]: time="2021-01-27T17:21:13.117077178Z" level=info msg="shim reaped" id=fc6e95269d87f479b26802c5935d6c8f32462563ba0b0cf25ad30b5c74e91018
Jan 27 17:21:13 5cae8be balenad[760]: time="2021-01-27T17:21:13.121459552Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 27 17:21:14 5cae8be balenad[760]: time="2021-01-27T17:21:14.401231342Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/23efa581c61686255a7d5a5fde03679520d54bc7d406717fa25a3ce68169b39a/shim.sock debug=false pid=6395
Jan 27 17:21:18 5cae8be balenad[760]: time="2021-01-27T17:21:18.786080188Z" level=info msg="shim reaped" id=23efa581c61686255a7d5a5fde03679520d54bc7d406717fa25a3ce68169b39a
Jan 27 17:21:18 5cae8be balenad[760]: time="2021-01-27T17:21:18.789211370Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 27 17:21:49 5cae8be balenad[760]: time="2021-01-27T17:21:49.389093294Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/18c16b4988b1371279c4255ab8f7940ff7334b95400861965ea3e825ba833cc9/shim.sock debug=false pid=6538
Jan 27 17:21:55 5cae8be balenad[760]: time="2021-01-27T17:21:55.773499713Z" level=info msg="shim reaped" id=18c16b4988b1371279c4255ab8f7940ff7334b95400861965ea3e825ba833cc9
Jan 27 17:21:55 5cae8be balenad[760]: time="2021-01-27T17:21:55.777730004Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 27 17:22:01 5cae8be 530149f91382[760]: [api]     GET /v1/healthy 200 - 9.276 ms
Jan 27 17:22:04 5cae8be balenad[760]: time="2021-01-27T17:22:04.298568314Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/b3c0e3c166b87c0743adf483393e4d225dde8ed6fcb7583a10b19e4a1f5124af/shim.sock debug=false pid=6727
Jan 27 17:22:49 5cae8be balenad[760]: time="2021-01-27T17:22:49.621108012Z" level=info msg="shim balena-engine-containerd-shim started" address=/containerd-shim/moby/deec5dfb47be1ece7f905b0643a386eea6ddb5e1d5f2f116040301e1f09fbe31/shim.sock debug=false pid=6838
Jan 27 17:22:54 5cae8be balenad[760]: time="2021-01-27T17:22:54.195652422Z" level=info msg="shim reaped" id=deec5dfb47be1ece7f905b0643a386eea6ddb5e1d5f2f116040301e1f09fbe31
Jan 27 17:22:54 5cae8be balenad[760]: time="2021-01-27T17:22:54.198851729Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jan 27 17:26:39 5cae8be 530149f91382[760]: [debug]   Attempting container log timestamp flush...
Jan 27 17:26:39 5cae8be 530149f91382[760]: [debug]   Container log timestamp flush complete
Jan 27 17:27:03 5cae8be 530149f91382[760]: [api]     GET /v1/healthy 200 - 13.050 ms
Jan 27 17:27:18 5cae8be systemd[1]: balena.service: Watchdog timeout (limit 6min)!
Jan 27 17:27:18 5cae8be systemd[1]: balena.service: Killing process 760 (balenad) with signal SIGABRT.

tom-boothman avatar Jan 27 '21 17:01 tom-boothman

[zwhitchcox] This issue has attached support thread https://jel.ly.fish/944b21dd-44f3-485b-8044-c8d52f4ece74

jellyfish-bot avatar Apr 28 '21 22:04 jellyfish-bot

[robertgzr] This issue has attached support thread https://jel.ly.fish/991e81a6-8f6c-4c07-8822-6583e3d6e459

jellyfish-bot avatar May 17 '21 09:05 jellyfish-bot

[phil-d-wilson] This issue has attached support thread https://jel.ly.fish/dace1da4-8685-41cc-9f37-7b25f7336248

jellyfish-bot avatar Jul 27 '21 15:07 jellyfish-bot

[cmfcruz] This issue has attached support thread https://jel.ly.fish/90f5b394-c92a-4aae-9f4a-51cd20c16534

jellyfish-bot avatar Jul 30 '21 13:07 jellyfish-bot

[hades32] This issue has attached support thread https://jel.ly.fish/1d09e57a-0218-4a2d-a2f5-d51aab377dd7

jellyfish-bot avatar Nov 26 '21 15:11 jellyfish-bot