`host-ctr` cli crashes when pulling public ECR image

Open taraspos opened this issue 1 year ago • 11 comments

host-ctr CLI crashes with panic when trying to pull any public ECR image, while private ones work fine.

Image                                                                        Can pull?
328549459982.dkr.ecr.us-east-1.amazonaws.com/bottlerocket-control:v0.7.12   Yes
public.ecr.aws/bottlerocket/bottlerocket-control:v0.7.12                    No (panic)

Image I'm using:

bash-5.1# cat /etc/os-release
NAME=Bottlerocket
ID=bottlerocket
VERSION="1.19.4 (aws-k8s-1.28)"
PRETTY_NAME="Bottlerocket OS 1.19.4 (aws-k8s-1.28)"
VARIANT_ID=aws-k8s-1.28
VERSION_ID=1.19.4
BUILD_ID=4f0a078e
HOME_URL="https://github.com/bottlerocket-os/bottlerocket"
SUPPORT_URL="https://github.com/bottlerocket-os/bottlerocket/discussions"
BUG_REPORT_URL="https://github.com/bottlerocket-os/bottlerocket/issues"
DOCUMENTATION_URL="https://bottlerocket.dev"

What I expected to happen: ECR image is successfully pulled

What actually happened:

Running host-ctr run --source public.ecr.aws/bottlerocket/bottlerocket-control:v0.7.12 --container-id test results in:

time="2024-06-17T12:25:22Z" level=info msg="Image does not exist, proceeding to pull image from source." ref="public.ecr.aws/bottlerocket/bottlerocket-control:v0.7.12"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x557adfa3a83d]

goroutine 1 [running]:
main.withDynamicResolver({0x557ae03822d8?, 0xc0006b1200}, {0x7ffce586eebc, 0x38}, 0x0)
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/cmd/host-ctr/main.go:1150 +0x19d
main.pullImage({0x557ae03822d8, 0xc0006b1200}, {0x7ffce586eebc, 0x38}, 0x38?, {0x0?, 0xc00071d308?}, 0xc00064b1d0)
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/cmd/host-ctr/main.go:1046 +0x39e
main.fetchImage({0x557ae03822d8, 0xc0006b1200}, {0x7ffce586eebc, 0x38}, 0x557adfa46468?, {0x0, 0x0}, 0x0, 0x0?)
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/cmd/host-ctr/main.go:1013 +0x3e7
main.runCtr({0x557adfa8c224, 0x24}, {0x557adfa46468, 0x7}, {0x7ffce586ef04, 0x4}, {0x7ffce586eebc, 0x38}, 0x0, {0x0, ...}, ...)
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/cmd/host-ctr/main.go:299 +0x467
main.App.func1(0xc0004c4000?)
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/cmd/host-ctr/main.go:144 +0x93
github.com/urfave/cli/v2.(*Command).Run(0xc0004c4000, 0xc0004b8d40, {0xc0004bc320, 0x5, 0x5})
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/vendor/github.com/urfave/cli/v2/command.go:279 +0x9dd
github.com/urfave/cli/v2.(*Command).Run(0xc0004c51e0, 0xc0004b8500, {0xc0000401e0, 0x6, 0x6})
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/vendor/github.com/urfave/cli/v2/command.go:272 +0xc2e
github.com/urfave/cli/v2.(*App).RunContext(0xc000156e00, {0x557ae0382268?, 0x557ae10db440}, {0xc0000401e0, 0x6, 0x6})
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/vendor/github.com/urfave/cli/v2/app.go:337 +0x5db
github.com/urfave/cli/v2.(*App).Run(...)
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/vendor/github.com/urfave/cli/v2/app.go:311
main.main()
	/home/builder/rpmbuild/BUILD/bottlerocket-host-ctr-0.0/cmd/host-ctr/main.go:60 +0x3f

https://github.com/bottlerocket-os/bottlerocket/blob/64049ba8364a0a43604ae5d6052c31f3d367dd44/sources/host-ctr/cmd/host-ctr/main.go#L1147-L1152

How to reproduce the problem:

  1. Connect to Bottlerocket node
  2. enter-admin-container
  3. sudo sheltie
  4. host-ctr run --source public.ecr.aws/bottlerocket/bottlerocket-control:v0.7.12 --container-id test

taraspos avatar Jun 17 '24 12:06 taraspos

Thanks for the report (and thanks for the very clear reproduction instructions, in particular).

larvacea avatar Jun 17 '24 15:06 larvacea

Initial triage says:

  • Yes, this reproduces as advertised on our latest release. Not a big surprise, since this code hasn't changed recently, but worth noting.
  • We may not have encountered this earlier because the default URL for this container (at least on my aws-eks variant node) points to a private repository rather than public.ecr.aws.
  • Given the code that is failing here, there's a clear expectation that this should work, and at the very least, not segfault.

larvacea avatar Jun 17 '24 15:06 larvacea

The segfault occurs because the caller has passed a null registryConfig pointer to the victim withDynamicResolver function. The solution seems simple enough (i.e., don't dereference the null pointer). Thanks again for the report.
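To illustrate the kind of guard described above, here is a minimal sketch. It is not the actual host-ctr code: the `RegistryConfig` type, its `Mirrors` field, and the `mirrorsFor` helper are hypothetical names chosen for the example. The only point is that a nil config pointer is treated like an empty config instead of being dereferenced.

```go
package main

import "fmt"

// RegistryConfig is a hypothetical stand-in for host-ctr's registry
// configuration; the real type and its fields may differ.
type RegistryConfig struct {
	Mirrors map[string][]string // registry host -> mirror endpoints
}

// mirrorsFor returns the mirror endpoints configured for host.
// A nil config is treated the same as an empty one, so the function
// never dereferences a nil pointer.
func mirrorsFor(cfg *RegistryConfig, host string) []string {
	if cfg == nil {
		return nil
	}
	// Indexing a nil or empty map is safe in Go and yields a nil slice.
	return cfg.Mirrors[host]
}

func main() {
	// Simulates an invocation without --registry-config: cfg stays nil.
	var cfg *RegistryConfig
	fmt.Println(len(mirrorsFor(cfg, "public.ecr.aws")))

	// Simulates an invocation with a config (mirror URL is made up).
	cfg = &RegistryConfig{Mirrors: map[string][]string{
		"public.ecr.aws": {"https://mirror.example.com"},
	}}
	fmt.Println(mirrorsFor(cfg, "public.ecr.aws"))
}
```

With a guard like this, callers that never load a registry config can pass nil and still get a usable (empty) result.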

larvacea avatar Jun 17 '24 16:06 larvacea

A little more context: the host-ctr executable is invoked by systemd services (see the boot-containers@ and host-containers@ services in package/os). In those service files the service supplies the registry-config option, so host-ctr does not segfault there. If you wish to use host-ctr outside of those services, you can work around this problem by adding --registry-config /dev/null to your own invocation of host-ctr.

larvacea avatar Jun 17 '24 17:06 larvacea

I have verified that settings.host-containers.control.source can be a public ECR URI. For production, you can set this via user data on your worker instances.
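As a rough sketch of what that user data could look like, assuming the usual TOML form of Bottlerocket settings (the image reference is simply the one from this report; adjust it to your needs):

```toml
# Point the control host container at the public ECR image.
[settings.host-containers.control]
source = "public.ecr.aws/bottlerocket/bottlerocket-control:v0.7.12"
```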

larvacea avatar Jun 17 '24 17:06 larvacea

@taraspos did @larvacea's comment resolve your issue?

If you wish to use host-ctr outside of those services, you can work around this problem by adding --registry-config /dev/null to your own invocation of host-ctr.

yeazelm avatar Jun 26 '24 14:06 yeazelm

Hey @yeazelm, yes, using this workaround prevents host-ctr from crashing.

taraspos avatar Jun 27 '24 08:06 taraspos

Awesome! Glad to hear this got you unblocked. I'll resolve this issue then.

yeazelm avatar Jun 27 '24 15:06 yeazelm

I'm not sure resolving the issue is the right approach. Even though the panic in the CLI can be worked around, it still has to be fixed in the long term.

taraspos avatar Jun 27 '24 15:06 taraspos

I'll reopen this then to track fixing the original issue on the panic.

yeazelm avatar Jun 27 '24 17:06 yeazelm

I have a fix progressing through the pipeline. I'll keep this issue updated.

larvacea avatar Jun 27 '24 23:06 larvacea

Fixed in https://github.com/bottlerocket-os/bottlerocket-core-kit/pull/20

bcressey avatar Mar 14 '25 20:03 bcressey