
building node images requires public internet access and doesn't error if pre-pulling images fails

Open tao12345666333 opened this issue 2 years ago • 18 comments

What happened:

The control plane did not start normally.

...
I1206 09:26:24.270020     140 round_trippers.go:553] GET https://v123-control-plane:6443/healthz?timeout=10s  in 0 milliseconds
couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go:118
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:234
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/init.go:153
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:856
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:974
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:50
main.main
        _output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
        /usr/local/go/src/runtime/proc.go:255
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1581 
error execution phase wait-control-plane
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:235
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/init.go:153
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:856
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:974
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:50
main.main
        _output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
        /usr/local/go/src/runtime/proc.go:255
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1581 

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

I have pushed the image to ghcr.io/tao12345666333/kind/node:v1.23.0-rc.1

You can just run kind create cluster --image='ghcr.io/tao12345666333/kind/node:v1.23.0-rc.1'

Anything else we need to know?:

Environment:

  • kind version: (use kind version): kind v0.12.0-alpha+92e01d72276af8 go1.17.3 linux/amd64
  • Kubernetes version: (use kubectl version): v1.23.0-rc.1
  • Docker version: (use docker info): 0.0.0-20211015105956-46f8c8b
  • OS (e.g. from /etc/os-release): Fedora 34

tao12345666333 avatar Dec 06 '21 09:12 tao12345666333

@tao12345666333 kind create cluster --image='ghcr.io/tao12345666333/kind/node:v1.23.0-rc.1' --name=test works fine with kind version 0.12.0-alpha+970c48c04761df on my machine

I suspect the issue isn't that kind doesn't work with Kubernetes v1.23.0-rc.1, as we've been part of Kubernetes's upstream CI.

I would guess your custom-built image is suffering from https://github.com/kubernetes-sigs/kind/issues/2493 and that this is a dupe.

BenTheElder avatar Dec 06 '21 19:12 BenTheElder

kind create cluster --image='ghcr.io/tao12345666333/kind/node:v1.23.0-rc.1'
Creating cluster "kind" ...
 ✓ Ensuring node image (ghcr.io/tao12345666333/kind/node:v1.23.0-rc.1) 🖼️
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Have a nice day! 👋
$ kind version
kind v0.12.0-alpha+96d954b38fc601 go1.16.4 linux/amd64

aojea avatar Dec 06 '21 19:12 aojea

Sorry for the late reply. I created a cloud instance and it does indeed work there.

This is the same as #2493. I saw the following in the containerd log:

Dec 07 04:25:40 v123-control-plane containerd[102]: time="2021-12-07T04:25:40.655772227Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:etcd-v123-control-plane,Uid:de9c0b18fb0053cf8aa0ccc3ca47c216,Namespace:kube-system,Attempt:0,} failed, error" error="failed to get sandbox image \"k8s.gcr.io/pause:3.6\": failed to pull image \"k8s.gcr.io/pause:3.6\": failed to pull and unpack image \"k8s.gcr.io/pause:3.6\": failed to resolve reference \"k8s.gcr.io/pause:3.6\": failed to do request: Head \"https://k8s.gcr.io/v2/pause/manifests/3.6\": dial tcp 142.250.157.82:443: i/o timeout"

Then I compared the image that I built with the image that kind has published. The contents are very different, and I want to confirm the reason.

  • v1.22.4 image:
root@kind-control-plane:/# ctr -n k8s.io i ls  |grep k8s.gcr.io    | awk '{print $1}'
k8s.gcr.io/build-image/debian-base:buster-v1.7.2
k8s.gcr.io/coredns/coredns:v1.8.4
k8s.gcr.io/etcd:3.5.0-0
k8s.gcr.io/kube-apiserver:v1.22.4
k8s.gcr.io/kube-controller-manager:v1.22.4
k8s.gcr.io/kube-proxy:v1.22.4
k8s.gcr.io/kube-scheduler:v1.22.4
k8s.gcr.io/pause:3.6
  • self-built image:
root@v123-control-plane:/# ctr -n k8s.io i ls  |grep k8s.gcr.io    | awk '{print $1}'
k8s.gcr.io/kube-apiserver:v1.23.0-rc.1
k8s.gcr.io/kube-controller-manager:v1.23.0-rc.1
k8s.gcr.io/kube-proxy:v1.23.0-rc.1
k8s.gcr.io/kube-scheduler:v1.23.0-rc.1

I also checked the build script but did not find the reason: https://github.com/kubernetes-sigs/kind/blob/main/hack/release/build/push-node.sh

/remove-label bug /close

tao12345666333 avatar Dec 07 '21 04:12 tao12345666333

@tao12345666333: The label(s) /remove-label bug cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 07 '21 04:12 k8s-ci-robot

@tao12345666333: Closing this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 07 '21 04:12 k8s-ci-robot

/reopen

tao12345666333 avatar Dec 07 '21 04:12 tao12345666333

@tao12345666333: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 07 '21 04:12 k8s-ci-robot

Okay, I found the reason: the pull errors are ignored here

https://github.com/kubernetes-sigs/kind/blob/92e01d72276af89532767f612b1ca7af479c1506/pkg/build/nodeimage/buildcontext.go#L294-L312

I wonder if we can add an option to kind build node-image, such as --strict-check, because some users may need to build a node image that can be used entirely in an offline environment (or in regions where gcr.io cannot be accessed, e.g. China).
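
For illustration only, here is a rough sketch of what surfacing those pre-pull failures could look like; the imagePuller interface is hypothetical and this is not the actual code at the linked lines:

package nodeimage

import (
	"context"
	"fmt"
	"strings"
)

// imagePuller is a hypothetical abstraction over pulling an image inside the
// build container; it is not part of the kind codebase.
type imagePuller interface {
	Pull(ctx context.Context, image string) error
}

// prePullImages pulls the required images and, unlike the ignored-error
// behavior described above, reports every failure back to the caller so the
// node image build fails loudly instead of silently missing images.
func prePullImages(ctx context.Context, puller imagePuller, images []string) error {
	var failures []string
	for _, image := range images {
		if err := puller.Pull(ctx, image); err != nil {
			failures = append(failures, fmt.Sprintf("%s: %v", image, err))
		}
	}
	if len(failures) > 0 {
		return fmt.Errorf("failed to pre-pull images:\n%s", strings.Join(failures, "\n"))
	}
	return nil
}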

tao12345666333 avatar Dec 07 '21 05:12 tao12345666333

Building in a completely offline environment is currently not possible without using a proxy to intercept these image pulls; we must pull images in order to support cross-platform builds. You can see more in the comments on the implementation PR.

I would prefer not to add another branching flag to plumb through here; we should just make the default behavior correct. We currently ignore errors because the command will always error on unpacking (which is expected). If a --no-unpack option is added in containerd/ctr, like we did with image loading, we can leverage that, or we can try to parse the error output when it fails (I prefer the former ...). I never got around to adding this, and we needed to unblock multi-arch support.
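
Purely as a sketch of the "parse the error output" fallback (the substrings checked here are assumptions for illustration, not documented ctr messages):

package nodeimage

import "strings"

// isOnlyExpectedUnpackError reports whether a failed pull command failed only
// because of the expected unpack step, rather than a real pull problem such
// as a registry timeout. The matched substrings are illustrative assumptions.
func isOnlyExpectedUnpackError(combinedOutput string, runErr error) bool {
	if runErr == nil {
		return true
	}
	return strings.Contains(combinedOutput, "unpack") &&
		!strings.Contains(combinedOutput, "failed to resolve reference")
}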

BenTheElder avatar Dec 07 '21 19:12 BenTheElder

I don't mean building in a completely offline environment.

Instead, I mean building in an environment where a proxy is available, and then using the resulting node image in an offline environment. That way, the node image that gets built contains all of the required images.

(The reason for this error may be that the proxy was temporarily unavailable while the node image was being built.)

tao12345666333 avatar Dec 08 '21 04:12 tao12345666333

#2493 covers the proxy issue and requires a simple fix.

For handling the error: we should not need a flag, because if you added the flag right now it would always error. If we fixed the error detection (or, better yet, prevented the error with the no-unpack option mentioned above and in the code comments), then we still wouldn't need the flag, because we could just always "strict" error. Right now it will always error on unpacking.

BenTheElder avatar Dec 08 '21 05:12 BenTheElder

thank you for the explanation. 👍

tao12345666333 avatar Dec 08 '21 05:12 tao12345666333

I forgot to add:

We don't want it to unpack, and we know it will error if it tries to (which it will).

Previously I patched ctr images import to support skipping unpacking so we could stop ignoring the error when side-loading images. Currently, since multi-arch support, we depend on letting containerd do the pulling instead of side-loading for images that are not built when building Kubernetes (pause, etcd, ...). We used to pull those to the host with docker pull and then get the contents loaded with docker save, which neatly solves a lot of issues, including proxy issues, but you cannot guarantee that a docker save is for any particular architecture (there's no flag for this; docker isn't designed for it), so instead we pull from the containerd running for the target architecture directly.

We only need it to store the image layers / metadata and defer unpacking to runtime. But the pull command doesn't have similar support for skipping unpacking, and I haven't had time to look into getting a feature like this upstream.
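
For reference, the containerd Go client can already express this concept: client.Pull only unpacks when containerd.WithPullUnpack is passed. kind shells out to ctr inside the node container rather than using this client, so the sketch below is only an illustration (the socket path and platform are example values):

package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	// Connect to containerd and use the same namespace the node images use.
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	// Without containerd.WithPullUnpack, only the layers and metadata are
	// stored; unpacking is deferred until the image is actually run.
	_, err = client.Pull(ctx, "k8s.gcr.io/pause:3.6",
		containerd.WithPlatform("linux/arm64"))
	if err != nil {
		log.Fatal(err)
	}
}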

We could alternatively try to parse the error .... :/, or we could pull via some other mechanism and go back to side-loading.

We also need to do #2493 and ensure that during the build we plumb the proxy info through to the node image build container; that should be a much simpler fix, and probably sufficient for most users impacted by this.

BenTheElder avatar Dec 15 '21 23:12 BenTheElder

I believe this regression has existed in prior releases, but I'd like to see this fixed in v0.13. It should not be a difficult patch just to fix the proxy part, and I think that's the important fix.

BenTheElder avatar Feb 04 '22 22:02 BenTheElder

/help this should be pretty easy to patch, the proxy info needs to be propagated to the build

BenTheElder avatar Apr 22 '22 16:04 BenTheElder

@BenTheElder: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help this should be pretty easy to patch, the proxy info needs to be propagated to the build

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 22 '22 16:04 k8s-ci-robot

If no one picks this up, I can handle it.

Maybe in a week or so.

tao12345666333 avatar Apr 23 '22 13:04 tao12345666333

/assign

tao12345666333 avatar Apr 23 '22 15:04 tao12345666333

"doesn't error" should no longer be the case. #3162

BenTheElder avatar Apr 18 '23 05:04 BenTheElder

Passing through the proxy still needs doing. Should be a relatively small patch.
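
To make the ask concrete, here is a minimal sketch of the kind of change needed, assuming a hypothetical helper that assembles the arguments for the build container (the names below are illustrative, not kind's actual API): collect the host's proxy variables and pass them into the container that runs the image pulls.

package nodeimage

import (
	"os"
	"strings"
)

// proxyEnv collects proxy-related variables from the host environment so they
// can be forwarded to the node image build container. Both upper- and
// lower-case forms are checked, following common proxy conventions.
func proxyEnv() []string {
	keys := []string{
		"HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY",
		"http_proxy", "https_proxy", "no_proxy",
	}
	var env []string
	for _, k := range keys {
		if v, ok := os.LookupEnv(k); ok && strings.TrimSpace(v) != "" {
			env = append(env, k+"="+v)
		}
	}
	return env
}

// buildContainerArgs is a hypothetical call site: append the collected
// variables as --env flags when starting the build container.
func buildContainerArgs(baseArgs []string) []string {
	args := append([]string{}, baseArgs...)
	for _, e := range proxyEnv() {
		args = append(args, "--env", e)
	}
	return args
}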

BenTheElder avatar Apr 19 '23 15:04 BenTheElder

@BenTheElder I'm very willing to help, but I'm not sure what needs to be done here. Could you please add a bit more description if convenient? :)

z1cheng avatar Jul 06 '23 08:07 z1cheng