context deadline exceeded

glycerin-ce opened this issue on Aug 03, 2022 • 9 comments

Hi everybody. I'm trying to set up a Kubernetes cluster following the Quickstart Guide. My environment is not so open: my test lab is behind a proxy and uses a custom DNS resolver. While cluster create has an option for the resolver, I cannot understand how to deal with the proxy.

This is what I have once launched:

# talosctl cluster create --wait --nameservers "x.x.x.x"
validating CIDR and reserving IPs
generating PKI and tokens
creating network talos-default
creating master nodes
creating worker nodes
renamed talosconfig context "talos-default" -> "talos-default-23"
waiting for API
bootstrap error: 3 error(s) occurred:
	rpc error: code = DeadlineExceeded desc = context deadline exceeded
	rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing failed to do connect handshake, response: \"HTTP/1.1 502 Connection timed out\\r\\nConnection: close\\r\\nContent-Type: text/html\\r\\n\\r\\n<html><body><h1>502 Connection timed out</h1><p><a href='http://cntlm.sf.net/'>Cntlm</a> proxy failed to complete the request.</p></body></html>\""
	timeout
#  

Taking a look at the Docker logs for the master I see:

2022-08-03T15:26:36.718572189Z time="2022-08-03T15:26:36Z" level=info msg="trying next host" error="failed to do request: Head \"https://ghcr.io/v2/siderolabs/kubelet/manifests/v1.24.2\": dial tcp 140.82.121.34:443: i/o timeout" host=ghcr.io
2022-08-03T15:26:36.719672986Z [talos] 2022/08/03 15:26:36 retrying error: failed to pull image "ghcr.io/siderolabs/kubelet:v1.24.2": failed to resolve reference "ghcr.io/siderolabs/kubelet:v1.24.2": failed to do request: Head "https://ghcr.io/v2/siderolabs/kubelet/manifests/v1.24.2": dial tcp 140.82.121.34:443: i/o timeout

This is caused by the missing proxy settings.

Any hints?

Thanks in advance. Gabriele

glycerin-ce avatar Aug 03 '22 15:08 glycerin-ce

Talos needs explicit proxy settings to be set up, e.g. with a config patch.

talosctl cluster create ... -p '[{"op": "add", "path": "/machine/env", "value": {"http_proxy": "....", "https_proxy": "...."}}]'

This is a tricky one, as you might want to disable the HTTP proxy for in-cluster communication.

smira avatar Aug 03 '22 16:08 smira

You can also work around that by using registry mirrors which themselves handle the HTTP proxy, while Talos pulls via the mirror: https://www.talos.dev/v1.1/talos-guides/configuration/pull-through-cache/
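
Roughly, following the linked guide, the idea is to run pull-through caches yourself and point the cluster at them. A sketch, with placeholder ports, names and proxy address, assuming the registry image honors the standard proxy environment variables:

docker run -d --restart always --name registry-ghcr.io -p 5001:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://ghcr.io \
    -e HTTPS_PROXY=http://proxy.example.com:3128 \
    registry:2

talosctl cluster create --wait --registry-mirror ghcr.io=http://10.5.0.1:5001

Here 10.5.0.1 is the gateway of the default talos-default network, so the nodes reach the cache directly while only the cache goes through the proxy.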

smira avatar Aug 03 '22 16:08 smira

Thanks for your answer @smira, but the "-p" flag is for exposed ports, isn't it? Using the --config-patch flag instead, the download seems to be OK. Nevertheless, the cluster still does not start.

glycerin-ce avatar Aug 08 '22 11:08 glycerin-ce

Yes, it should have been --config-patch. You might also need the env variable no_proxy=10.5.0.0/24 (if using the default CIDRs with talosctl cluster create).
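
Putting both together, a sketch of the full invocation (the proxy address is a placeholder; 10.5.0.0/24 is the default node network of talosctl cluster create, and the usual pod/service defaults 10.244.0.0/16 and 10.96.0.0/12 may also need excluding depending on the environment):

talosctl cluster create --wait \
    --config-patch '[{"op": "add", "path": "/machine/env", "value": {"http_proxy": "http://proxy.example.com:3128", "https_proxy": "http://proxy.example.com:3128", "no_proxy": "localhost,127.0.0.1,10.5.0.0/24,10.244.0.0/16,10.96.0.0/12"}}]'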

smira avatar Aug 08 '22 11:08 smira

Thanks once again, @smira. The proxy part now seems fine.

glycerin-ce avatar Aug 08 '22 13:08 glycerin-ce

Sorry @smira, can I ask whether you can figure out what is wrong?

In the logs I see the following:

2022-08-08T14:10:31.449290358Z [talos] 2022/08/08 14:10:31 service[etcd](Failed): Failed to run pre stage: failed to pull image "gcr.io/etcd-development/etcd:v3.5.4": 1 error(s) occurred:
2022-08-08T14:10:31.449302761Z 	failed to pull image "gcr.io/etcd-development/etcd:v3.5.4": context canceled


But this image is downloadable via Docker.

What should I debug for this problem? Thanks in advance. Best regards.

glycerin-ce avatar Aug 08 '22 14:08 glycerin-ce

These last messages are completely fine; they are printed when the bootstrap aborts an image pull, but the pull should keep going in the background after the bootstrap.
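
One way to confirm the pull eventually finishes is to check the etcd service on the node afterwards (10.5.0.2 is the default first control plane address of the docker provisioner):

talosctl -n 10.5.0.2 service etcd
# without an ID it lists all services and their health:
talosctl -n 10.5.0.2 service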

smira avatar Aug 08 '22 18:08 smira

OK @smira, but in the end the cluster isn't up. It exits with this message:

waiting for etcd to be healthy: OK
◲ waiting for etcd members to be consistent across nodes: rpc error: code = DeadlineExceeded desc = context deadline exceeded
context deadline exceeded
:~#

And that image isn't visible via the docker command when I search for it:

:~# talosctl images
ghcr.io/siderolabs/flannel:v0.18.1
ghcr.io/siderolabs/install-cni:v1.1.0-2-gcb03a5d
docker.io/coredns/coredns:1.9.3
gcr.io/etcd-development/etcd:v3.5.4
k8s.gcr.io/kube-apiserver:v1.24.2
k8s.gcr.io/kube-controller-manager:v1.24.2
k8s.gcr.io/kube-scheduler:v1.24.2
k8s.gcr.io/kube-proxy:v1.24.2
ghcr.io/siderolabs/kubelet:v1.24.2
ghcr.io/siderolabs/installer:v1.1.1
k8s.gcr.io/pause:3.6
:~# docker images
REPOSITORY                 TAG       IMAGE ID       CREATED       SIZE
ghcr.io/siderolabs/talos   v1.1.1    648080035f8c   3 weeks ago   174MB
:~#

glycerin-ce avatar Aug 08 '22 18:08 glycerin-ce

Talos doesn't use Docker to pull images; the issue should be visible in docker logs talos-default-master-N. It might be timing out or hitting some other issue.

It looks like it passed the basic bootstrap; talosctl -n 10.5.0.2 etcd members might help.
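
For example (the container name talos-default-master-1 is an assumption; docker ps shows the exact names):

docker ps --format '{{.Names}}'
docker logs talos-default-master-1 2>&1 | grep -iE 'error|retrying|timeout'
talosctl -n 10.5.0.2 etcd members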

smira avatar Aug 08 '22 19:08 smira

Hi @smira, sorry for the late feedback. If I try to query that node as you suggested, I receive a connection timeout. What can I do to debug this problem?

# talosctl -n 10.5.0.2 etcd members 
error getting members: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing failed to do connect handshake, response: \"HTTP/1.1 502 Connection timed out\\r\\nConnection: close\\r\\nContent-Type: text/html\\r\\n\\r\\n<html><body><h1>502 Connection timed out</h1><p><a href='http://cntlm.sf.net/'>Cntlm</a> proxy failed to complete the request.</p></body></html>\""
# 

By the way, to create the cluster I use these CLI parameters:

# export no_proxy="localhost,127.0.0.1,10.5.0.0/24"; talosctl cluster create --wait --dns-domain "163.162.4.70" --nameservers "163.162.4.70" --config-patch '[{"op": "add", "path": "/machine/env", "value": {"http_proxy": "http://163.162.95.56:3128", "https_proxy": "http://163.162.95.56:3128"}}]'
validating CIDR and reserving IPs
generating PKI and tokens
creating network talos-default
creating master nodes
creating worker nodes
renamed talosconfig context "talos-default" -> "talos-default-64"
waiting for API
bootstrapping cluster
waiting for etcd to be healthy: OK
◱ waiting for etcd members to be consistent across nodes: rpc error: code = DeadlineExceeded desc = context deadline exceeded
context deadline exceeded
# 

But in the end it seems to fail. Any idea why? Thanks in advance.

glycerin-ce avatar Sep 12 '22 16:09 glycerin-ce

You probably need to keep no_proxy=... for the talosctl calls as well, as talosctl tries to go via your proxy, which you don't want. Alternatively, you can unset the http_proxy and https_proxy environment variables.
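
For example, scoping the exclusion to a single invocation (the CIDR matches the default talos-default network):

no_proxy="10.5.0.0/24" talosctl -n 10.5.0.2 etcd members
# or drop the proxy for the whole shell session:
unset http_proxy https_proxy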

smira avatar Sep 12 '22 18:09 smira

You're right @smira. Setting the no_proxy env variable I get:

# talosctl -n 10.5.0.2 etcd members 
NODE       ID                 HOSTNAME                       PEER URLS               CLIENT URLS             LEARNER
10.5.0.2   c3d3020cf75b8728   talos-default-controlplane-1   https://10.5.0.2:2380   https://10.5.0.2:2379   false
# 

glycerin-ce avatar Sep 19 '22 14:09 glycerin-ce

Hi all. I've modified the talosctl cluster create invocation in this way:

...
talosctl cluster create --wait --nameservers "X.Y.X.Y" --config-patch '[{"op": "add", "path": "/machine/env", "value": {"http_proxy": "http://X.X.X.X:xx", "https_proxy": "https://X.X.X.X:xx", "no_proxy": "localhost,127.0.0.1,10.5.0.0/24,0.0.0.0"}}]'
...

and the process seems to complete. This is the output:

~# validating CIDR and reserving IPs
generating PKI and tokens
creating network talos-default
creating controlplane nodes
creating worker nodes
renamed talosconfig context "talos-default" -> "talos-default-83"
waiting for API
bootstrapping cluster
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: OK
waiting for kubelet to be healthy: OK
waiting for all nodes to finish boot sequence: OK
waiting for all k8s nodes to report: OK
waiting for all k8s nodes to report ready: OK
waiting for all control plane components to be ready: OK
waiting for kube-proxy to report ready: OK
waiting for coredns to report ready: OK
waiting for all k8s nodes to report schedulable: OK

merging kubeconfig into "/root/.kube/config"
renamed cluster "talos-default" -> "talos-default-1"
renamed auth info "admin@talos-default" -> "admin@talos-default-1"
renamed context "admin@talos-default" -> "admin@talos-default-1"
PROVISIONER       docker
NAME              talos-default
NETWORK NAME      talos-default
NETWORK CIDR      10.5.0.0/24
NETWORK GATEWAY   10.5.0.1
NETWORK MTU       1500

NODES:

NAME                            TYPE           IP         CPU    RAM      DISK
/talos-default-controlplane-1   controlplane   10.5.0.2   2.00   2.1 GB   -
/talos-default-worker-1         worker         10.5.0.3   2.00   2.1 GB   -
~# 

glycerin-ce avatar Sep 19 '22 16:09 glycerin-ce

Obviously, whenever I use the talosctl command for a query or any other operation I have to set the no_proxy environment variable. But is there a better way to set this variable without changing the user's behavior?

glycerin-ce avatar Sep 19 '22 17:09 glycerin-ce

That's a question about your environment, not really about Talos.

You can configure no_proxy to skip private CIDRs, but this depends on your network environment.
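
For instance, the exclusion could be made persistent for interactive shells so it no longer has to be set per command (the profile path and CIDR list are assumptions about this particular environment):

echo 'export no_proxy="localhost,127.0.0.1,10.5.0.0/24"' >> ~/.bashrc
echo 'export NO_PROXY="localhost,127.0.0.1,10.5.0.0/24"' >> ~/.bashrc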

smira avatar Sep 19 '22 19:09 smira

Yes, you're right @smira, I apologize. I was asking for advice rather than for support.

Thanks. Gabriele

glycerin-ce avatar Sep 21 '22 08:09 glycerin-ce

Okay, thanks. I'm going to close this one as we seem to have a solution.

smira avatar Sep 21 '22 10:09 smira