context deadline exceeded

glycerin-ce opened this issue on Aug 03, 2022 • 9 comments

Hi everybody. I'm trying to set up a Kubernetes cluster following the Quickstart Guide. My environment is not so open: my test lab is behind a proxy and uses a custom DNS resolver. While cluster create has an option for the resolver, I cannot understand how to deal with the proxy.

This is what I have once launched:

# talosctl cluster create --wait --nameservers "x.x.x.x"
validating CIDR and reserving IPs
generating PKI and tokens
creating network talos-default
creating master nodes
creating worker nodes
renamed talosconfig context "talos-default" -> "talos-default-23"
waiting for API
bootstrap error: 3 error(s) occurred:
	rpc error: code = DeadlineExceeded desc = context deadline exceeded
	rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing failed to do connect handshake, response: \"HTTP/1.1 502 Connection timed out\\r\\nConnection: close\\r\\nContent-Type: text/html\\r\\n\\r\\n<html><body><h1>502 Connection timed out</h1><p><a href='http://cntlm.sf.net/'>Cntlm</a> proxy failed to complete the request.</p></body></html>\""
	timeout
#  

Taking a look at the Docker logs for the master I see:

2022-08-03T15:26:36.718572189Z time="2022-08-03T15:26:36Z" level=info msg="trying next host" error="failed to do request: Head \"https://ghcr.io/v2/siderolabs/kubelet/manifests/v1.24.2\": dial tcp 140.82.121.34:443: i/o timeout" host=ghcr.io
2022-08-03T15:26:36.719672986Z [talos] 2022/08/03 15:26:36 retrying error: failed to pull image "ghcr.io/siderolabs/kubelet:v1.24.2": failed to resolve reference "ghcr.io/siderolabs/kubelet:v1.24.2": failed to do request: Head "https://ghcr.io/v2/siderolabs/kubelet/manifests/v1.24.2": dial tcp 140.82.121.34:443: i/o timeout

This is caused by the missing proxy settings.

Any hints?

Thanks in advance. Gabriele

glycerin-ce avatar Aug 03 '22 15:08 glycerin-ce

Talos needs explicit proxy settings to be set up, e.g. with a config patch.

talosctl cluster create ... -p '[{"op": "add", "path": "/machine/env", "value": {"http_proxy": "....", "https_proxy": "...."}}]'

This is a tricky one, as you might want to disable the HTTP proxy for in-cluster communication.

smira avatar Aug 03 '22 16:08 smira

You can also work around that by using registry mirrors which themselves handle the HTTP proxy, while Talos pulls via the mirror: https://www.talos.dev/v1.1/talos-guides/configuration/pull-through-cache/
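
Roughly, following the linked guide, the idea is to run pull-through caches yourself and point the cluster at them. A sketch, with placeholder ports, names and proxy address, assuming the registry image honors the standard proxy environment variables:

docker run -d --restart always --name registry-ghcr.io -p 5001:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://ghcr.io \
    -e HTTPS_PROXY=http://proxy.example.com:3128 \
    registry:2

talosctl cluster create --wait --registry-mirror ghcr.io=http://10.5.0.1:5001

Here 10.5.0.1 is the gateway of the default talos-default network, so the nodes reach the cache directly while only the cache goes through the proxy.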

smira avatar Aug 03 '22 16:08 smira

Thanks for your answer @smira, but the "-p" flag is for exposed ports, isn't it? Using the --config-patch flag instead, the download seems to be OK. Nevertheless, the cluster still does not start.

glycerin-ce avatar Aug 08 '22 11:08 glycerin-ce

Yes, it should have been --config-patch. You might also need the env variable no_proxy=10.5.0.0/24 (if using the default CIDRs with talosctl cluster create).
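
Putting both together, a sketch of the full invocation (the proxy address is a placeholder; 10.5.0.0/24 is the default node network of talosctl cluster create, and the usual pod/service defaults 10.244.0.0/16 and 10.96.0.0/12 may also need excluding depending on the environment):

talosctl cluster create --wait \
    --config-patch '[{"op": "add", "path": "/machine/env", "value": {"http_proxy": "http://proxy.example.com:3128", "https_proxy": "http://proxy.example.com:3128", "no_proxy": "localhost,127.0.0.1,10.5.0.0/24,10.244.0.0/16,10.96.0.0/12"}}]'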

smira avatar Aug 08 '22 11:08 smira

Thanks once again, @smira. The proxy part now seems fine.

glycerin-ce avatar Aug 08 '22 13:08 glycerin-ce

Sorry @smira, can I ask whether you can figure out what is wrong?

In the logs I see the following:

2022-08-08T14:10:31.449290358Z [talos] 2022/08/08 14:10:31 service[etcd](Failed): Failed to run pre stage: failed to pull image "gcr.io/etcd-development/etcd:v3.5.4": 1 error(s) occurred:
2022-08-08T14:10:31.449302761Z 	failed to pull image "gcr.io/etcd-development/etcd:v3.5.4": context canceled


But this image is downloadable via Docker.

What should I debug for this problem? Thanks in advance. Best regards.

glycerin-ce avatar Aug 08 '22 14:08 glycerin-ce

These last messages are completely fine; they are printed when the bootstrap aborts an image pull, but the pull should keep going in the background after the bootstrap.
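
One way to confirm the pull eventually finishes is to check the etcd service on the node afterwards (10.5.0.2 is the default first control plane address of the docker provisioner):

talosctl -n 10.5.0.2 service etcd
# without an ID it lists all services and their health:
talosctl -n 10.5.0.2 service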

smira avatar Aug 08 '22 18:08 smira

OK @smira, but in the end the cluster isn't up. It exits with this message:

waiting for etcd to be healthy: OK
◲ waiting for etcd members to be consistent across nodes: rpc error: code = DeadlineExceeded desc = context deadline exceeded
context deadline exceeded
:~#

And that image isn't visible via the docker command when I search for it:

:~# talosctl images
ghcr.io/siderolabs/flannel:v0.18.1
ghcr.io/siderolabs/install-cni:v1.1.0-2-gcb03a5d
docker.io/coredns/coredns:1.9.3
gcr.io/etcd-development/etcd:v3.5.4
k8s.gcr.io/kube-apiserver:v1.24.2
k8s.gcr.io/kube-controller-manager:v1.24.2
k8s.gcr.io/kube-scheduler:v1.24.2
k8s.gcr.io/kube-proxy:v1.24.2
ghcr.io/siderolabs/kubelet:v1.24.2
ghcr.io/siderolabs/installer:v1.1.1
k8s.gcr.io/pause:3.6
:~# docker images
REPOSITORY                 TAG       IMAGE ID       CREATED       SIZE
ghcr.io/siderolabs/talos   v1.1.1    648080035f8c   3 weeks ago   174MB
:~#

glycerin-ce avatar Aug 08 '22 18:08 glycerin-ce

Talos doesn't use Docker to pull images; the issue should be visible in docker logs talos-default-master-N. It might be timing out or hitting some other issue.

It looks like it passed the basic bootstrap; talosctl -n 10.5.0.2 etcd members might help.
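
For example (the container name talos-default-master-1 is an assumption; docker ps shows the exact names):

docker ps --format '{{.Names}}'
docker logs talos-default-master-1 2>&1 | grep -iE 'error|retrying|timeout'
talosctl -n 10.5.0.2 etcd members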

smira avatar Aug 08 '22 19:08 smira

Hi @smira, sorry for the late feedback. If I try to query that node as you suggested, I receive a connection timeout. What can I do to debug this problem?

# talosctl -n 10.5.0.2 etcd members 
error getting members: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing failed to do connect handshake, response: \"HTTP/1.1 502 Connection timed out\\r\\nConnection: close\\r\\nContent-Type: text/html\\r\\n\\r\\n<html><body><h1>502 Connection timed out</h1><p><a href='http://cntlm.sf.net/'>Cntlm</a> proxy failed to complete the request.</p></body></html>\""
# 

By the way, to create the cluster I use these CLI parameters:

# export no_proxy="localhost,127.0.0.1,10.5.0.0/24"; talosctl cluster create --wait --dns-domain "163.162.4.70" --nameservers "163.162.4.70" --config-patch '[{"op": "add", "path": "/machine/env", "value": {"http_proxy": "http://163.162.95.56:3128", "https_proxy": "http://163.162.95.56:3128"}}]'
validating CIDR and reserving IPs
generating PKI and tokens
creating network talos-default
creating master nodes
creating worker nodes
renamed talosconfig context "talos-default" -> "talos-default-64"
waiting for API
bootstrapping cluster
waiting for etcd to be healthy: OK
◱ waiting for etcd members to be consistent across nodes: rpc error: code = DeadlineExceeded desc = context deadline exceeded
context deadline exceeded
# 

But in the end it seems to fail. Any idea why? Thanks in advance.

glycerin-ce avatar Sep 12 '22 16:09 glycerin-ce

You probably need to keep no_proxy=... for the talosctl calls as well, as talosctl tries to go via your proxy, which you don't want. Alternatively, you can unset the http_proxy and https_proxy environment variables.
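
For example, scoping the exclusion to a single invocation (the CIDR matches the default talos-default network):

no_proxy="10.5.0.0/24" talosctl -n 10.5.0.2 etcd members
# or drop the proxy for the whole shell session:
unset http_proxy https_proxy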

smira avatar Sep 12 '22 18:09 smira

You're right @smira. Setting the no_proxy env variable I get:

# talosctl -n 10.5.0.2 etcd members 
NODE       ID                 HOSTNAME                       PEER URLS               CLIENT URLS             LEARNER
10.5.0.2   c3d3020cf75b8728   talos-default-controlplane-1   https://10.5.0.2:2380   https://10.5.0.2:2379   false
# 

glycerin-ce avatar Sep 19 '22 14:09 glycerin-ce

Hi all. I've modified the talosctl cluster create invocation in this way:

...
talosctl cluster create --wait --nameservers "X.Y.X.Y" --config-patch '[{"op": "add", "path": "/machine/env", "value": {"http_proxy": "http://X.X.X.X:xx", "https_proxy": "https://X.X.X.X:xx", "no_proxy": "localhost,127.0.0.1,10.5.0.0/24,0.0.0.0"}}]'
...

and the process seems to complete. This is the output:

~# validating CIDR and reserving IPs
generating PKI and tokens
creating network talos-default
creating controlplane nodes
creating worker nodes
renamed talosconfig context "talos-default" -> "talos-default-83"
waiting for API
bootstrapping cluster
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: OK
waiting for kubelet to be healthy: OK
waiting for all nodes to finish boot sequence: OK
waiting for all k8s nodes to report: OK
waiting for all k8s nodes to report ready: OK
waiting for all control plane components to be ready: OK
waiting for kube-proxy to report ready: OK
waiting for coredns to report ready: OK
waiting for all k8s nodes to report schedulable: OK

merging kubeconfig into "/root/.kube/config"
renamed cluster "talos-default" -> "talos-default-1"
renamed auth info "admin@talos-default" -> "admin@talos-default-1"
renamed context "admin@talos-default" -> "admin@talos-default-1"
PROVISIONER       docker
NAME              talos-default
NETWORK NAME      talos-default
NETWORK CIDR      10.5.0.0/24
NETWORK GATEWAY   10.5.0.1
NETWORK MTU       1500

NODES:

NAME                            TYPE           IP         CPU    RAM      DISK
/talos-default-controlplane-1   controlplane   10.5.0.2   2.00   2.1 GB   -
/talos-default-worker-1         worker         10.5.0.3   2.00   2.1 GB   -
~# 

glycerin-ce avatar Sep 19 '22 16:09 glycerin-ce

Obviously, whenever I use the talosctl command for a query or any other operation I have to set the no_proxy environment variable. But is there a better way to set this variable without changing the user's behavior?

glycerin-ce avatar Sep 19 '22 17:09 glycerin-ce

That's a question about your environment, not really about Talos.

You can configure no_proxy to skip private CIDRs, but this depends on your network environment.
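
For instance, the exclusion could be made persistent for interactive shells so it no longer has to be set per command (the profile path and CIDR list are assumptions about this particular environment):

echo 'export no_proxy="localhost,127.0.0.1,10.5.0.0/24"' >> ~/.bashrc
echo 'export NO_PROXY="localhost,127.0.0.1,10.5.0.0/24"' >> ~/.bashrc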

smira avatar Sep 19 '22 19:09 smira

Yes, you're right @smira, I apologize. I was asking for advice rather than for support.

Thanks. Gabriele

glycerin-ce avatar Sep 21 '22 08:09 glycerin-ce

Okay, thanks. I'm going to close this one as we seem to have a solution.

smira avatar Sep 21 '22 10:09 smira