Sysbox kubernetes install fails behind HTTPS Proxy
When installing sysbox via Kubernetes (in a Rancher 2.6 downstream cluster with k8s 1.21.10) behind an Internet HTTPS proxy, following the instructions on https://github.com/nestybox/sysbox/blob/master/docs/user-guide/install-k8s.md and using Ubuntu 20.04 (latest) as the node OS, all pods on the node(s) where sysbox should be installed report the following issue during "container creation" in Rancher 2.6:
Failed to create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_cattle-node-agent-5mdcl_cattle-system_534ba7cd-b43f-4911-a1a6-4346e0d75d06_0": Error initializing source docker://k8s.gcr.io/pause:3.5: error pinging docker registry k8s.gcr.io: Get "https://k8s.gcr.io/v2/": dial tcp: lookup k8s.gcr.io on 127.0.0.53:53: server misbehaving
Failed to create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_coredns-685d6d555d-wbm64_kube-system_8b8b549e-52e1-44ff-b825-f7c6eee340a4_0": Error initializing source docker://k8s.gcr.io/pause:3.5: error pinging docker registry k8s.gcr.io: Get "https://k8s.gcr.io/v2/": dial tcp: lookup k8s.gcr.io on 127.0.0.53:53: server misbehaving
Is there a workaround to specify proxy information (HTTP_PROXY, HTTPS_PROXY, NO_PROXY) to the installer (install.yml), or is this setup not supported at all behind an Internet proxy?
(Searched the Web but did not find a single hint so far)
@FFock, thanks for filing this issue. I would need to reproduce this one locally to identify the fix, but at first glance I can see how the behavior you're describing could arise. Basically, this would require an extension to our sysbox-deploy daemonset. Will look at it when I have a chance.
Meanwhile I tried to find a workaround, but failed so far.
Our Rancher uses RKEv1 to deploy the downstream Kubernetes cluster which should get sysbox enabled. I tried to define the proxy settings in /etc/default/docker, but this seems to be no longer supported by docker 20.10.7 or by the newer docker API calls used by Rancher RKE.
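For what it's worth, the mechanism the docker documentation describes for daemon-level proxy settings on systemd-based hosts is a drop-in unit rather than /etc/default/docker. A minimal sketch (proxy host/port are placeholders, and this only covers dockerd itself, not the kubelet container):
/etc/systemd/system/docker.service.d/http-proxy.conf
[Service]
Environment="HTTP_PROXY=http://proxy_ip:proxy_port"
Environment="HTTPS_PROXY=http://proxy_ip:proxy_port"
Environment="NO_PROXY=localhost,127.0.0.1"
$ sudo systemctl daemon-reload
$ sudo systemctl restart docker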
So far, it seems that it would be sufficient to apply proxy environment variables (HTTPS_PROXY and NO_PROXY) to the kubelet container. However, this turns out to be more difficult in this combination than expected.
Please note that the docker daemon is properly configured to use our Internet proxy to download docker images; that is working fine. The sysbox configuration, however, tries to download something from the Internet (k8s.gcr.io) directly from the kubelet container, which ultimately affects all containers on the sysbox-enabled node(s).
Looking at your initial logs, it seems that kubelet is struggling to find the pause image, which in cri-o setups is typically configured within cri-o's default config file, /etc/crio/crio.conf.
If that's the case, then cri-o may be unaware of your proxy settings, so as a potential workaround I would try to adjust your cri-o service file as indicated below:
/etc/systemd/system/crio.service
[Service]
Environment="HTTP_PROXY=http://proxy_ip:proxy_port"
Environment="HTTPS_PROXY=http://proxy_ip:proxy_port"
Reload systemd and restart cri-o:
$ sudo systemctl daemon-reload
$ sudo systemctl restart crio
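One caveat: if the main crio unit file lives elsewhere (e.g. under /lib/systemd/system), a full file at /etc/systemd/system/crio.service would shadow it entirely. A drop-in only adds the environment and leaves the rest of the unit untouched; a sketch (same daemon-reload/restart afterwards):
/etc/systemd/system/crio.service.d/http-proxy.conf
[Service]
Environment="HTTP_PROXY=http://proxy_ip:proxy_port"
Environment="HTTPS_PROXY=http://proxy_ip:proxy_port"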
Let me know if that works.
I created the /etc/systemd/system/crio.service file with the config above (adapted of course), but the error is still the same:
Failed to create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_pushprox-kube-controller-manager-client-cprjg_cattle-monitoring-system_048d41c7-d208-43d2-969b-b060103b25df_0": Error initializing source docker://k8s.gcr.io/pause:3.5: error pinging docker registry k8s.gcr.io: Get "https://k8s.gcr.io/v2/": dial tcp: lookup k8s.gcr.io on 127.0.0.53:53: server misbehaving
Pulling the k8s.gcr.io/pause:3.5 image on the node using docker pull does not fix the issue either. Is this initializing code downloading the image in a docker-in-docker environment?
The /etc/crio/crio.conf file is:
[crio]
storage_driver = "overlay"
storage_option = ["overlay.mountopt=metacopy=on"]
[crio.network]
plugin_dirs = ["/opt/cni/bin", "/home/kubernetes/bin"]
[crio.runtime]
cgroup_manager = "cgroupfs"
conmon_cgroup = "pod"
default_capabilities = ["CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "SETUID", "SETGID", "SETPCAP", "SETFCAP", "NET_BIND_SERVICE", "KILL", "AUDIT_WRITE", "NET_RAW", "SYS_CHROOT", "MKNOD"]
pids_limit = 16384
[crio.runtime.runtimes]
[crio.runtime.runtimes.sysbox-runc]
allowed_annotations = ["io.kubernetes.cri-o.userns-mode"]
runtime_path = "/usr/bin/sysbox-runc"
runtime_type = "oci"
During installation, the sysbox-k8s-deploy daemonset replaces docker-shim as the CRI utilized by RKE's kubelet component. That is to say, all the k8s control-plane components are re-instantiated through cri-o, with the exception of kubelet itself, which continues to operate within a regular docker container.
Now, notice that the cri-o daemon being installed runs at the host level, so it could very well be that kubelet is fully aware of your http-proxy settings (as defined through the dockerd config) while the cri-o process is not. That's why I believe we need to extend our sysbox-k8s-deploy daemonset to parse/extract the proper http-proxy settings from the original kubelet (or dockerd) config and generate a crio.conf file (or containers.conf) that reflects the proper http-proxy configuration.
Coming back to your point/question above: it makes sense to me that the problem is not fixed when you fetch the pause image through a docker pull instruction, as all you're doing there is storing the image in docker's image-layer cache. This won't help the pods that cri-o is attempting to launch. Instead, you can rely on the crictl command to do something equivalent from any of your affected k8s/rke nodes:
$ sudo crictl pull k8s.gcr.io/pause:3.5
I'll have more details once I have cycles to recreate the issue.
Ok, thank you for explaining the background. Things are getting clearer now. The problem can indeed be reproduced quite simply:
sudo crictl pull k8s.gcr.io/pause:3.5
FATA[0000] pulling image: rpc error: code = Unknown desc = error pinging docker registry k8s.gcr.io: Get "https://k8s.gcr.io/v2/": dial tcp: lookup k8s.gcr.io on 127.0.0.53:53: server misbehaving
Thus, I will now try to get the image pulled by crictl using some proxy config for it (hopefully this can be done somehow...).
Haven't looked at it in detail, but I believe the proper config file is containers.conf. You will need to add something like this within the "engine" section and then restart crio:
...
[engine]
env = [
"http_proxy=http://1.2.3.4:5678",
"https_proxy=http://1.2.3.4:5678",
"no_proxy=localhost,127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16",
]
...
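For completeness, a minimal sketch of the full file, assuming cri-o picks this up from the default /etc/containers/containers.conf location (I have not verified that it honors the [engine] env entries):
/etc/containers/containers.conf
[engine]
env = [
  "http_proxy=http://1.2.3.4:5678",
  "https_proxy=http://1.2.3.4:5678",
  "no_proxy=localhost,127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16",
]
$ sudo systemctl restart crio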
Problem solved (with workaround):
Adding these two lines to the [Service] section of /etc/systemd/system/crio.service fixes the issue*:
Environment="HTTP_PROXY=http://<proxy>:<port>" "NO_PROXY=0,1,2,3,4,5,6,7,8,9,.svc,.cluster.local,localhost"
Environment="HTTPS_PROXY=http://<proxy>:<port>" "NO_PROXY=0,1,2,3,4,5,6,7,8,9,.svc,.cluster.local,localhost"
*NO_PROXY might be improved with CIDR notation; I do not know yet whether crio supports it.
To activate these settings do:
$ sudo systemctl daemon-reload
$ sudo systemctl restart crio
To verify that pulling now works:
$ sudo crictl pull k8s.gcr.io/pause:3.5
Image is up to date for k8s.gcr.io/pause@sha256:1ff6c18fbef2045af6b9c16bf034cc421a29027b800e4f9b68ae9b1cb3e9ae07
Excellent @FFock. Let's leave this issue open till we implement a proper solution in our sysbox-k8s-deploy daemonset.
There is a similar issue in a non-k8s (plain docker) environment. The workaround is the same: adding the proxy parameters to the daemon or to the config file (~/.docker/config.json) in my home directory (see the docker manual). But it is only a workaround, because it enables the proxy globally for every docker container, which is not the intention.
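For context, the client-side config I am referring to looks roughly like this (proxy values are placeholders; docker injects these as environment variables into every container it runs, which is exactly the global effect mentioned above):
~/.docker/config.json
{
  "proxies": {
    "default": {
      "httpProxy": "http://proxy_ip:proxy_port",
      "httpsProxy": "http://proxy_ip:proxy_port",
      "noProxy": "localhost,127.0.0.1"
    }
  }
}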
@fhaefemeier, not sure I got that. Leaving aside the fact that the original issue deals with a k8s setup and you are referring to a standalone / non-k8s deployment, why do you say that both issues are alike? In other words, what would be the desired behavior here if we can't rely on docker's config files?
My workaround described above using /etc/systemd/system/crio.service apparently has a serious drawback. After adapting the settings further and running
$ sudo systemctl daemon-reload
$ sudo systemctl restart crio
kubelet, kube-apiserver, kube-controller-manager, and kube-scheduler fail to start. A possible cause appears in the kube-apiserver log:
clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/run/crio/crio.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory". Reconnecting...
Any idea, what is going wrong?
@FFock, I don't see how adding two global vars in your crio.service file can cause something like this. It looks like the crio daemon is not able to initialize for some reason.
Do you see any other relevant log that explains why crio is not properly initializing? Perhaps you added an unexpected character to the crio.service file which systemd and/or crio are unable to understand(?).
If you haven't done it yet, I would suggest stopping the crio service first (systemctl stop crio), and then looking carefully at the journalctl -f output while doing a systemctl start crio.
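Concretely, something along these lines (a sketch; run the journal follow in a separate terminal):
$ sudo systemctl stop crio
$ sudo journalctl -u crio -f     # terminal 1
$ sudo systemctl start crio      # terminal 2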
There were several kubelet processes running, and their number seemed to correspond to the number of times I restarted crio. These instances survived reboots and might have been the reason for the "port already in use" errors I saw in the logs as well.
After installing conntrack (apt install conntrack), killing the stray kubelet processes, and then rebooting, the problems with "ports already in use" disappeared.
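The cleanup itself was roughly the following (a sketch; pkill -9 is a blunt instrument, so use with care):
$ pgrep -af kubelet
$ sudo apt install conntrack
$ sudo pkill -9 kubelet
$ sudo reboot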
Starting the crio service does not show any errors that seem to be relevant. The service is running and the file /var/run/crio/crio.sock exists on the host.
Now the only other thing I see is:
server.go:292] "Failed to run kubelet" err="failed to run Kubelet: failed to create kubelet: open /dev/kmsg: no such file or directory"
BTW, I did not add anything to /etc/systemd/system/crio.service, but tried to replace the original NO_PROXY setting with a version using CIDR addresses. Because Rancher RKE normally has its own means to let docker restart the rancher-agent and the Kubernetes containers, I thought that restarting them via a systemctl restart crio could have affected the setup.
Looks like I have to recreate the node, I think it is broken now and I am giving up....
We should have this (the proxy thing) fixed shortly anyways. Will ping you when done.
Having the proxy handling fixed by sysbox would be great, because I just set up a fresh cluster and did the above steps of my workaround, except that I did not do a daemon-reload and restart crio after changing /etc/systemd/system/crio.service.
Instead, I restarted the node. I now get the same issue (kubelet and other k8s processes are not starting anymore) as yesterday.
Thus, the issue is reproducible: a reboot will cause a sysbox-enabled node to fail fatally with the above crio proxy settings in a Rancher 2.6.3 deployed downstream cluster.
A short note about the current status: I was not able to get my workaround running again. The node and its sysbox setup are no longer in a working state; I am not able to deploy anything there. Maybe this is caused by the newer sysbox version I am using with the new cluster, or by something else. I do not think it would be useful to paste recent errors here.
Instead I am now waiting for the official solution for sysbox-on-k8s-behind-internet-proxy.
@rodnymolina, are there any chances that this will be available by the end of next week (or earlier)? We have planned a training, already scheduled, around the new sysbox approach, and next week would be our deadline. Otherwise, we will need to do that training with classic docker-in-docker.
@FFock, will look into this one tomorrow.
@rodnymolina I revisited this issue when preparing our next training environment and hoped I could now use sysbox, but the situation has gotten even worse. Besides this still-unresolved proxy issue, we now get: https://github.com/nestybox/sysbox/issues/567