Sysbox kubernetes install fails behind HTTPS Proxy
When installing sysbox via Kubernetes (in a Rancher 2.6 downstream cluster with k8s 1.21.10) behind an Internet HTTPS proxy, following the instructions on https://github.com/nestybox/sysbox/blob/master/docs/user-guide/install-k8s.md and using Ubuntu 20.04 (latest) as the node OS, all pods on the node(s) where sysbox should be installed report the following issue during "container creation" in Rancher 2.6:
Failed to create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_cattle-node-agent-5mdcl_cattle-system_534ba7cd-b43f-4911-a1a6-4346e0d75d06_0": Error initializing source docker://k8s.gcr.io/pause:3.5: error pinging docker registry k8s.gcr.io: Get "https://k8s.gcr.io/v2/": dial tcp: lookup k8s.gcr.io on 127.0.0.53:53: server misbehaving
Failed to create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_coredns-685d6d555d-wbm64_kube-system_8b8b549e-52e1-44ff-b825-f7c6eee340a4_0": Error initializing source docker://k8s.gcr.io/pause:3.5: error pinging docker registry k8s.gcr.io: Get "https://k8s.gcr.io/v2/": dial tcp: lookup k8s.gcr.io on 127.0.0.53:53: server misbehaving
Is there a workaround to specify proxy information (HTTP_PROXY, HTTPS_PROXY, NO_PROXY) to the installer (install.yml), or is this setup not supported at all behind an Internet proxy?
(Searched the Web but did not find a single hint so far)
@FFock, thanks for filing this issue. I would need to reproduce this one locally to identify the fix, but at first glance I can see how the behavior you're describing could arise. Basically, this would require an extension to our sysbox-deploy daemonset. Will look at it when I have a chance.
Meanwhile I tried to find a workaround, but failed so far.
Our Rancher uses RKEv1 to deploy the downstream Kubernetes cluster which should get sysbox enabled. I tried to define the proxy settings in /etc/default/docker, but this seems to be no longer supported by docker 20.10.7 or by the newer docker API calls used by Rancher RKE.
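For what it's worth, the mechanism the docker documentation describes for daemon-level proxy settings on systemd-based hosts is a drop-in unit rather than /etc/default/docker. A minimal sketch (proxy host/port are placeholders, and this only covers dockerd itself, not the kubelet container):
/etc/systemd/system/docker.service.d/http-proxy.conf
[Service]
Environment="HTTP_PROXY=http://proxy_ip:proxy_port"
Environment="HTTPS_PROXY=http://proxy_ip:proxy_port"
Environment="NO_PROXY=localhost,127.0.0.1"
$ sudo systemctl daemon-reload
$ sudo systemctl restart docker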
So far, it seems that it would be sufficient to apply proxy environment variables (HTTPS_PROXY and NO_PROXY) to the kubelet container. However, this turns out to be more difficult in this combination than expected.
Please note that the docker daemon is properly configured to use our Internet proxy to download docker images; that is working fine. The sysbox configuration, however, tries to download something from the Internet (k8s.gcr.io) directly from the kubelet container, which ultimately affects all containers on the sysbox-enabled node(s).
Looking at your initial logs, it seems that kubelet is struggling to find the pause image, which in cri-o setups is typically configured within cri-o's default config file, /etc/crio/crio.conf.
If that's the case, then cri-o may be unaware of your proxy settings, so as a potential workaround I would try to adjust your cri-o service file as indicated below:
/etc/systemd/system/crio.service
[Service]
Environment="HTTP_PROXY=http://proxy_ip:proxy_port"
Environment="HTTPS_PROXY=http://proxy_ip:proxy_port"
Reload systemd and restart cri-o:
$ sudo systemctl daemon-reload
$ sudo systemctl restart crio
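One caveat: if the main crio unit file lives elsewhere (e.g. under /lib/systemd/system), a full file at /etc/systemd/system/crio.service would shadow it entirely. A drop-in only adds the environment and leaves the rest of the unit untouched; a sketch (same daemon-reload/restart afterwards):
/etc/systemd/system/crio.service.d/http-proxy.conf
[Service]
Environment="HTTP_PROXY=http://proxy_ip:proxy_port"
Environment="HTTPS_PROXY=http://proxy_ip:proxy_port"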
Let me know if that works.
I created the /etc/systemd/system/crio.service file with the config above (adapted of course), but the error is still the same:
Failed to create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_pushprox-kube-controller-manager-client-cprjg_cattle-monitoring-system_048d41c7-d208-43d2-969b-b060103b25df_0": Error initializing source docker://k8s.gcr.io/pause:3.5: error pinging docker registry k8s.gcr.io: Get "https://k8s.gcr.io/v2/": dial tcp: lookup k8s.gcr.io on 127.0.0.53:53: server misbehaving
Pulling the k8s.gcr.io/pause:3.5 image on the node using docker pull does not fix the issue either. Is this initializing code downloading the image in a docker-in-docker environment?
The /etc/crio/crio.conf file is:
[crio]
storage_driver = "overlay"
storage_option = ["overlay.mountopt=metacopy=on"]
[crio.network]
plugin_dirs = ["/opt/cni/bin", "/home/kubernetes/bin"]
[crio.runtime]
cgroup_manager = "cgroupfs"
conmon_cgroup = "pod"
default_capabilities = ["CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "SETUID", "SETGID", "SETPCAP", "SETFCAP", "NET_BIND_SERVICE", "KILL", "AUDIT_WRITE", "NET_RAW", "SYS_CHROOT", "MKNOD"]
pids_limit = 16384
[crio.runtime.runtimes]
[crio.runtime.runtimes.sysbox-runc]
allowed_annotations = ["io.kubernetes.cri-o.userns-mode"]
runtime_path = "/usr/bin/sysbox-runc"
runtime_type = "oci"
During installation, the sysbox-k8s-deploy daemonset replaces docker-shim as the CRI utilized by RKE's kubelet component. That is to say, all the k8s control-plane components are re-instantiated through cri-o, with the exception of kubelet itself, which continues to operate within a regular docker container.
Now, notice that the cri-o daemon being installed runs at the host level, so it could very well be that kubelet is fully aware of your http-proxy settings (as defined through the dockerd config) while the cri-o process is not. That's why I believe we need to extend our sysbox-k8s-deploy daemonset to parse/extract the proper http-proxy settings from the original kubelet (or dockerd) config and generate a crio.conf file (or containers.conf) that reflects the proper http-proxy configuration.
Coming back to your point/question above: it makes sense to me that the problem is not fixed when you fetch the pause image through a docker pull instruction, as all you're doing there is storing the image in docker's image-layer cache. This won't help the pods that cri-o is attempting to launch. Instead, you can rely on the crictl command to do something equivalent from any of your affected k8s/rke nodes:
$ sudo crictl pull k8s.gcr.io/pause:3.5
I'll have more details once I have cycles to recreate the issue.
Ok, thank you for explaining the background. Things are getting clearer now. The problem can indeed be reproduced quite simply:
sudo crictl pull k8s.gcr.io/pause:3.5
FATA[0000] pulling image: rpc error: code = Unknown desc = error pinging docker registry k8s.gcr.io: Get "https://k8s.gcr.io/v2/": dial tcp: lookup k8s.gcr.io on 127.0.0.53:53: server misbehaving
Thus, I will now try to get the image pulled by crictl using some proxy config for it (hopefully this can be done somehow...).
Haven't looked at it in detail, but I believe the proper config file is containers.conf. You will need to add something like this within the "engine" section and then restart crio:
...
[engine]
env = [
"http_proxy=http://1.2.3.4:5678",
"https_proxy=http://1.2.3.4:5678",
"no_proxy=localhost,127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16",
]
...
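For completeness, a minimal sketch of the full file, assuming cri-o picks this up from the default /etc/containers/containers.conf location (I have not verified that it honors the [engine] env entries):
/etc/containers/containers.conf
[engine]
env = [
  "http_proxy=http://1.2.3.4:5678",
  "https_proxy=http://1.2.3.4:5678",
  "no_proxy=localhost,127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16",
]
$ sudo systemctl restart crio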
Problem solved (with workaround):
Adding these two lines to the [Service] section of /etc/systemd/system/crio.service fixes the issue*:
Environment="HTTP_PROXY=http://<proxy>:<port>" "NO_PROXY=0,1,2,3,4,5,6,7,8,9,.svc,.cluster.local,localhost"
Environment="HTTPS_PROXY=http://<proxy>:<port>" "NO_PROXY=0,1,2,3,4,5,6,7,8,9,.svc,.cluster.local,localhost"
*NO_PROXY might be improved with CIDR notation; I do not know yet whether crio supports it.
To activate these settings do:
$ sudo systemctl daemon-reload
$ sudo systemctl restart crio
To verify that pulling now works:
$ sudo crictl pull k8s.gcr.io/pause:3.5
Image is up to date for k8s.gcr.io/pause@sha256:1ff6c18fbef2045af6b9c16bf034cc421a29027b800e4f9b68ae9b1cb3e9ae07
Excellent @FFock. Let's leave this issue open till we implement a proper solution in our sysbox-k8s-deploy daemonset.
There is a similar issue in a non-k8s (plain docker) environment. The workaround is the same: adding the proxy parameters to the daemon or to the config file (~/.docker/config.json) in my home directory (see the docker manual). But it is only a workaround, because it enables the proxy globally for every docker container, which is not the intention.
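For context, the client-side config I am referring to looks roughly like this (proxy values are placeholders; docker injects these as environment variables into every container it runs, which is exactly the global effect mentioned above):
~/.docker/config.json
{
  "proxies": {
    "default": {
      "httpProxy": "http://proxy_ip:proxy_port",
      "httpsProxy": "http://proxy_ip:proxy_port",
      "noProxy": "localhost,127.0.0.1"
    }
  }
}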
@fhaefemeier, not sure I got that. Leaving aside the fact that the original issue deals with a k8s setup and you are referring to a standalone / non-k8s deployment, why do you say that both issues are alike? In other words, what would be the desired behavior here if we can't rely on docker's config files?
My workaround described above using /etc/systemd/system/crio.service apparently has a serious drawback. After adapting the settings further and running
$ sudo systemctl daemon-reload
$ sudo systemctl restart crio
kubelet, kube-apiserver, kube-controller-manager, and kube-scheduler fail to start. A possible cause appears in the kube-apiserver log:
clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {/var/run/crio/crio.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory". Reconnecting...
Any idea, what is going wrong?
@FFock, I don't see how adding two global vars in your crio.service file can cause something like this. It looks like the crio daemon is not able to initialize for some reason.
Do you see any other relevant log that explains why crio is not properly initializing? Perhaps you added an unexpected character to the crio.service file which systemd and/or crio are unable to understand(?).
If you haven't done it yet, I would suggest stopping the crio service first (systemctl stop crio), and then looking carefully at the journalctl -f output while doing a systemctl start crio.
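Concretely, something along these lines (a sketch; run the journal follow in a separate terminal):
$ sudo systemctl stop crio
$ sudo journalctl -u crio -f     # terminal 1
$ sudo systemctl start crio      # terminal 2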
There were several kubelet processes running, and their number seemed to correspond to the number of times I restarted crio. These instances survived reboots and might have been the reason for the "port already in use" errors I saw in the logs as well.
After installing conntrack (apt install conntrack), killing the stray kubelet processes, and then rebooting, the problems with "ports already in use" disappeared.
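The cleanup itself was roughly the following (a sketch; pkill -9 is a blunt instrument, so use with care):
$ pgrep -af kubelet
$ sudo apt install conntrack
$ sudo pkill -9 kubelet
$ sudo reboot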
Starting the crio service does not show any errors that seem to be relevant. The service is running and the file /var/run/crio/crio.sock exists on the host.
Now the only other thing I see is:
server.go:292] "Failed to run kubelet" err="failed to run Kubelet: failed to create kubelet: open /dev/kmsg: no such file or directory"
BTW, I did not add anything to /etc/systemd/system/crio.service, but tried to replace the original NO_PROXY setting with a version using CIDR addresses. Because Rancher RKE normally has its own means to let docker restart the rancher-agent and the Kubernetes containers, I thought that restarting them via a systemctl restart crio could have affected the setup.
Looks like I have to recreate the node, I think it is broken now and I am giving up....
We should have this (the proxy thing) fixed shortly anyways. Will ping you when done.
Having the proxy handling fixed by sysbox would be great, because I just set up a fresh cluster and did the above steps of my workaround, except that I did not do a daemon-reload and restart crio after changing /etc/systemd/system/crio.service.
Instead, I restarted the node. I now get the same issue (kubelet and other k8s processes are not starting anymore) as yesterday.
Thus, the issue is reproducible: a reboot will cause a sysbox-enabled node to fail fatally with the above crio proxy settings in a Rancher 2.6.3 deployed downstream cluster.
A short note about the current status: I was not able to get my workaround running again. The node and its sysbox setup are no longer in a working state; I am not able to deploy anything there. Maybe this is caused by the newer sysbox version I am using with the new cluster, or by something else. I do not think it would be useful to paste recent errors here.
Instead I am now waiting for the official solution for sysbox-on-k8s-behind-internet-proxy.
@rodnymolina, are there any chances that this will be available by the end of next week (or earlier)? We have planned a training, already scheduled, around the new sysbox approach, and next week would be our deadline. Otherwise, we will need to do that training with classic docker-in-docker.
@FFock, will look into this one tomorrow.
@rodnymolina I revisited this issue when preparing our next training environment and hoped I could now use sysbox, but the situation has gotten even worse. Besides this still-unresolved proxy issue, we now get: https://github.com/nestybox/sysbox/issues/567