cmd: kata-local-ci, a script for local CI testing using libvirt


The cmd/kata-local-ci is a libvirt-based script that uses VM snapshotting to cache a pre-setup VM with everything that is required for the CI to run.
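
For readers who want the gist of the caching mechanism, it boils down to standard virsh snapshot handling. A minimal sketch (the VM name, snapshot name and run_full_ci_setup helper below are illustrative, not necessarily what the script uses):

#!/bin/bash
# Sketch of the snapshot-based caching flow (illustrative names).
VM=kata-fedora-ci
SNAPSHOT=initial-snapshot

if sudo virsh snapshot-list "$VM" --name | grep -qx "$SNAPSHOT"; then
    # Cached run: revert to the known-good, pre-setup state.
    sudo virsh snapshot-revert "$VM" "$SNAPSHOT"
else
    # First run: perform the full CI setup, then snapshot the result
    # so that subsequent runs can skip the setup entirely.
    run_full_ci_setup "$VM"   # hypothetical placeholder for the setup steps
    sudo virsh snapshot-create-as "$VM" "$SNAPSHOT"
fi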

Fixes: #4261

Signed-off-by: Christophe de Dinechin [email protected]

c3d avatar Dec 08 '21 10:12 c3d

This is flagged as WIP because in the current incarnation, I have (at least) two problems:

  • On a very slow network, the setup often ends up with an `unable to fork` error for which I have not found the root cause yet (probably some error earlier in the setup script that is not captured).
  • On a fast machine, the cached run (using the snapshot) ends up with `go: Command not found`, and indeed `go`, which was set up correctly in the first pass, is gone. So it looks like the snapshotting did not capture all the installations. I suspect that a missing flush is the reason. Normally, I would use the `--quiesce` option while snapshotting, but that requires installing `qemu-guest-agent`, which itself requires setting up the VM with a virtio-serial channel, etc. So I stopped there and tried without (for reference, a sketch of that route follows below).
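
A minimal sketch of what the `--quiesce` route would involve; this is standard libvirt/QEMU guest-agent wiring, not code from this PR:

# Guest side (Fedora guest assumed): the agent must be installed and running.
sudo dnf install -y qemu-guest-agent
sudo systemctl enable --now qemu-guest-agent

# Host side: the domain needs a virtio-serial channel for the agent,
# added for instance with 'virsh edit' under <devices>:
#
#   <channel type='unix'>
#     <target type='virtio' name='org.qemu.guest_agent.0'/>
#   </channel>

# With that in place, libvirt can freeze the guest file systems while
# taking the snapshot (--quiesce relies on the agent and is used with
# external, disk-only snapshots):
sudo virsh snapshot-create-as "$VM" clean-state --disk-only --quiesce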

The context for the two errors is below, if anybody (@wainersm, @littlejawa) has an idea:

Slow network, `unable to fork` message:

Creating config file /etc/chrony/chrony.keys with new version
invoke-rc.d: could not determine current runlevel
Created symlink /etc/systemd/system/chronyd.service → /lib/systemd/system/chrony.service.
Created symlink /etc/systemd/system/multi-user.target.wants/chrony.service → /lib/systemd/system/chrony.service.
Processing triggers for systemd (245.4-4ubuntu3) ...
Processing triggers for libc-bin (2.31-0ubuntu9) ...
INFO: Create symlink to /tmp in /var to create private temporal directories with systemd
INFO: Install tmp.mount in ./etc/systemd/system
INFO: Create /rootfs/etc
I am ubuntu
INFO: Configure chrony file /rootfs/etc/chrony/chrony.conf
/kata-containers/tools/osbuilder/rootfs-builder/rootfs.sh: line 561: rustup: command not found
/home/kata/go/src/github.com/kata-containers/tests /
Your branch is up to date with 'origin/main'.
Already on 'main'
error: cannot run ssh: No such file or directory
fatal: unable to fork
Failed at 566: bash ${script_dir}/../../../ci/install_rust.sh ${RUST_VERSION}
make: *** [Makefile:93: /home/kata/go/src/github.com/kata-containers/kata-containers/tools/osbuilder/.ubuntu_rootfs.done] Error 1
[install_kata_image.sh:50] ERROR: sudo -E USE_DOCKER= DISTRO=ubuntu make -e image
[install_kata.sh:22] ERROR: .ci/install_kata_image.sh 

Fast network, `go: Command not found` message:

ssh kata@$VM_IPADDR
[kata@kata-fedora-ci ~]$ export GOPATH=~/go
[kata@kata-fedora-ci ~]$ cd go/src/github.com/kata-containers/tests/
[kata@kata-fedora-ci tests]$ .ci/run.sh 
INFO: Running checks
make -C cmd/checkcommits
make[1]: Entering directory '/home/kata/go/src/github.com/kata-containers/tests/cmd/checkcommits'
go test .
make[1]: go: Command not found
make[1]: *** [Makefile:15: checkcommits] Error 127
make[1]: Leaving directory '/home/kata/go/src/github.com/kata-containers/tests/cmd/checkcommits'
make: *** [Makefile:44: checkcommits] Error 2
[run.sh:147] ERROR: sudo -E PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin bash -c make check
[kata@kata-fedora-ci tests]$ sudo dnf install go 
Last metadata expiration check: 0:46:51 ago on Tue 07 Dec 2021 10:41:55 AM EST.
Dependencies resolved.
====================================================================================================================================================
 Package                                    Architecture                 Version                                Repository                     Size
====================================================================================================================================================
Installing:
 golang                                     x86_64                       1.14.15-3.fc32                         updates                       638 k
Installing dependencies:
 annobin                                    x86_64                       9.27-3.fc32                            updates                        91 k
 dwz                                        x86_64                       0.13-2.fc32                            fedora                        109 k
 efi-srpm-macros                            noarch                       4-4.fc32                               fedora                         22 k
 fonts-srpm-macros                          noarch                       2.0.3-1.fc32                           fedora                         26 k
 fpc-srpm-macros                            noarch                       1.3-1.fc32                             fedora                        7.6 k
 ghc-srpm-macros                            noarch                       1.5.0-2.fc32                           fedora                        7.7 k
 gnat-srpm-macros                           noarch                       4-11.fc32                              fedora                        8.2 k
 go-srpm-macros                             noarch                       3.0.10-1.fc32                          updates                        24 k
 golang-bin                                 x86_64                       1.14.15-3.fc32                         updates                        82 M
 golang-src                                 noarch                       1.14.15-3.fc32                         updates                       6.9 M
 lua-srpm-macros                            noarch                       1-3.fc32                               updates                       8.0 k
 mercurial-lang                             x86_64                       5.2-5.fc32                             updates                       1.0 M
 nim-srpm-macros                            noarch                       3-2.fc32                               fedora                        8.3 k
 ocaml-srpm-macros                          noarch                       6-2.fc32                               fedora                        7.7 k
 openblas-srpm-macros                       noarch                       2-7.fc32                               fedora                        7.2 k
 perl-srpm-macros                           noarch                       1-34.fc32                              fedora                        8.3 k
 python-srpm-macros                         noarch                       3-60.fc32                              updates                        16 k
 python27                                   x86_64                       2.7.18-8.fc32                          updates                        11 M
 qt5-srpm-macros                            noarch                       5.14.2

c3d avatar Dec 08 '21 11:12 c3d

Also notice that in the original post-setup run, I see this:

[preflight] Running pre-flight checks
	[WARNING FileExisting-tc]: tc not found in system path
error execution phase preflight: [preflight] Some fatal errors occurred:
	[ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`

c3d avatar Dec 08 '21 11:12 c3d

@c3d, so, we're adding yet another way to run CIs locally, apart from the vagrant one? The question that comes to my mind is: why not make the vagrant one do what you need, so that we avoid duplication?

Also, as it's based on libvirt, I wonder why not taking advantage of things like ... kcli, for instance?

fidencio avatar Dec 08 '21 11:12 fidencio

Also, as it's based on libvirt, I wonder why not taking advantage of things like ... kcli, for instance?

Partially answering my own question, I noticed now that this is based on Alice's work, and it may be less demanding on the internet side, which we know to be something that has hit some of our developers quite hard.

fidencio avatar Dec 09 '21 08:12 fidencio

@c3d, so, we're adding yet another way to run CIs locally, apart from the vagrant one? The question that comes to my mind is: why not make the vagrant one do what you need, so that we avoid duplication?

This is described in the README.

Compared to the `vagrant` approach, the main benefit is that this setup can
 easily be deployed in environments where `vagrant` is not easily available (like
 Red Hat Enterprise Linux or Red Hat CoreOS systems). Also, because of the
 snapshotting, the setup is restored to the "clean" initial state at every run by
 default (this can be disabled by using the `-F` option).

Overall, the goal for me is to be able to run the CI in a reasonable amount of time. Currently it's ~3h, which is way too long, and the vagrant one, when run locally, fails quite often due to timeouts on my ugly network. In theory, I can reuse the vagrant setup from run to run, but then I get strange errors and it's unclear whether they are related to the network or to not starting from a clean state. With the libvirt approach, I'm doing a VM snapshot after setup, and then I restart quickly from a known good state, which reduces the "error surface" quite a bit.

Also, as it's based on libvirt, I wonder why not taking advantage of things like ... kcli, for instance?

I think that you have some clever idea in mind, but I don't get it 😄 What would kcli help me with?

(Here, I'm setting up the CI VM the way Vagrant does, then I'm letting it do its own cluster setup, I'm not changing that part ATM)

c3d avatar Dec 10 '21 11:12 c3d

Some progress on the ssh error:

error: cannot run ssh: No such file or directory
fatal: unable to fork
Failed at 566: bash ${script_dir}/../../../ci/install_rust.sh ${RUST_VERSION}
make: *** [Makefile:93: /home/kata/go/src/github.com/kata-containers/kata-containers/tools/osbuilder/.ubuntu_rootfs.done] Error 1
[install_kata_image.sh:50] ERROR: sudo -E USE_DOCKER= DISTRO=ubuntu make -e image
[install_kata.sh:22] ERROR: .ci/install_kata_image.sh 

That seems to be due to the fact that on that machine, the git remote for origin is actually ssh-based (i.e. it was `git@github.com:kata-containers/tests.git` instead of `https://...`). I'm re-testing after changing that to confirm. @wainersm could you confirm that you never ran this on a machine with a `tests` repo using an ssh-based origin?
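
If that turns out to be the cause, the obvious workaround on the test machine is plain git, nothing specific to this PR:

# Use an https remote so that git operations inside the rootfs build
# environment (which has no ssh installed) do not need to fork ssh:
git -C "$GOPATH/src/github.com/kata-containers/tests" \
    remote set-url origin https://github.com/kata-containers/tests.git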

c3d avatar Dec 10 '21 11:12 c3d

I still get k8s errors due to swap being active:

[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
	[ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`

There is a firstboot-cmd that does a `swapoff -a`, and then the swap entries are removed from /etc/fstab, but despite this, the swap partition is still activated on the next reboot:

[    4.155131] Adding 629756k swap on /dev/vda3.  Priority:-2 extents:1 across:629756k FS
[    4.209091] systemd[1]: Activated swap Swap Partition.

So I am brute-forcing the swapoff from the script itself, along the lines of the sketch below.
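
The brute-force approach amounts to something like this (a sketch; the dev-vda3.swap unit name is inferred from the boot log above, so treat it as an example):

# Disable swap immediately ...
sudo swapoff -a
# ... remove any remaining fstab entries ...
sudo sed -i '/\sswap\s/d' /etc/fstab
# ... and mask the generated swap unit, since systemd can still activate
# the partition (e.g. via GPT auto-discovery) even without an fstab entry.
sudo systemctl mask dev-vda3.swap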

c3d avatar Dec 10 '21 11:12 c3d

Any update on this @c3d?

jodh-intel avatar Jan 06 '22 14:01 jodh-intel

A heads-up that may affect your PR.

In its meeting from January 25th, 2022, the Architecture Committee agreed to use the "Dismiss stale pull request approvals when new commits are pushed" configuration from GitHub. It basically means that if your PR has been rebased or updated, the approvals already given will be erased.

In order to minimize traumas due to the new approach, please consider adding a note on the changes done before the rebase / force-push, and also pinging the reviewers for a subsequent round of reviews.

Thanks for your understanding!

Related issue: https://github.com/kata-containers/community/issues/249

fidencio avatar Jan 26 '22 12:01 fidencio

Any update on this @c3d?

@jodh-intel So far, I have hit two issues that make it hard for me to get this to serve my own purpose:

  1. On my local machine, with a very slow network, the CI setup fails 9 times out of 10. It never fails the same way twice, except that it is almost always a network timeout of some kind.
  2. On Red Hat machines, it most often fails by hitting rate limits with docker.io, and so far it fails very reliably, even when I run something early in the morning, when I would expect the limit of 100 downloads in the last 6 hours should never be hit. I think that docker considers all machines in our lab as a single "user". Anyway, I briefly considered replacing the various images that I see failing with equivalents from other registries (e.g. quay.io), but did not have time to try that yet (one possible mitigation is sketched below).
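
For what it's worth, one standard mitigation for such rate limits, not something this PR implements, is a shared pull-through cache for the whole lab, so that docker.io only sees one pull per unique image. A sketch, with illustrative names:

# Run a local pull-through cache for docker.io; the upstream 'registry:2'
# image supports proxy mode via REGISTRY_PROXY_REMOTEURL:
podman run -d --name dockerio-cache -p 5000:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
    docker.io/library/registry:2

# Point the container tools at the cache (the drop-in file name is
# illustrative; insecure=true because the local cache has no TLS):
sudo tee /etc/containers/registries.conf.d/dockerio-mirror.conf <<'EOF'
[[registry]]
prefix = "docker.io"
location = "docker.io"

[[registry.mirror]]
location = "localhost:5000"
insecure = true
EOF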

c3d avatar Jan 26 '22 15:01 c3d

hi everyone, I don't generally mind if you use part of the work available at https://github.com/alicefr/kata-local-ci-setup. I'd just love it if you could please reference the work and give some credit for it :)

alicefr avatar Jan 26 '22 15:01 alicefr

hi everyone, I don't generally mind if you use part of the work available at https://github.com/alicefr/kata-local-ci-setup. I'd just love it if you could please reference the work and give some credentials for it :)

Hi @alicefr, as you know, I tried using your work and contributed to it. As for giving you some credentials, I believe that the only reason you are looking at this PR is precisely because I did give you credit in a recent comment.

However, your scripts did not really work for me, for a variety of reasons, one of them being the use of ansible, which actually complicated things quite a bit, at least for me. So I designed a different approach, which leverages some code from your setup script (i.e. the virt-install setup), although even that part had to be actively reworked.

Here are some of the key differences in the whole approach:

  • There is a single script instead of two, and it's all shell-based, instead of shell for setup and ansible for the run
  • It's designed to leverage the snapshot / resume features of libvirt, because on my machine, getting a valid setup even once is quite challenging due to bad network conditions, so once that setup is done, I'd rather reuse it.
  • There are no intermediate steps, e.g. setting up a host.yaml file or things like that
  • The VM setup is markedly different in a number of ways

That being said, I have no problem giving you credit. Would you be OK with something like:

# Inspired by https://github.com/alicefr/kata-local-ci-setup (Alice Frosi)

Or do you want something else?

c3d avatar Jan 26 '22 16:01 c3d

@c3d thanks for tagging me. It's hard to follow the kata issues when not contributing regularly. Just a reference to the original repository in a comment or in a commit would be enough for me.

alicefr avatar Jan 27 '22 07:01 alicefr

Also, as it's based on libvirt, I wonder why not taking advantage of things like ... kcli, for instance?

Partially answering my own question, I noticed now that this is based on Alice's work, and it may be less demanding on the internet side, which we know to be something that has hit some of our developers quite hard.

Me in particular 😜 The whole idea is really to be able to do the setup once, suspend the VM, then resume with as little downloading as possible. Getting the initial setup to pass is really problematic for me (it takes over 10 runs to get a successful completion), hence this "very personal" approach.

c3d avatar Jan 31 '22 14:01 c3d

Still making slow progress on that one. After a successful setup, I now get this error on testing:

[reset] Deleting contents of stateful directories: [/var/lib/etcd /var/lib/kubelet /var/lib/dockershim /var/run/kubernetes /var/lib/cni]

The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d

The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.

If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.

The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
[cleanup_env.sh:35] INFO: Teardown the registry server
[cleanup_env.sh:39] INFO: Stop crio service
[cleanup_env.sh:42] INFO: Remove network devices
[cleanup_env.sh:44] INFO: remove device: cni0
[cleanup_env.sh:44] INFO: remove device: flannel.1
[cleanup_env.sh:54] INFO: Check no kata processes are left behind after reseting kubernetes
[cleanup_env.sh:57] INFO: Checks that pods were not left
make: *** [Makefile:57: kubernetes] Error 1
[run.sh:70] ERROR: sudo -E PATH=/usr/local/go/bin:/home/kata/go/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/go/bin bash -c make kubernetes

I see no obvious failure earlier in the run. I will try to run things directly on the VM to see if I can catch what is going wrong.

Successfully tagged localhost/stress-kata:latest
cd3de99e6218e470413942873e08d1433f4e69579ebe73824c50729c6536a367
/home/kata/go/src/github.com/kata-containers/tests/integration/kubernetes
[init.sh:279] INFO: Clean up any leftover CNI configuration
Device "cni0" does not exist.
[init.sh:282] INFO: Start crio service
Created symlink /etc/systemd/system/cri-o.service → /usr/local/lib/systemd/system/crio.service.
Created symlink /etc/systemd/system/multi-user.target.wants/crio.service → /usr/local/lib/systemd/system/crio.service.
● crio.service - Container Runtime Interface for OCI (CRI-O)
     Loaded: loaded (/usr/local/lib/systemd/system/crio.service; enabled; vendor preset: disabled)
     Active: active (running) since Mon 2022-01-31 16:05:32 EST; 11s ago
       Docs: https://github.com/cri-o/cri-o
   Main PID: 146156 (crio)
      Tasks: 11
     Memory: 48.9M
     CGroup: /system.slice/crio.service
             └─146156 /usr/local/bin/crio

Jan 31 16:05:31 kata-fedora-ci crio[146156]: time="2022-01-31 16:05:31.994435281-05:00" level=info msg="RDT not available in the host system" file="rdt/rdt.go:57"
Jan 31 16:05:32 kata-fedora-ci crio[146156]: time="2022-01-31 16:05:32.162923097-05:00" level=info msg="Updated default CNI network name to " file="ocicni/ocicni.go:371"
Jan 31 16:05:32 kata-fedora-ci crio[146156]: time="2022-01-31 16:05:32.164372594-05:00" level=debug msg="reading hooks from /usr/share/containers/oci/hooks.d" file="hooks/read.go:65"
Jan 31 16:05:32 kata-fedora-ci crio[146156]: time="2022-01-31 16:05:32.513770587-05:00" level=debug msg="Golang's threads limit set to 56970" file="server/server.go:348"
Jan 31 16:05:32 kata-fedora-ci crio[146156]: time="2022-01-31 16:05:32.539020556-05:00" level=warning msg="Error encountered when checking whether cri-o should wipe images: version file /var/lib/crio/version not found: open /var/lib/crio/version: no such file or directory" file="server/server.go:529"
Jan 31 16:05:32 kata-fedora-ci crio[146156]: time="2022-01-31 16:05:32.539660750-05:00" level=debug msg="Sandboxes: []" file="server/server.go:496"
Jan 31 16:05:32 kata-fedora-ci crio[146156]: time="2022-01-31 16:05:32.540012916-05:00" level=debug msg="Registered SIGHUP watcher for config" file="config/reload.go:32"
Jan 31 16:05:32 kata-fedora-ci crio[146156]: time="2022-01-31 16:05:32.540177350-05:00" level=debug msg="Metrics are disabled" file="server/server.go:507"
Jan 31 16:05:32 kata-fedora-ci crio[146156]: time="2022-01-31 16:05:32.797541457-05:00" level=debug msg="monitoring \"/usr/share/containers/oci/hooks.d\" for hooks" file="hooks/monitor.go:43"
Jan 31 16:05:32 kata-fedora-ci systemd[1]: Started Container Runtime Interface for OCI (CRI-O).
[init.sh:285] INFO: Start Kubernetes
[init.sh:175] INFO: Init cluster using /var/run/crio/crio.sock
# zram swap disabled
[init] Using Kubernetes version: v1.23.1
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kata-fedora-ci kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 192.168.122.77]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [kata-fedora-ci localhost] and IPs [192.168.122.77 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [kata-fedora-ci localhost] and IPs [192.168.122.77 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[apiclient] All control plane components are healthy after 111.300229 seconds
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config-1.23" in namespace kube-system with the configuration for the kubelets in the cluster
NOTE: The "kubelet-config-1.23" naming of the kubelet ConfigMap is deprecated. Once the UnversionedKubeletConfigMap feature gate graduates to Beta the default name will become just "kubelet-config". Kubeadm upgrade will handle this transition transparently.
[upload-certs] Skipping phase. Please see --upload-certs
[mark-control-plane] Marking the node kata-fedora-ci as control-plane by adding the labels: [node-role.kubernetes.io/master(deprecated) node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
[mark-control-plane] Marking the node kata-fedora-ci as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule]
[bootstrap-token] Using token: 1ocb1i.gm4laetot1eod8dh
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.122.77:6443 --token 1ocb1i.gm4laetot1eod8dh \
	--discovery-token-ca-cert-hash sha256:bc89fb960eff4f743d5f448d4f88bafae39021aaa0cec63578ef18a98099216f 
[init.sh:203] INFO: Probing kubelet (timeout=240s)
NAME             STATUS     ROLES                  AGE   VERSION
kata-fedora-ci   NotReady   control-plane,master   51s   v1.23.1
[init.sh:288] INFO: Configure the cluster network
[init.sh:132] INFO: Use flannel v0.14.0
[init.sh:135] INFO: Use configuration file from https://raw.githubusercontent.com/coreos/flannel/v0.14.0/Documentation/kube-flannel.yml
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
podsecuritypolicy.policy/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds created
[init.sh:292] INFO: Wait for system's pods be ready and running
kube-system   kube-apiserver-kata-fedora-ci            1/1     Running             0          34s
kube-system   kube-controller-manager-kata-fedora-ci   1/1     Running             0          29s
kube-system   etcd-kata-fedora-ci                      1/1     Running             0          30s
kube-system   kube-scheduler-kata-fedora-ci            1/1     Running             0          57s
kube-system   coredns-64897985d-jq8f4                  1/1     Running   0             100s
[init.sh:295] INFO: Create kata RuntimeClass resource
runtimeclass.node.k8s.io/kata created
[init.sh:33] INFO: Taint 'NoSchedule' is found. Untaint the node so pods can be scheduled.
node/kata-fedora-ci untainted
[run_kubernetes_tests.sh:106] INFO: Run tests
1..1
ok 1 Running with postStart and preStop handlers
1..1
ok 1 Block Storage Support
1..1
ok 1 Check capabilities of pod
1..1
ok 1 ConfigMap for a pod
1..2
ok 1 Copy file in a pod
ok 2 Copy from pod to host
1..1
ok 1 Check CPU constraints
1..1
ok 1 Credentials using secrets
1..1
ok 1 Check custom dns
1..2
ok 1 Empty dir volumes
ok 2 Empty dir volume when FSGroup is specified with non-root container
1..1
ok 1 Environment variables
1..1
ok 1 Kubectl exec
1..1
ok 1 Expose IP Address
1..1
not ok 1 configmap update works, and preserves symlinks
# (in test file k8s-inotify.bats, line 27)
#   `kubectl apply -f "${pod_config_dir}"/inotify-updated-configmap.yaml' failed
# [bats-exec-test:32] INFO: k8s configured to use runtimeclass
# configmap/cm created
# pod/inotify-configmap-testing created
# pod/inotify-configmap-testing condition met
# error: error when retrieving current configuration of:
# Resource: "/v1, Resource=configmaps", GroupVersionKind: "/v1, Kind=ConfigMap"
# Name: "cm", Namespace: "default"
# from server for: "/home/kata/go/src/github.com/kata-containers/tests/integration/kubernetes/runtimeclass_workloads/inotify-updated-configmap.yaml": Get "https://192.168.122.77:6443/api/v1/namespaces/default/configmaps/cm": dial tcp 192.168.122.77:6443: connect: connection refused
# The connection to the server 192.168.122.77:6443 was refused - did you specify the right host or port?
# The connection to the server 192.168.122.77:6443 was refused - did you specify the right host or port?
[run_kubernetes_tests.sh:109] ERROR: bats k8s-inotify.bats
[run_kubernetes_tests.sh:71] INFO: Run the cleanup routine
[cleanup_env.sh:31] INFO: Reset Kubernetes
[reset] Reading configuration from the cluster...
[reset] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
W0131 16:25:54.374935  154293 reset.go:101] [reset] Unable to fetch the kubeadm-config ConfigMap from cluster: failed to get config map: Get "https://192.168.122.77:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config?timeout=10s": dial tcp 192.168.122.77:6443: connect: connection refused
W0131 16:25:54.375325  154293 removeetcdmember.go:80] [reset] No kubeadm config, using etcd pod spec to get data directory
[preflight] Running pre-flight checks
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
[reset] Deleting contents of stateful directories: [/var/lib/etcd /var/lib/kubelet /var/lib/dockershim /var/run/kubernetes /var/lib/cni]

The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d

The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.

If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.

The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
[cleanup_env.sh:35] INFO: Teardown the registry server
[cleanup_env.sh:39] INFO: Stop crio service
[cleanup_env.sh:42] INFO: Remove network devices
[cleanup_env.sh:44] INFO: remove device: cni0
[cleanup_env.sh:44] INFO: remove device: flannel.1
[cleanup_env.sh:54] INFO: Check no kata processes are left behind after reseting kubernetes
[cleanup_env.sh:57] INFO: Checks that pods were not left
make: *** [Makefile:57: kubernetes] Error 1
[run.sh:70] ERROR: sudo -E PATH=/usr/local/go/bin:/home/kata/go/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/go/bin bash -c make kubernetes
cmd/kata-local-ci -f  40.08s user 9.52s system 0% cpu 2:09:05.62 total

c3d avatar Feb 01 '22 10:02 c3d

/test

GabyCT avatar Feb 23 '22 16:02 GabyCT

/test

GabyCT avatar Mar 10 '22 20:03 GabyCT

Hi @c3d - is this still wip?

jodh-intel avatar Mar 25 '22 08:03 jodh-intel

After getting a better network at home, I finally had some success with the script:

* STEP: start kata-monitor
OK: start kata-monitor (131387)

* STEP: kata-monitor cache update checks
.OK: retrieve d63955fe3a79d457cf6acf9c0bae4db5c6d6bc7a75c6cb9db639f823fb695991 in kata-monitor cache
....................OK: runc pod 5b55f85566c130cdc48576cfe9117c0f5887e066db30722b4fa8539e037da1a4 skipped from kata-monitor cache

* STEP: kata-monitor metrics retrieval
OK: retrieve metrics from kata-monitor
OK: retrieve metrics for pod d63955fe3a79d457cf6acf9c0bae4db5c6d6bc7a75c6cb9db639f823fb695991 - found #1141 metrics

* STEP: remove kata workload
OK: stop workload (kata)

* STEP: kata-monitor cache update checks (removal)
OK: verify removal of d63955fe3a79d457cf6acf9c0bae4db5c6d6bc7a75c6cb9db639f823fb695991 from kata-monitor cache

kata-monitor testing: PASSED!

cmd/kata-local-ci -f  37.28s user 8.76s system 1% cpu 49:43.30 total

This is good news to me. Now I need to rework the patches before submitting them. Also, I was only successful with the Fedora CI so far; the others are not really tested.

c3d avatar Mar 31 '22 15:03 c3d

The cached version still fails. The good news is that it fails in less than 3 minutes (as opposed to ~50 minutes for the original run), which suggests that the caching mechanism is worth it:

[reset] Deleting contents of stateful directories: [/var/lib/dockershim /var/run/kubernetes /var/lib/cni]

The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d

The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.

If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.

The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
[cleanup_env.sh:35] INFO: Teardown the registry server
[cleanup_env.sh:39] INFO: Stop crio service
[cleanup_env.sh:42] INFO: Remove network devices
[cleanup_env.sh:44] INFO: remove device: cni0
Cannot find device "cni0"
Cannot find device "cni0"
[cleanup_env.sh:44] INFO: remove device: flannel.1
Cannot find device "flannel.1"
Cannot find device "flannel.1"
[cleanup_env.sh:54] INFO: Check no kata processes are left behind after reseting kubernetes
[cleanup_env.sh:57] INFO: Checks that pods were not left
Not /run/vc/sbs directory found
make: *** [Makefile:57: kubernetes] Error 125
[run.sh:66] ERROR: sudo -E PATH=/usr/local/go/bin:/home/kata/go/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/go/bin bash -c make kubernetes
cmd/kata-local-ci  0.17s user 0.18s system 0% cpu 2:26.90 total

c3d avatar Mar 31 '22 16:03 c3d

Apparently, part of the problem is that when the VM image is restored, the time in the VM is not correct. I get this kind of error along the way:

STEP 1/2: FROM quay.io/libpod/ubuntu:latest
STEP 2/2: RUN apt-get -y update &&     apt-get -y upgrade &&     apt-get -y --no-install-recommends install stress &&     apt-get clean &&     rm -rf /var/lib/apt/lists/*
Get:1 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal InRelease [265 kB]
Get:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Get:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease [108 kB]
Get:5 http://archive.ubuntu.com/ubuntu focal/restricted amd64 Packages [33.4 kB]
Get:6 http://archive.ubuntu.com/ubuntu focal/universe amd64 Packages [11.3 MB]
Get:7 http://archive.ubuntu.com/ubuntu focal/multiverse amd64 Packages [177 kB]
Get:8 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages [1275 kB]
Reading package lists...
E: Release file for http://security.ubuntu.com/ubuntu/dists/focal-security/InRelease is not valid yet (invalid for another 13min 46s). Updates for this repository will not be applied.
E: Release file for http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease is not valid yet (invalid for another 14min 25s). Updates for this repository will not be applied.
E: Release file for http://archive.ubuntu.com/ubuntu/dists/focal-backports/InRelease is not valid yet (invalid for another 15min 30s). Updates for this repository will not be applied.
Error: error building at STEP "RUN apt-get -y update &&     apt-get -y upgrade &&     apt-get -y --no-install-recommends install stress &&     apt-get clean &&     rm -rf /var/lib/apt/lists/*": error while running runtime: exit status 100
[run_kubernetes_tests.sh:102] ERROR: bash ./init.sh

According to this discussion, this indicates a time sync issue.

c3d avatar Mar 31 '22 17:03 c3d

In order to sync the clock, the correct approach seems to be `hwclock --hctosys`; I also tried `chronyc makestep`, but that alone does not seem sufficient. The container builds now work, even with podman.
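
In script form, the post-revert clock fix looks something like this sketch (reusing the $VM_IPADDR convention from earlier in the thread):

# After reverting, the guest's system time is stale but its RTC is
# host-provided and correct, so copy the hardware clock to system time,
# then let chrony step whatever residual offset remains:
sudo virsh snapshot-revert "$VM" initial-snapshot
ssh kata@"$VM_IPADDR" 'sudo hwclock --hctosys && sudo chronyc makestep'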

c3d avatar Mar 31 '22 18:03 c3d

Got a successful run in cached mode: about 21 minutes, down from about 50 minutes for the non-cached one with full VM setup. So over half of the time is saved. This was worth the effort.


* STEP: kata-monitor metrics retrieval
OK: retrieve metrics from kata-monitor
OK: retrieve metrics for pod 701565541ea4f2fd52d10ad2883f516e35d91613f3d4442c3a9b26e7c2155fa4 - found #1141 metrics

* STEP: remove kata workload
OK: stop workload (kata)

* STEP: kata-monitor cache update checks (removal)
OK: verify removal of 701565541ea4f2fd52d10ad2883f516e35d91613f3d4442c3a9b26e7c2155fa4 from kata-monitor cache

kata-monitor testing: PASSED!

cmd/kata-local-ci  0.19s user 0.23s system 0% cpu 20:58.06 total

c3d avatar Apr 01 '22 07:04 c3d

On turbo, a faster machine but one where VMs are backed by ZFS, I got a successful non-cached run, but a much slower one:

cmd/kata-local-ci -f  46.60s user 13.79s system 1% cpu 1:23:55.67 total

c3d avatar Apr 01 '22 09:04 c3d

On big, I got a failure in cached mode after several iterations, so it's not stable yet.

[snip]
#   Volumes:        <none>
# Events:
#   Type    Reason            Age   From                    Message
#   ----    ------            ----  ----                    -------
#   Normal  SuccessfulCreate  1s    replication-controller  Created pod: replicationtest-vh628
# replicationcontroller "replicationtest" deleted

Investigating what causes this kind of failure, it turns out this is a dup of #4653.

c3d avatar Apr 01 '22 09:04 c3d

CentOS fails (predictably).

[ 187.5] Installing packages: git rsync curl make tar
CentOS-8 - AppStream                             92  B/s |  38  B     00:00    
Error: Failed to download metadata for repo 'AppStream': Cannot prepare internal mirrorlist: No URLs in mirrorlist
virt-builder: error: dnf -y install 'git' 'rsync' 'curl' 'make' 'tar': 
command exited with an error

If reporting bugs, run virt-builder with debugging enabled and include the 
complete output:

  virt-builder -v -x [...]
cmd/kata-local-ci -d centos  44.90s user 10.13s system 28% cpu 3:09.95 total
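
This looks like the well-known CentOS Linux 8 EOL breakage: the mirrorlists were emptied when CentOS 8 went EOL at the end of 2021. Assuming that is the cause, the usual workaround is to repoint the repos at the vault archive, e.g.:

# Inside the guest, or via virt-builder --run-command: replace the dead
# mirrorlist entries with the archived vault repositories.
sudo sed -i -e 's/^mirrorlist=/#mirrorlist=/' \
            -e 's|^#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|' \
            /etc/yum.repos.d/CentOS-*.repo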

c3d avatar Apr 01 '22 16:04 c3d

Ubuntu also fails, but this one is more easily fixable:

Building dependency tree...
Reading state information...

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

E: Unable to locate package tc
cmd/kata-local-ci -d ubuntu  62.01s user 18.55s system 9% cpu 13:55.21 total
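
The likely explanation: on Ubuntu there is no package named tc; the tc binary ships in iproute2 (on Fedora it lives in iproute-tc). So the package list needs a small per-distro mapping, roughly like this sketch (the $DISTRO variable is a hypothetical stand-in for however the script identifies the distro):

# Hypothetical sketch of a per-distro mapping for the 'tc' dependency:
case "$DISTRO" in
    ubuntu*|debian*) sudo apt-get install -y iproute2 ;;  # provides tc
    fedora*|centos*) sudo dnf install -y iproute-tc   ;;  # provides tc
    *) echo "unsupported distro: $DISTRO" >&2; exit 1 ;;
esac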

c3d avatar Apr 01 '22 16:04 c3d

Hi @c3d - is this still wip?

@jodh-intel Still WIP, but now finally making progress, i.e. it works quite often when I run it at home.

The reason for the WIP status is that the branch currently contains various minor fixes found along the way, e.g. for #4653, and an improvement I was using to find an alternative to CentOS (AlmaLinux, #4661). I need to clean up a bit and put these unrelated things in different branches.

c3d avatar Apr 04 '22 15:04 c3d

Had a success with fedora32.

time cmd/kata-local-ci -d fedora32
******************************************** [11:13:12]
*
* Reverting to initial snapshot of kata-fedora32-ci
*
********************************************
# sudo virsh snapshot-revert kata-fedora32-ci initial-snapshot

******************************************** [11:14:12]
[...]
* STEP: remove kata workload
OK: stop workload (kata)

* STEP: kata-monitor cache update checks (removal)
OK: verify removal of 5114fd77d24e944e50f2ae97616dceaf6d9d97a2d92490e2c1343b0149e9823b from kata-monitor cache

kata-monitor testing: PASSED!

cmd/kata-local-ci -d fedora32  0.24s user 0.49s system 0% cpu 39:25.72 total

c3d avatar Apr 05 '22 12:04 c3d

Still failing with ubuntu20.04:

[  45.6] Resizing (using virt-resize) to expand the disk to 50.0G
virt-resize: error: /dev/sda5: partition not found in the source disk image 
(this error came from ‘--expand’ option on the command line).  Try
running this command: virt-filesystems --partitions --long -a 
/var/tmp/vbeeb92e.img

If reporting bugs, run virt-resize with debugging enabled and include the 
complete output:

  virt-resize -v -x [...]

This is really an issue with libguestfs not knowing how to deal with logical partitions, which apparently became mandatory for Ubuntu 20.04.
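
For anyone hitting the same wall, the inspection command the error message suggests is enough to confirm the layout problem:

# List the partitions in the source image; a logical partition
# (/dev/sda5 and up, inside an extended partition) is one that
# virt-resize cannot --expand.
virt-filesystems --partitions --long -a /var/tmp/vbeeb92e.img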

c3d avatar Apr 05 '22 13:04 c3d