Automate recoverability tests
/kind user-story
User Story
As a quality engineer, I want to run automated recoverability tests, so that we can verify in an automated fashion that interactive mode recovers from a disconnection from the cluster in different situations.
Acceptance Criteria
- [ ] It should cover recoverability test scenarios defined in #5654
- [ ] Update `.ibm/pipelines/kubernetes-tests.sh` and `.ibm/pipelines/openshift-tests.sh`
Links
- Related Epic (mandatory):
Tried the following approaches to emulate disconnection from the cluster:
- using `oc logout`; however, this option somewhat messes up the test process, and the logout takes a while to complete, so before it finishes the next code change is detected and a push is started and completed
- changing the `Kubeconfig` file to remove references to the cluster under use; however, the content of `Kubeconfig` is loaded in memory during the test session, so this approach does not work
- using the `iptables` command to drop the cluster's IP address; however, this became a mess (a sketch of this kind of rule follows this list)

It might be less expensive to just run the tests manually.
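For illustration only, since this approach was abandoned: a minimal sketch of the kind of rule involved, with API_SERVER_IP as a placeholder for the cluster's API endpoint.
# Placeholder for the cluster API server address
API_SERVER_IP=203.0.113.10
# Emulate disconnection by dropping outgoing packets to the API server
sudo iptables -A OUTPUT -d "$API_SERVER_IP" -j DROP
# ... run the disconnection scenario here ...
# Delete the rule again to restore connectivity
sudo iptables -D OUTPUT -d "$API_SERVER_IP" -j DROP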
We will try to simulate network latency instead. Reviewing the article suggested by Armel: https://medium.com/@kazushi/simulate-high-latency-network-using-docker-container-and-tc-commands-a3e503ea4307
Getting the following error when trying to add the delay:
$ podman exec client tc qdisc add dev eth0 root netem delay 100ms
Error: Specified qdisc kind is unknown.
Armel suggested using Docker instead of Podman.
Just a hypothesis, but I don't think this would solve the issue. Instead, I suspect an issue with the VirtualBox VM you were using. Did you manage to install the Guest Additions and retry, to see if the kernel module was properly there and loaded in the VM?
For the record, below is the (long) command I used to quickly test adding latency to the default interface of a local container. The downside is that it requires the sch_netem kernel module to be loaded on the host and the NET_ADMIN capability in the container:
podman container run --rm --cap-add=NET_ADMIN \
-it alpine:3.16 \
/bin/sh -c 'apk add --no-cache iproute2-tc iputils && \
echo === No extra latency === && \
ping -c 1 1.1.1.1 && \
echo && \
echo === Latency: 7s === && \
tc qdisc add dev $(ip route | grep default | sed -e "s/^.*dev.//" -e "s/.proto.*//") root netem delay 7000ms && \
ping -c 1 1.1.1.1'
The Guest Additions were already installed; I upgraded the VM from fedora34ws to fedora36ws and reinstalled them, but I still get the same error:
sudo podman container run --rm --cap-add=NET_ADMIN \
-it alpine:3.16 \
/bin/sh -c 'apk add --no-cache iproute2-tc iputils && \
echo === No extra latency === && \
ping -c 1 1.1.1.1 && \
echo && \
echo === Latency: 7s === && \
tc qdisc add dev $(ip route | grep default | sed -e "s/^.*dev.//" -e "s/.proto.*//") root netem delay 7000ms && \
ping -c 1 1.1.1.1'
We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:
#1) Respect the privacy of others.
#2) Think before you type.
#3) With great power comes great responsibility.
[sudo] password for rnapoles:
Resolved "alpine" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull docker.io/library/alpine:3.16...
Getting image source signatures
Copying blob 530afca65e2e done
Copying config d7d3d98c85 done
Writing manifest to image destination
Storing signatures
fetch https://dl-cdn.alpinelinux.org/alpine/v3.16/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.16/community/x86_64/APKINDEX.tar.gz
(1/10) Installing libcap (2.64-r0)
(2/10) Installing libbz2 (1.0.8-r1)
(3/10) Installing fts (1.2.7-r1)
(4/10) Installing xz-libs (5.2.5-r1)
(5/10) Installing libelf (0.186-r0)
(6/10) Installing libmnl (1.0.5-r0)
(7/10) Installing libnftnl (1.2.1-r0)
(8/10) Installing iptables (1.8.8-r1)
(9/10) Installing iproute2-tc (5.17.0-r0)
(10/10) Installing iputils (20211215-r0)
Executing busybox-1.35.0-r15.trigger
OK: 10 MiB in 24 packages
=== No extra latency ===
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=61 time=23.0 ms
--- 1.1.1.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 23.016/23.016/23.016/0.000 ms
=== Latency: 7s ===
Error: Specified qdisc not found.
What does this command return in your Fedora VM?
grep -i NETEM /boot/config-`uname -r`
$ grep -i NETEM /boot/config-`uname -r`
CONFIG_NET_SCH_NETEM=m
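`=m` means netem is compiled as a loadable module rather than built into the kernel, so it may simply not be loaded on the host. Since containers share the host kernel, something along these lines (untested on the affected VM) is worth trying:
# Check whether the sch_netem module is currently loaded on the host
lsmod | grep netem
# If there is no output, load it (requires root), then retry the tc command
sudo modprobe sch_netem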
I will try in a Linux-based container.
PR submitted
We found that tc commands cannot be used in the IBM Cloud. Found https://github.com/Shopify/toxiproxy/tree/master/client, which can emulate latency and network disconnection. Will try using it.
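For context, roughly what driving Toxiproxy from the shell could look like; the proxy name, ports, and upstream address below are placeholders, and the CLI invocations follow the Toxiproxy README rather than anything verified against our setup:
# Start the Toxiproxy server (control API on localhost:8474 by default)
toxiproxy-server &
# Put a proxy in front of the cluster API server; tests then target localhost:26443
toxiproxy-cli create -l localhost:26443 -u api.my-cluster.example.com:6443 kube-api
# Inject 2s of latency into traffic flowing through the proxy
toxiproxy-cli toxic add -t latency -a latency=2000 kube-api
# Emulate a full disconnection, and later a recovery
toxiproxy-cli toggle kube-api
toxiproxy-cli toggle kube-api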
Armel found https://github.com/jamesmoriarty/goforward, which can be used with `kubectl config set-cluster kind-local-k8s-cluster --proxy-url=http://localhost:8888/` to set the proxy.
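Sketching the round trip, assuming goforward is already listening on localhost:8888 as above:
# Route kubectl/test traffic for the cluster through the local forward proxy
kubectl config set-cluster kind-local-k8s-cluster --proxy-url=http://localhost:8888/
# ... stop or throttle goforward here to emulate disconnection or latency ...
# Remove the proxy entry so traffic reaches the cluster directly again
kubectl config unset clusters.kind-local-k8s-cluster.proxy-url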
We have to run the recoverability tests independently of the other tests, so that the added latency does not impact them.
The challenge is that we need to start the tests without latency, then add latency, and then remove it again; this implies making changes to the kubeconfig file. Anand will take over this issue; he will try to create helper functions to achieve the above (a rough sketch below).
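A possible shape for such helpers; the function names are made up, and the cluster name and proxy URL are the ones from the goforward comment above:
# Hypothetical helpers around the kubeconfig proxy-url trick
CLUSTER=kind-local-k8s-cluster
PROXY_URL=http://localhost:8888/

enable_latency() {
  # Point the kubeconfig entry for the cluster at the local latency proxy
  kubectl config set-cluster "$CLUSTER" --proxy-url="$PROXY_URL"
}

disable_latency() {
  # Drop the proxy entry so traffic goes to the cluster directly again
  kubectl config unset "clusters.$CLUSTER.proxy-url"
}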