Automate recoverability tests

Open rnapoles-rh opened this issue 3 years ago • 11 comments

/kind user-story

User Story

As a quality engineer I want to run automated recoverability tests So that we can ensure, in an automated fashion, that interactive mode recovers from disconnections from the cluster in different situations

Acceptance Criteria

  • [ ] It should cover the recoverability test scenarios defined in #5654
  • [ ] Update .ibm/pipelines/kubernetes-tests.sh and .ibm/pipelines/openshift-tests.sh

Links

  • Related Epic (mandatory):

rnapoles-rh avatar Apr 12 '22 12:04 rnapoles-rh

Tried the following approaches to emulate disconnection from the cluster:

  • using oc logout; however, this somewhat messes up the test process: the logout takes a while to complete, so before it finishes the next code change is detected and a push is started and completed
  • editing the kubeconfig file to remove the references to the cluster in use; however, the kubeconfig content is loaded into memory during the test session, so this approach does not work
  • using iptables rules to drop traffic to the cluster's IP address (roughly sketched below); however, this became a mess

It might be less expensive to just run the tests manually.
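
For illustration, the iptables attempt amounted to something like the following (a hypothetical sketch; <API_SERVER_IP> stands for the cluster API server address, e.g. the host printed by oc whoami --show-server):

sudo iptables -A OUTPUT -d <API_SERVER_IP> -j DROP   # drop outgoing traffic to the API server
# ... exercise the interactive-mode scenario that should notice the disconnection ...
sudo iptables -D OUTPUT -d <API_SERVER_IP> -j DROP   # remove the rule to restore connectivity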

rnapoles-rh avatar May 10 '22 13:05 rnapoles-rh

We will try to simulate network latency, following the article suggested by Armel: https://medium.com/@kazushi/simulate-high-latency-network-using-docker-containerand-tc-commands-a3e503ea4307
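
For reference, the technique from the article boils down to something like the following, run inside a container that has iproute2 and the NET_ADMIN capability (eth0 is the interface name used in the article and is an assumption here):

tc qdisc add dev eth0 root netem delay 100ms   # add 100 ms of artificial latency on eth0
tc qdisc del dev eth0 root                     # remove the qdisc again to restore normal latency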

rnapoles-rh avatar Jul 27 '22 12:07 rnapoles-rh

Getting the following error when trying to add the delay:

$ podman exec client tc qdisc add dev eth0 root netem delay 100ms
Error: Specified qdisc kind is unknown.

Armel suggested using Docker instead of Podman.

rnapoles-rh avatar Aug 04 '22 12:08 rnapoles-rh

Just a hypothesis, but I don't think this would solve the issue. Instead, I suspect an issue with the VirtualBox VM you were using. Did you manage to install the Guest Additions and retry, to see if the kernel module was properly there and loaded in the VM?

For the record, below is the (long) command I used to quickly test adding latency to the default interface of a local container. The downside is that it requires the sch_netem kernel module to be loaded on the host and the NET_ADMIN capability in the container:

podman container run --rm --cap-add=NET_ADMIN \
  -it alpine:3.16 \
  /bin/sh -c 'apk add --no-cache iproute2-tc iputils && \
    echo === No extra latency === && \
    ping -c 1 1.1.1.1 && \
    echo && \
    echo === Latency: 7s === && \
    tc qdisc add dev $(ip route | grep default | sed -e "s/^.*dev.//" -e "s/.proto.*//") root netem delay 7000ms && \
    ping -c 1 1.1.1.1'

rm3l avatar Aug 05 '22 08:08 rm3l

The Guest Additions were already installed. I upgraded the VM from fedora34ws to fedora36ws and reinstalled the Guest Additions, but I still get the same error:

sudo podman container run --rm --cap-add=NET_ADMIN \
  -it alpine:3.16 \
  /bin/sh -c 'apk add --no-cache iproute2-tc iputils && \
    echo === No extra latency === && \
    ping -c 1 1.1.1.1 && \
    echo && \
    echo === Latency: 7s === && \
    tc qdisc add dev $(ip route | grep default | sed -e "s/^.*dev.//" -e "s/.proto.*//") root netem delay 7000ms && \
    ping -c 1 1.1.1.1'

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

    #1) Respect the privacy of others.
    #2) Think before you type.
    #3) With great power comes great responsibility.

[sudo] password for rnapoles: 
Resolved "alpine" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull docker.io/library/alpine:3.16...
Getting image source signatures
Copying blob 530afca65e2e done  
Copying config d7d3d98c85 done  
Writing manifest to image destination
Storing signatures
fetch https://dl-cdn.alpinelinux.org/alpine/v3.16/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.16/community/x86_64/APKINDEX.tar.gz
(1/10) Installing libcap (2.64-r0)
(2/10) Installing libbz2 (1.0.8-r1)
(3/10) Installing fts (1.2.7-r1)
(4/10) Installing xz-libs (5.2.5-r1)
(5/10) Installing libelf (0.186-r0)
(6/10) Installing libmnl (1.0.5-r0)
(7/10) Installing libnftnl (1.2.1-r0)
(8/10) Installing iptables (1.8.8-r1)
(9/10) Installing iproute2-tc (5.17.0-r0)
(10/10) Installing iputils (20211215-r0)
Executing busybox-1.35.0-r15.trigger
OK: 10 MiB in 24 packages
=== No extra latency ===
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=61 time=23.0 ms

--- 1.1.1.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 23.016/23.016/23.016/0.000 ms

=== Latency: 7s ===
Error: Specified qdisc not found.

rnapoles-rh avatar Aug 05 '22 20:08 rnapoles-rh

What does this command return in your Fedora VM?

grep -i NETEM /boot/config-`uname -r`
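
If this prints CONFIG_NET_SCH_NETEM=m, netem is built as a loadable module and may simply not be loaded on the host yet. A minimal check, assuming modprobe is available on the host:

lsmod | grep sch_netem || sudo modprobe sch_netem   # load the netem scheduler module if it is missing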

rm3l avatar Aug 08 '22 11:08 rm3l

$ grep -i NETEM /boot/config-`uname -r`
CONFIG_NET_SCH_NETEM=m

rnapoles-rh avatar Aug 08 '22 12:08 rnapoles-rh

I will try in a Linux-based container.

rnapoles-rh avatar Aug 08 '22 12:08 rnapoles-rh

PR submitted

rnapoles-rh avatar Aug 31 '22 12:08 rnapoles-rh

We found that tc commands cannot be used in the IBM Cloud. Found https://github.com/Shopify/toxiproxy/tree/master/client, which can emulate latency and network disconnections; we will try using it.
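
A rough sketch of what that could look like using the toxiproxy CLI (the linked Go client exposes the same operations programmatically); the proxy name, ports, and API server host are placeholders, and the flag spellings are from memory of the toxiproxy docs and may need adjusting:

toxiproxy-cli create -l localhost:8443 -u <API_SERVER_HOST>:6443 k8s_api   # route API traffic through toxiproxy
toxiproxy-cli toxic add -t latency -a latency=1000 k8s_api                 # add 1 s of latency
toxiproxy-cli toggle k8s_api                                               # disable the proxy to simulate a disconnection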

rnapoles-rh avatar Sep 15 '22 13:09 rnapoles-rh

Armel found https://github.com/jamesmoriarty/goforward, which can be used with kubectl config set-cluster kind-local-k8s-cluster --proxy-url=http://localhost:8888/ to route cluster traffic through a local proxy. We have to run the recoverability tests separately from the other tests so that the added latency does not impact them. The challenge is that we need to start the tests without latency, then add latency, and then remove it again, which implies making changes to the kubeconfig file. Anand will take over this issue; he will try to create helper functions to achieve the above (a rough sketch is included below).
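
A rough sketch of the idea (the goforward invocation and the unset step are assumptions; the cluster name and port come from the command above):

goforward &   # assumed to listen on localhost:8888, as referenced above
kubectl config set-cluster kind-local-k8s-cluster --proxy-url=http://localhost:8888/
# ... run the recoverability scenario; stopping goforward now should look like a lost connection ...
kubectl config unset clusters.kind-local-k8s-cluster.proxy-url   # restore the original kubeconfig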

rnapoles-rh avatar Oct 03 '22 12:10 rnapoles-rh