
e2e test has intermittent failures

pmengelbert opened this issue 3 years ago • 9 comments

What steps did you take and what happened:

$ make docker-build-manager docker-build-eraser OUTPUT_TYPE="type=image"
$ for i in {0..5}; do make e2e-test; done

The tests failed intermittently.

What did you expect to happen: I expected the tests to pass every time, or to fail every time.

Anything else you would like to add:

Environment:

  • Eraser version: v0.1.0
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:58:47Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-08T00:29:07Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"linux/amd64"}

pmengelbert commented Apr 12 '22 16:04

The failure I experienced was eraser_test.go:254: timeout waiting for images to be cleaned.

It could be that the process simply needs a longer timeout, or that something else is going wrong and causing the timeout. More detailed logging about the program's state at the point of timeout would help narrow this down.
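
One way to get that logging would be to dump whatever images are still on each kind node when the wait times out, by exec'ing ctr on the node from the test. A rough sketch of the idea (dumpNodeImages and kindNodeNames are hypothetical names, not existing helpers in the test):

package e2e

import (
	"os/exec"
	"testing"
)

// dumpNodeImages logs the images still present on each kind node, so a timeout
// failure also shows what the eraser pods left behind.
func dumpNodeImages(t *testing.T, kindNodeNames []string) {
	for _, node := range kindNodeNames {
		out, err := exec.Command("docker", "exec", node, "ctr", "-n", "k8s.io", "images", "list").CombinedOutput()
		if err != nil {
			t.Logf("could not list images on node %s: %v", node, err)
			continue
		}
		t.Logf("images still on node %s at timeout:\n%s", node, out)
	}
}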

pmengelbert commented Apr 13 '22 17:04

After I ran the tests, I got the following:

$ kubectl logs -n eraser-system eraser-eraser-e2e-test-worker2 |jq
{
  "level": "info",
  "ts": 1650037176.5409193,
  "logger": "eraser",
  "msg": "Image is not on node",
  "image": "sha256:2834dc507516af02784808c5f48b7cbe38b8ed5d0f4837f16e78d00deb7e7767"
}
{
  "level": "info",
  "ts": 1650037176.666359,
  "logger": "eraser",
  "msg": "Removed",
  "image": "docker.io/library/nginx:latest"
}
{
  "level": "info",
  "ts": 1650037176.6663945,
  "logger": "eraser",
  "msg": "Image is not on node",
  "image": "nginx"
}

$ docker exec eraser-e2e-test-worker2 ctr -n k8s.io images list  | grep nginx
docker.io/library/nginx:latest                                                                  application/vnd.docker.distribution.manifest.list.v2+json sha256:2275af0f20d71b293916f1958f8497f987b8d8fd8113df54635f2a5915002bf1 54.1 MiB  linux/386,linux/amd64,linux/arm/v5,linux/arm/v7,linux/arm64/v8,linux/mips64le,linux/ppc64le,linux/s390x io.cri-containerd.image=managed
docker.io/library/nginx@sha256:2275af0f20d71b293916f1958f8497f987b8d8fd8113df54635f2a5915002bf1 application/vnd.docker.distribution.manifest.list.v2+json sha256:2275af0f20d71b293916f1958f8497f987b8d8fd8113df54635f2a5915002bf1 54.1 MiB  linux/386,linux/amd64,linux/arm/v5,linux/arm/v7,linux/arm64/v8,linux/mips64le,linux/ppc64le,linux/s390x io.cri-containerd.image=managed

This leads me to believe the failure is caused by #130.
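
Assuming #130 is about image aliases, the runtime knows the same image under several references at once (an image ID, repo tags, and repo digests), so a check or removal keyed on a single reference can report success while another reference keeps the image on the node. Below is a minimal sketch of what a complete presence check would look like against the CRI image service; this is illustrative only, not eraser's actual code, and assumes the v1alpha2 CRI API:

package imagecheck // hypothetical package, for illustration only

import (
	"context"

	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

// imageOnNode reports whether any image on the node is still reachable by the
// given reference, matching the image ID, repo tags, and repo digests rather
// than a single name.
func imageOnNode(ctx context.Context, c runtimeapi.ImageServiceClient, ref string) (bool, error) {
	resp, err := c.ListImages(ctx, &runtimeapi.ListImagesRequest{})
	if err != nil {
		return false, err
	}
	for _, img := range resp.Images {
		// e.g. RepoTags:    [docker.io/library/nginx:latest]
		//      RepoDigests: [docker.io/library/nginx@sha256:2275af...]
		if img.Id == ref {
			return true, nil
		}
		for _, name := range append(img.RepoTags, img.RepoDigests...) {
			if name == ref {
				return true, nil
			}
		}
	}
	return false, nil
}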

pmengelbert commented Apr 15 '22 15:04

In addition, the teardown does not always remove the imagejob object, which may be causing repeated failures: once the tests fail, they tend to keep failing until the imagejob is removed.
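
As a stopgap, the stale job can be cleared by hand between runs:

$ kubectl delete -n eraser-system imagejob --all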

pmengelbert commented Apr 15 '22 16:04

I just got the failure:

{
  "level": "info",
  "ts": 1650043460.1393516,
  "logger": "eraser",
  "msg": "Image is not on node",
  "image": "sha256:2834dc507516af02784808c5f48b7cbe38b8ed5d0f4837f16e78d00deb7e7767"
}
{
  "level": "info",
  "ts": 1650043460.1393642,
  "logger": "eraser",
  "msg": "Image is running",
  "image": "docker.io/library/nginx:latest"
}
{
  "level": "info",
  "ts": 1650043460.1393712,
  "logger": "eraser",
  "msg": "Image is not on node",
  "image": "nginx"
}

This suggests that the wait.For calls are not sufficient to establish that the cluster is ready for the eraser config to be deployed:

			if err := wait.For(conditions.New(client.Resources()).ResourceDeleted(dep), wait.WithTimeout(time.Minute*1)); err != nil {
				// Let's not mark this as an error
				// We only have this to prevent race conditions with the eraser spinning up
				t.Logf("error while waiting for deployment deletion: %v", err)
			}
			if err := wait.For(conditions.New(client.Resources()).ResourcesDeleted(&pods), wait.WithTimeout(time.Minute)); err != nil {
				// Same as above, we aren't really interested in this error except for debugging problems later on.
				// We are only waiting for these pods so we don't hit race conditions with the eraser pod.
				t.Logf("error waiting for pods to be deleted: %v", err)
			}

pmengelbert commented Apr 15 '22 17:04

I can confirm that the wait is insufficient. If I put a time.Sleep between the wait.Fors and the eraser deployment, it seems to resolve the issue.

Obviously, time.Sleep is not the right solution, so we will need to change how we're confirming that the pods are deleted.
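
One possible shape for that check is to poll the API server until no pods from the deleted deployment remain, rather than sleeping for a fixed interval. This is a sketch only, not a finished fix; it assumes a client-go clientset and an app=nginx label selector on the test deployment, with apimachinery's wait package imported as apiwait so it doesn't clash with the e2e-framework wait used above:

			// Poll until the pod list for the deleted deployment is actually empty.
			err := apiwait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
				remaining, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
					LabelSelector: "app=nginx",
				})
				if err != nil {
					return false, err
				}
				return len(remaining.Items) == 0, nil
			})
			if err != nil {
				t.Logf("pods still present after polling: %v", err)
			}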

pmengelbert commented Apr 15 '22 20:04

There are still intermittent timeouts in the CI. The problem is much better since #143, but there are still occasional instances where a timeout occurs without a discernible reason.

pmengelbert commented May 02 '22 19:05

/assign @eedorenko

helayoty commented May 16 '22 23:05

@helayoty We haven't had any failures in the CI in a couple of weeks, so I'm going to close this. The tests are stable enough that they don't get in the way, at least for now.

pmengelbert commented May 17 '22 22:05

We just had another one in the CI for PR #190. It appears unrelated to any changes in that PR. Here are the logs:

CGO_ENABLED=0 IMAGE=ghcr.io/azure/eraser:v0.1.0 MANAGER_IMAGE=ghcr.io/azure/eraser-manager:v0.1.0 NODE_VERSION=kindest/node:v1.23.0 go test -count=1 -timeout=1200s  -tags=e2e -v ./test/e2e
I0518 18:08:05.592171   13668 kubectl.go:94] kubectl apply --kubeconfig=/tmp/kind-cluser-eraser-e2e-test-kubecfg422630510 --namespace=eraser-system -f /home/runner/work/eraser/eraser/manifest_staging/deploy/eraser.yaml
=== RUN   TestRemoveImagesFromAllNodes
=== RUN   TestRemoveImagesFromAllNodes/Test_Remove_Image_From_All_Nodes
=== RUN   TestRemoveImagesFromAllNodes/Test_Remove_Image_From_All_Nodes/deployment_successfully_deployed
=== RUN   TestRemoveImagesFromAllNodes/Test_Remove_Image_From_All_Nodes/Images_successfully_deleted_from_all_nodes
I0518 18:08:43.968305   13668 kubectl.go:94] kubectl apply --kubeconfig=/tmp/kind-cluser-eraser-e2e-test-kubecfg422630510 --namespace=eraser-system -f /home/runner/work/eraser/eraser/test/e2e/test-data/eraser_v1alpha1_imagelist.yaml
    eraser_test.go:102: timeout waiting for images to be cleaned
=== RUN   TestRemoveImagesFromAllNodes/Test_Remove_Image_From_All_Nodes/Pods_from_imagejobs_are_cleaned_up
    eraser_test.go:122: error waiting for pods to be deleted: timed out waiting for the condition
I0518 18:10:46.308743   13668 kubectl.go:94] kubectl delete --kubeconfig=/tmp/kind-cluser-eraser-e2e-test-kubecfg422630510 --namespace=eraser-system -f /home/runner/work/eraser/eraser/test/e2e/test-data/eraser_v1alpha1_imagelist.yaml
I0518 18:10:46.396133   13668 kubectl.go:94] kubectl delete --kubeconfig=/tmp/kind-cluser-eraser-e2e-test-kubecfg422630510 --namespace=eraser-system imagejob --all
I0518 18:10:46.650608   13668 kubectl.go:94] kubectl delete --kubeconfig=/tmp/kind-cluser-eraser-e2e-test-kubecfg422630510 --namespace=eraser-system imagelist --all
=== RUN   TestRemoveImagesFromAllNodes/Remove_all_non-running_images_from_cluster
=== RUN   TestRemoveImagesFromAllNodes/Remove_all_non-running_images_from_cluster/Deployments_successfully_deployed
=== RUN   TestRemoveImagesFromAllNodes/Remove_all_non-running_images_from_cluster/Remove_some_of_the_deployments_so_the_images_aren't_running
=== RUN   TestRemoveImagesFromAllNodes/Remove_all_non-running_images_from_cluster/All_non-running_images_are_removed_from_the_cluster
    eraser_test.go:265: timeout waiting for images to be cleaned
    eraser_test.go:269: timeout waiting for images to be cleaned
I0518 18:17:24.685032   13668 kubectl.go:94] kubectl delete --kubeconfig=/tmp/kind-cluser-eraser-e2e-test-kubecfg422630510 --namespace=eraser-system imagejob --all
I0518 18:17:24.907566   13668 kubectl.go:94] kubectl delete --kubeconfig=/tmp/kind-cluser-eraser-e2e-test-kubecfg422630510 --namespace=eraser-system imagelist --all
=== RUN   TestRemoveImagesFromAllNodes/Test_Updating_ImageList_to_Reconcile
=== RUN   TestRemoveImagesFromAllNodes/Test_Updating_ImageList_to_Reconcile/Deployments_successfully_deployed
=== RUN   TestRemoveImagesFromAllNodes/Test_Updating_ImageList_to_Reconcile/Remove_deployments_so_the_images_aren't_running
=== RUN   TestRemoveImagesFromAllNodes/Test_Updating_ImageList_to_Reconcile/Deploy_imagelist_to_remove_nginx
I0518 18:18:31.220444   13668 kubectl.go:94] kubectl apply --kubeconfig=/tmp/kind-cluser-eraser-e2e-test-kubecfg422630510 --namespace=eraser-system -f /home/runner/work/eraser/eraser/test/e2e/test-data/eraser_v1alpha1_imagelist.yaml
Error: The operation was canceled.

pmengelbert commented May 18 '22 21:05

The most consistent failure at this point appears to be the imagelist_alias test.

pmengelbert commented Dec 06 '22 16:12