e2e test has intermittent failures
What steps did you take and what happened:
$ make docker-build-manager docker-build-eraser OUTPUT_TYPE="type=image"
$ for i in {0..5}; do make e2e-test; done
What happened is that the tests failed intermittently.
What did you expect to happen: I expected deterministic behavior, with the tests either passing every time or failing every time.
Anything else you would like to add:
Environment:
- Eraser version: v0.1.0
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:58:47Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-08T00:29:07Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"linux/amd64"}
The failure I experienced was eraser_test.go:254: timeout waiting for images to be cleaned.
It could be that the process simply needs a longer timeout, or it could be that something else is going wrong and the timeout is only a symptom. More detailed logging about the state of the program at timeout would help distinguish the two.
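For example, when the image-cleanup wait times out, the test could dump the state of the pods in the eraser namespace before failing. A minimal sketch, assuming the e2e-framework client the tests already use (dumpPods is a name I made up):

import (
    "context"
    "testing"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/e2e-framework/klient"
)

// dumpPods logs the name and phase of every pod in the eraser
// namespace, so a timeout failure leaves a record of what the
// cluster looked like at that moment.
func dumpPods(ctx context.Context, t *testing.T, client klient.Client) {
    var pods corev1.PodList
    if err := client.Resources("eraser-system").List(ctx, &pods); err != nil {
        t.Logf("could not list pods: %v", err)
        return
    }
    for i := range pods.Items {
        p := &pods.Items[i]
        t.Logf("pod %s: phase=%s reason=%s", p.Name, p.Status.Phase, p.Status.Reason)
    }
}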
After I ran the tests, I got the following:
$ kubectl logs -n eraser-system eraser-eraser-e2e-test-worker2 | jq
{
"level": "info",
"ts": 1650037176.5409193,
"logger": "eraser",
"msg": "Image is not on node",
"image": "sha256:2834dc507516af02784808c5f48b7cbe38b8ed5d0f4837f16e78d00deb7e7767"
}
{
"level": "info",
"ts": 1650037176.666359,
"logger": "eraser",
"msg": "Removed",
"image": "docker.io/library/nginx:latest"
}
{
"level": "info",
"ts": 1650037176.6663945,
"logger": "eraser",
"msg": "Image is not on node",
"image": "nginx"
}
$ docker exec eraser-e2e-test-worker2 ctr -n k8s.io images list | grep nginx
docker.io/library/nginx:latest application/vnd.docker.distribution.manifest.list.v2+json sha256:2275af0f20d71b293916f1958f8497f987b8d8fd8113df54635f2a5915002bf1 54.1 MiB linux/386,linux/amd64,linux/arm/v5,linux/arm/v7,linux/arm64/v8,linux/mips64le,linux/ppc64le,linux/s390x io.cri-containerd.image=managed
docker.io/library/nginx@sha256:2275af0f20d71b293916f1958f8497f987b8d8fd8113df54635f2a5915002bf1 application/vnd.docker.distribution.manifest.list.v2+json sha256:2275af0f20d71b293916f1958f8497f987b8d8fd8113df54635f2a5915002bf1 54.1 MiB linux/386,linux/amd64,linux/arm/v5,linux/arm/v7,linux/arm64/v8,linux/mips64le,linux/ppc64le,linux/s390x io.cri-containerd.image=managed
This leads me to believe the failure is caused by #130.
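If #130 is the culprit, a plausible mechanism is that the image survives under a second reference: the ctr output above lists it by both tag and digest, so a removal check that matches only one kind of reference can report "Image is not on node" while the other reference keeps it present. A hypothetical sketch of a check covering all three reference forms, using the CRI image type (isImageOnNode is my name, not eraser's):

import runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"

// isImageOnNode reports whether target matches any reference held by
// img: its ID, a repo tag, or a repo digest. Checking tags alone
// would miss an image that is still present under its digest.
func isImageOnNode(img *runtimeapi.Image, target string) bool {
    if img.Id == target {
        return true
    }
    for _, tag := range img.RepoTags {
        if tag == target {
            return true
        }
    }
    for _, digest := range img.RepoDigests {
        if digest == target {
            return true
        }
    }
    return false
}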
In addition, the tests do not always remove the imagejob object during teardown, which may be causing repeated failures: once the tests have failed, they tend to keep failing until the imagejob is removed.
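A more defensive teardown could list and delete any leftover imagejobs explicitly, then block until they are really gone. A sketch, handling imagejobs as unstructured objects under the eraser.sh/v1alpha1 group used by the test manifests (cleanupImageJobs is a hypothetical helper):

import (
    "context"
    "testing"
    "time"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "sigs.k8s.io/e2e-framework/klient"
    "sigs.k8s.io/e2e-framework/klient/wait"
    "sigs.k8s.io/e2e-framework/klient/wait/conditions"
)

func cleanupImageJobs(ctx context.Context, t *testing.T, client klient.Client) {
    var jobs unstructured.UnstructuredList
    jobs.SetGroupVersionKind(schema.GroupVersionKind{
        Group: "eraser.sh", Version: "v1alpha1", Kind: "ImageJobList",
    })
    if err := client.Resources().List(ctx, &jobs); err != nil {
        t.Logf("could not list imagejobs: %v", err)
        return
    }
    for i := range jobs.Items {
        if err := client.Resources().Delete(ctx, &jobs.Items[i]); err != nil {
            t.Logf("could not delete imagejob %s: %v", jobs.Items[i].GetName(), err)
        }
    }
    // Block until the API server confirms deletion so the next test
    // does not race against a leftover job.
    if err := wait.For(conditions.New(client.Resources()).ResourcesDeleted(&jobs), wait.WithTimeout(time.Minute)); err != nil {
        t.Logf("error waiting for imagejob deletion: %v", err)
    }
}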
I just hit the failure again:
{
"level": "info",
"ts": 1650043460.1393516,
"logger": "eraser",
"msg": "Image is not on node",
"image": "sha256:2834dc507516af02784808c5f48b7cbe38b8ed5d0f4837f16e78d00deb7e7767"
}
{
"level": "info",
"ts": 1650043460.1393642,
"logger": "eraser",
"msg": "Image is running",
"image": "docker.io/library/nginx:latest"
}
{
"level": "info",
"ts": 1650043460.1393712,
"logger": "eraser",
"msg": "Image is not on node",
"image": "nginx"
}
This suggests that the wait.For calls are not sufficient to establish that the cluster is ready for the eraser config to be deployed:
if err := wait.For(conditions.New(client.Resources()).ResourceDeleted(dep), wait.WithTimeout(time.Minute*1)); err != nil {
    // Let's not mark this as an error.
    // We only have this to prevent race conditions with the eraser spinning up.
    t.Logf("error while waiting for deployment deletion: %v", err)
}

if err := wait.For(conditions.New(client.Resources()).ResourcesDeleted(&pods), wait.WithTimeout(time.Minute)); err != nil {
    // Same as above; we aren't really interested in this error except for debugging problems later on.
    // We are only waiting for these pods so we don't hit race conditions with the eraser pod.
    t.Logf("error waiting for pods to be deleted: %v", err)
}
I can confirm that the wait is insufficient: if I put a time.Sleep between the wait.For calls and the eraser deployment, the issue goes away.
Obviously, time.Sleep is not the right solution, so we will need to change how we confirm that the pods are deleted.
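One possible gap: ResourcesDeleted only watches the pod objects captured in the earlier List call, so pods created between that List and the wait may never be waited on. Polling with a fresh List on every iteration would close that gap. A sketch using apimachinery's poller (waitForNoPods and the selector parameter are placeholders):

import (
    "context"
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "sigs.k8s.io/e2e-framework/klient"
    "sigs.k8s.io/e2e-framework/klient/k8s/resources"
)

// waitForNoPods re-lists on every poll, so pods created after the
// initial snapshot are still accounted for before we declare the
// namespace quiet.
func waitForNoPods(ctx context.Context, client klient.Client, namespace, selector string) error {
    return wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
        var pods corev1.PodList
        if err := client.Resources(namespace).List(ctx, &pods, resources.WithLabelSelector(selector)); err != nil {
            return false, err
        }
        return len(pods.Items) == 0, nil
    })
}

Deploying the eraser config only after something like this returns nil would replace the fixed sleep with an actual condition on cluster state.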
There are still intermittent timeouts in the CI. The problem is much better since #143, but there are still occasional instances where a timeout occurs with no discernible cause.
/assign @eedorenko
@helayoty We haven't had any failures in the CI in a couple of weeks, so I'm going to close this. The tests are stable enough that they don't get in the way, at least for now.
We just had another one in the CI for PR #190. It appears unrelated to any changes in that PR. Here are the logs:
CGO_ENABLED=0 IMAGE=ghcr.io/azure/eraser:v0.1.0 MANAGER_IMAGE=ghcr.io/azure/eraser-manager:v0.1.0 NODE_VERSION=kindest/node:v1.23.0 go test -count=1 -timeout=1200s -tags=e2e -v ./test/e2e
I0518 18:08:05.592171 13668 kubectl.go:94] kubectl apply --kubeconfig=/tmp/kind-cluser-eraser-e2e-test-kubecfg422630510 --namespace=eraser-system -f /home/runner/work/eraser/eraser/manifest_staging/deploy/eraser.yaml
=== RUN TestRemoveImagesFromAllNodes
=== RUN TestRemoveImagesFromAllNodes/Test_Remove_Image_From_All_Nodes
=== RUN TestRemoveImagesFromAllNodes/Test_Remove_Image_From_All_Nodes/deployment_successfully_deployed
=== RUN TestRemoveImagesFromAllNodes/Test_Remove_Image_From_All_Nodes/Images_successfully_deleted_from_all_nodes
I0518 18:08:43.968305 13668 kubectl.go:94] kubectl apply --kubeconfig=/tmp/kind-cluser-eraser-e2e-test-kubecfg422630510 --namespace=eraser-system -f /home/runner/work/eraser/eraser/test/e2e/test-data/eraser_v1alpha1_imagelist.yaml
eraser_test.go:102: timeout waiting for images to be cleaned
=== RUN TestRemoveImagesFromAllNodes/Test_Remove_Image_From_All_Nodes/Pods_from_imagejobs_are_cleaned_up
eraser_test.go:122: error waiting for pods to be deleted: timed out waiting for the condition
I0518 18:10:46.308743 13668 kubectl.go:94] kubectl delete --kubeconfig=/tmp/kind-cluser-eraser-e2e-test-kubecfg422630510 --namespace=eraser-system -f /home/runner/work/eraser/eraser/test/e2e/test-data/eraser_v1alpha1_imagelist.yaml
I0518 18:10:46.396133 13668 kubectl.go:94] kubectl delete --kubeconfig=/tmp/kind-cluser-eraser-e2e-test-kubecfg422630510 --namespace=eraser-system imagejob --all
I0518 18:10:46.650608 13668 kubectl.go:94] kubectl delete --kubeconfig=/tmp/kind-cluser-eraser-e2e-test-kubecfg422630510 --namespace=eraser-system imagelist --all
=== RUN TestRemoveImagesFromAllNodes/Remove_all_non-running_images_from_cluster
=== RUN TestRemoveImagesFromAllNodes/Remove_all_non-running_images_from_cluster/Deployments_successfully_deployed
=== RUN TestRemoveImagesFromAllNodes/Remove_all_non-running_images_from_cluster/Remove_some_of_the_deployments_so_the_images_aren't_running
=== RUN TestRemoveImagesFromAllNodes/Remove_all_non-running_images_from_cluster/All_non-running_images_are_removed_from_the_cluster
eraser_test.go:265: timeout waiting for images to be cleaned
eraser_test.go:269: timeout waiting for images to be cleaned
I0518 18:17:24.685032 13668 kubectl.go:94] kubectl delete --kubeconfig=/tmp/kind-cluser-eraser-e2e-test-kubecfg422630510 --namespace=eraser-system imagejob --all
I0518 18:17:24.907566 13668 kubectl.go:94] kubectl delete --kubeconfig=/tmp/kind-cluser-eraser-e2e-test-kubecfg422630510 --namespace=eraser-system imagelist --all
=== RUN TestRemoveImagesFromAllNodes/Test_Updating_ImageList_to_Reconcile
=== RUN TestRemoveImagesFromAllNodes/Test_Updating_ImageList_to_Reconcile/Deployments_successfully_deployed
=== RUN TestRemoveImagesFromAllNodes/Test_Updating_ImageList_to_Reconcile/Remove_deployments_so_the_images_aren't_running
=== RUN TestRemoveImagesFromAllNodes/Test_Updating_ImageList_to_Reconcile/Deploy_imagelist_to_remove_nginx
I0518 18:18:31.220444 13668 kubectl.go:94] kubectl apply --kubeconfig=/tmp/kind-cluser-eraser-e2e-test-kubecfg422630510 --namespace=eraser-system -f /home/runner/work/eraser/eraser/test/e2e/test-data/eraser_v1alpha1_imagelist.yaml
Error: The operation was canceled.
The most consistent failure at this point appears to be the imagelist_alias test.