kuttl assert --timeout can be extremely inaccurate
What happened:
I ran the following command on an assert file containing 5 objects (2 custom resources and 3 Deployments):
$ kubectl-kuttl assert --namespace $NAMESPACE --timeout 300 $assert_file
It took 8.5h to exit!!! In the process it printed the following:
I0910 12:34:59.593248 31598 request.go:655] Throttling request took 1.049689212s, request: GET:https://api.mo-09-10a.openshift.infra.rox.systems:6443/apis/events.k8s.io/v1beta1?timeout=32s
I0910 12:35:09.780559 31598 request.go:655] Throttling request took 1.048845467s, request: GET:https://api.mo-09-10a.openshift.infra.rox.systems:6443/apis/oauth.openshift.io/v1?timeout=32s
I0910 12:35:20.763484 31598 request.go:655] Throttling request took 1.001150203s, request: GET:https://api.mo-09-10a.openshift.infra.rox.systems:6443/apis/storage.k8s.io/v1?timeout=32s
[... snip some lines]
I0910 12:42:25.478841 31598 request.go:655] Throttling request took 1.348574591s, request: GET:https://api.mo-09-10a.openshift.infra.rox.systems:6443/apis/route.openshift.io/v1?timeout=32s
E0910 12:46:39.965965 31598 request.go:1011] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)
E0910 12:46:39.966218 31598 request.go:1011] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)
E0910 12:46:39.965456 31598 request.go:1011] Unexpected error when reading response body: context deadline exceeded
E0910 12:46:39.966595 31598 request.go:1011] Unexpected error when reading response body: context deadline exceeded (Client.Timeout or context cancellation while reading body)
E0910 12:46:39.966596 31598 request.go:1011] Unexpected error when reading response body: context deadline exceeded (Client.Timeout or context cancellation while reading body)
I0910 12:46:54.735503 31598 request.go:655] Throttling request took 1.000308781s, request: GET:https://api.mo-09-10a.openshift.infra.rox.systems:6443/apis/operators.coreos.com/v1?timeout=32s
I0910 12:47:04.771207 31598 request.go:655] Throttling request took 1.300070296s, request: GET:https://api.mo-09-10a.openshift.infra.rox.systems:6443/apis/coordination.k8s.io/v1beta1?timeout=32s
[... snip lots of lines]
I0910 18:58:46.273936 31598 request.go:655] Throttling request took 1.04966279s, request: GET:https://api.mo-09-10a.openshift.infra.rox.systems:6443/apis/packages.operators.coreos.com/v1?timeout=32s
I0910 18:58:56.324659 31598 request.go:655] Throttling request took 1.302218545s, request: GET:https://api.mo-09-10a.openshift.infra.rox.systems:6443/apis/packages.operators.coreos.com/v1?timeout=32s
fatal error getting client: Get "https://api.mo-09-10a.openshift.infra.rox.systems:6443/api?timeout=32s": dial tcp: lookup api.mo-09-10a.openshift.infra.rox.systems: no such host
fatal error getting client: Get "https://api.mo-09-10a.openshift.infra.rox.systems:6443/api?timeout=32s": dial tcp: lookup api.mo-09-10a.openshift.infra.rox.systems: no such host
fatal error getting client: Get "https://api.mo-09-10a.openshift.infra.rox.systems:6443/api?timeout=32s": dial tcp: lookup api.mo-09-10a.openshift.infra.rox.systems: no such host
fatal error getting client: Get "https://api.mo-09-10a.openshift.infra.rox.systems:6443/api?timeout=32s": dial tcp: lookup api.mo-09-10a.openshift.infra.rox.systems: no such host
fatal error getting client: Get "https://api.mo-09-10a.openshift.infra.rox.systems:6443/api?timeout=32s": dial tcp: lookup api.mo-09-10a.openshift.infra.rox.systems: no such host
Error: asserts not valid
The command exited at about 20:03.
What you expected to happen:
The command should exit in about five minutes.
How to reproduce it (as minimally and precisely as possible):
Run the command on an assert file with a bunch of objects.
Anything else we need to know?:
I took a look at the implementation and it seems to assume that the CheckResource() call time is negligible, and simply iterates --timeout times, sleeping one second in between.
I did some printf-style debugging post-facto, and in my experiments found that across the Atlantic, with an assert file with 5 objects, each iteration of the assert loop takes about 17.5 seconds (3 seconds per CheckResource() plus the sleep). This is already an order of magnitude more than expected. However as shown by the extreme initial example, sometimes this can be much longer.
Environment:
- Kubernetes version (use
kubectl version):
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0+9689d22", GitCommit:"9689d22dc3121e14688bbb057e3a09efb6cb66aa", GitTreeState:"clean", BuildDate:"2021-08-27T16:53:43Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}
- KUTTL version (use
kubectl kuttl version): 0.11.1 - Cloud provider or hardware configuration: OpenShift 4 on GCP