k6-operator
k6-operator manager freezes and doesn't recover after reaching timeout in initialize phase
Hi! It seems that I've hit an annoying issue here...
- I'm building the k6 operator from master (commit https://github.com/grafana/k6-operator/commit/c63b127b53ca18ddf98f1648e25aefb785efe207)
How to reproduce:
- create a simple K6 custom resource for a test job (anything, any duration... it doesn't have any impact; see the example manifest after this list)
- apply it on a Kubernetes cluster running the k6 operator
- a myjob-initializer-12345 pod should be spawned immediately, starting to download the job worker container image
- force image pulling to be very long (huge image, slow network, slow registry, ...) or force the controller timeout threshold to be very short
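For reference, a minimal sketch of such a K6 resource, assuming the v1alpha1 CRD at the commit above (the ConfigMap name and parallelism are illustrative; the stages and image come from the initializer command in the logs below):

apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: myjob
  namespace: k6-operator-system
spec:
  parallelism: 2                  # illustrative value
  script:
    configMap:
      name: myjob-script          # hypothetical ConfigMap holding job.js
      file: job.js
  arguments: --stage 60s:2 --stage 60s:4 --stage 120s:4 --stage 30s:0
  runner:
    image: private.registry.foobar.local/public/load-tests-bundle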
Controller log "init & before timeout":
2022-07-09T08:30:53.878Z INFO controllers.K6 Reconcile(); stage = {"k6": "k6-operator-system/myjob"}
2022-07-09T08:30:53.879Z INFO controllers.K6 Initialize test {"k6": "k6-operator-system/myjob"}
2022-07-09T08:30:53.879Z INFO controllers.K6 Changing stage of K6 status to initialization {"k6": "k6-operator-system/myjob"}
2022-07-09T08:30:54.372Z INFO controllers.K6 Initializer job is ready to start with image `private.registry.foobar.local/public/load-tests-bundle` and command `[sh -c k6 archive /k6/job.js -O ./job.js.archived.tar --stage 60s:2 --stage 60s:4 --stage 120s:4 --stage 30s:0 && k6 inspect --execution-requirements ./job.js.archived.tar]` {"k6": "k6-operator-system/myjob"}
2022-07-09T08:30:56.182Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "k6-operator-system/myjob"}
2022-07-09T08:31:01.184Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "k6-operator-system/myjob"}
2022-07-09T08:31:06.185Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "k6-operator-system/k6-test-
[ ... ]
2022-07-09T08:31:56.194Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "k6-operator-system/myjob"}
2022-07-09T08:31:56.276Z ERROR controllers.K6 Failed to initialize the script {"k6": "k6-operator-system/myjob", "error": "timed out waiting for the condition"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
github.com/grafana/k6-operator/controllers.InitializeJobs
/workspace/controllers/k6_initialize.go:152
github.com/grafana/k6-operator/controllers.(*K6Reconciler).Reconcile
/workspace/controllers/k6_controller.go:63
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:209
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:188
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
2022-07-09T08:31:56.376Z ERROR controller Reconciler error {"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "myjob", "namespace": "k6-operator-system", "error": "timed out waiting for the condition"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:237
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:209
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:188
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
After hitting the timeout, the k6 controller retries reconciling the K6 resource several times... so the resource seems initialized, but nothing happens next (which is not the expected behaviour):
2022-07-09T08:31:57.382Z INFO controllers.K6 Reconcile(); stage = initialization {"k6": "k6-operator-system/myjob"}
2022-07-09T08:31:57.382Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "myjob", "namespace": "k6-operator-system"}
2022-07-09T08:32:47.298Z INFO controllers.K6 Reconcile(); stage = initialization {"k6": "k6-operator-system/myjob"}
2022-07-09T08:32:47.299Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "myjob", "namespace": "k6-operator-system"}
2022-07-09T08:32:49.295Z INFO controllers.K6 Reconcile(); stage = initialization {"k6": "k6-operator-system/myjob"}
2022-07-09T08:32:49.371Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "myjob", "namespace": "k6-operator-system"}
2022-07-09T08:32:51.275Z INFO controllers.K6 Reconcile(); stage = initialization {"k6": "k6-operator-system/myjob"}
2022-07-09T08:32:51.275Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "myjob", "namespace": "k6-operator-system"}
And that's all... no more logs. The job is stalling (the initializer log output seems correct, though).
> kubectl get pods -n k6-operator-system
NAME READY STATUS RESTARTS AGE
k6-operator-controller-manager-6889f4fc-jnrwc 1/1 Running 0 17m
myjob-initializer-zb89p 0/1 Completed 0 15m
If you try to remove the K6 resource, it works. If you try to add the same resource again after that, everything works as expected since the (big) image is now already loaded on the only worker node I'm working against.
2022-07-09T08:47:33.271Z INFO controllers.K6 Request deleted. Nothing to reconcile. {"k6": "k6-operator-system/myjob"}
Analysis: It seems that the operator workflow is not really idempotent, so it's not really able to recover from any previous stage or step if the first reconciliation fails for any reason.
In addition, the state stored in the custom resource is not really helpful for understanding what's going on. Maybe consider using the official Kubernetes metav1.Condition to store states and transitions directly inside the live custom resource object.
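For illustration only, a metav1.Condition-based status on the K6 object might then look something like this (the condition type and reason names are hypothetical):

status:
  conditions:
  - type: TestRunInitialized      # hypothetical condition type
    status: "False"
    reason: InitializerTimeout    # hypothetical reason
    message: timed out waiting for the initializer pod to finish
    lastTransitionTime: "2022-07-09T08:31:56Z"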
thanks 👍
@BarthV, thanks for the detailed report :+1: I don't remember observing a scenario exactly like yours before, but there are definitely issues with the state right now. IME, the nastiest thing is that the state is not always updated even though the transition has de facto happened, but IIRC I have also seen a couple of cases of "frozen" state. So your case sounds very similar. In the case of the initializer, there is indeed a timeout here:
https://github.com/grafana/k6-operator/blob/c63b127b53ca18ddf98f1648e25aefb785efe207/controllers/k6_initialize.go#L64
but since the state hasn't changed, the function should repeat until it succeeds. The logic could probably be improved further, and maybe the timeout increased, but the fact that you don't see the Waiting... and other messages repeat in your logs is surely weird.
One question from my side: what image size do you have here, and how long does it take to load from scratch?
I think both the freezing and the lack of state transitions might be manifestations of the same problem. Currently, I suspect there is some kind of hard-to-detect race in the state update cycle. IOW, improving state transitions is planned work for the operator :joy: Btw, thanks for the metav1.Condition pointer - I'll look into it.
Let's keep this open as a bug for now: once there are improvements in how state is handled, it makes sense to re-check this scenario as well.
My k6 image is about 450MB... mainly because I'm embedding Chromium for browser automation using xk6-browser. The real root cause on our side is our cloud provider's really poor S3 performance over the last few weeks, which slows down our internal container registry badly.
Lately we too have noticed that the k6-operator fails to start the load test and only generates the starter. We found that restarting the operator pod fixes this, but it's starting to become a problem for our workloads. The logs are almost identical to the above.
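For reference, the restart is just a rollout restart of the manager deployment shown in the pod listing above (assuming the default install):
> kubectl rollout restart deployment k6-operator-controller-manager -n k6-operator-system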
It seems the last update completely broke our ability to do multiple load tests at once. Any way we can revert to an older release?
2022-07-26T17:57:45.214Z ERROR controllers.K6 Failed to start all jobs {"k6": "k6-workloads/jkidd-prod", "error": "timed out waiting for the condition"}
github.com/go-logr/zapr.(*zapLogger).Error
Hello @BarthV
Are you running the setup in EKS with Fargate or GKE autopilot by any chance?
No, I'm running self-managed Kubernetes clusters backed by a French CSP's VM infrastructure.
Hi @BarthV.
Are you using your own custom k6 image that has xk6-browser and Chromium inside?
And do you then use that custom k6 image as the k6-operator's runner?
I was trying to run an xk6-browser test script on k8s via the k6-operator, but it fails because the runner uses k6 instead of xk6-browser to run the script.
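I assume that means setting the runner field in the K6 spec to the custom image, something like this (image name taken from the initializer log above):

runner:
  image: private.registry.foobar.local/public/load-tests-bundle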
Hello,
Trying to use EKS Fargate, I get the same behaviour:
2022-12-05T18:25:06.909Z INFO controllers.K6 Reconcile(); stage = {"k6": "load-testing/load-test"}
2022-12-05T18:25:06.909Z INFO controllers.K6 Initialize test {"k6": "load-testing/load-test"}
2022-12-05T18:25:06.909Z INFO controllers.K6 Changing stage of K6 status to initialization {"k6": "load-testing/load-test"}
2022-12-05T18:25:06.920Z INFO controllers.K6 Initializer job is ready to start with image `ghcr.io/grafana/operator:latest-runner` and command `[sh -c k6 archive --log-output=none /test/load-test.js -O ./load-test.js.archived.tar && k6 inspect --execution-requirements --log-output=none ./load-test.js.archived.tar]` {"k6": "load-testing/load-test"}
2022-12-05T18:25:06.933Z INFO controllers.K6 No initializing pod found yet {"k6": "load-testing/load-test"}
2022-12-05T18:25:11.933Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "load-testing/load-test"}
2022-12-05T18:25:16.933Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "load-testing/load-test"}
2022-12-05T18:25:21.934Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "load-testing/load-test"}
2022-12-05T18:25:26.933Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "load-testing/load-test"}
2022-12-05T18:25:31.934Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "load-testing/load-test"}
2022-12-05T18:25:36.934Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "load-testing/load-test"}
2022-12-05T18:25:41.934Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "load-testing/load-test"}
2022-12-05T18:25:46.934Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "load-testing/load-test"}
2022-12-05T18:25:51.933Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "load-testing/load-test"}
2022-12-05T18:25:56.933Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "load-testing/load-test"}
2022-12-05T18:26:01.934Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "load-testing/load-test"}
2022-12-05T18:26:06.934Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "load-testing/load-test"}
2022-12-05T18:26:06.934Z INFO controllers.K6 Waiting for initializing pod to finish {"k6": "load-testing/load-test"}
2022-12-05T18:26:06.934Z ERROR controllers.K6 Failed to initialize the script {"k6": "load-testing/load-test", "error": "timed out waiting for the condition"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
github.com/grafana/k6-operator/controllers.InitializeJobs
/workspace/controllers/k6_initialize.go:153
github.com/grafana/k6-operator/controllers.(*K6Reconciler).Reconcile
/workspace/controllers/k6_controller.go:63
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:209
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:188
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
2022-12-05T18:26:06.934Z ERROR controller Reconciler error {"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "load-test", "namespace": "load-testing", "error": "timed out waiting for the condition"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:237
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:209
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:188
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
2022-12-05T18:26:07.935Z INFO controllers.K6 Reconcile(); stage = initialization {"k6": "load-testing/load-test"}
2022-12-05T18:26:07.935Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "load-test", "namespace": "load-testing"}
2022-12-05T18:26:11.059Z INFO controllers.K6 Reconcile(); stage = initialization {"k6": "load-testing/load-test"}
2022-12-05T18:26:11.059Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "load-test", "namespace": "load-testing"}
2022-12-05T18:26:35.114Z INFO controllers.K6 Reconcile(); stage = initialization {"k6": "load-testing/load-test"}
2022-12-05T18:26:35.114Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "load-test", "namespace": "load-testing"}
2022-12-05T18:26:37.103Z INFO controllers.K6 Reconcile(); stage = initialization {"k6": "load-testing/load-test"}
2022-12-05T18:26:37.103Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "k6.io", "reconcilerKind": "K6", "controller": "k6", "name": "load-test", "namespace": "load-testing"}
Same issue: with Karpenter the time to start the initializer pod is longer, so the timeout is hit first and no tests are submitted.
I think this issue should be resolved since v0.0.9rc2: that's when the timeout via PollImmediate was removed.
@BarthV could you confirm if your case is resolved with the latest versions (v0.0.9rc2 or v0.0.9rc3 or just latest)?