K6 is stuck on stage: `initialization` if the init job fails
Brief summary
The operator creates the init job successfully, but if the pod fails for any reason, the operator doesn't notice and the K6 resource stays stuck in the `initialization` stage until you manually remove it.
I'm willing to fix it (or at least to try it xD)
k6-operator version or image
latest (sha256:79df77fea27ab5820ce3f25167268d5094be2fc10d182283fce9921e3786fed1)
K6 YAML
Something that produces an error on init job
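For illustration, a minimal sketch of a manifest that triggers this (the resource names, ConfigMap contents, and the missing `./lib.js` import are hypothetical): the script references a local file that isn't in the ConfigMap, so the initializer errors out while archiving the script.

```yaml
# Hypothetical example: test.js imports ./lib.js, which was never added to the
# "stress-test" ConfigMap, so the init job fails while archiving the script.
apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: k6-sample
spec:
  parallelism: 2
  script:
    configMap:
      name: stress-test   # assumed to contain only test.js
      file: test.js       # test.js does `import helper from "./lib.js"`
```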
Other environment details (if applicable)
No response
Steps to reproduce the problem
Deploy a K6 manifest that makes the init pod fail, for example by referencing a file that doesn't exist.
Expected behaviour
The stage of the K6 resource changes to reflect the failure.
Actual behaviour
The stage of the K6 resource doesn't change; it stays stuck on `initialization`.
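Concretely, the resource keeps reporting the same stage forever; a rough sketch of what it looks like while stuck (trimmed, field values approximate):

```yaml
# Rough sketch: the stage stays at "initialization" even after the init
# job's pod has failed, and no error is surfaced in the status.
apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: k6-sample
status:
  stage: initialization   # never changes, regardless of the pod failure
```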
Hi @JorTurFer! Thanks for working on this. I wouldn't call it a bug but more of an improvement in logic, TBH :smile:
And this issue came up in several contexts recently! So linking key issues / PRs:
- #276
- https://github.com/grafana/k6-operator/issues/260
- https://github.com/grafana/k6-operator/issues/222
- #283
Quite a lot. I need to grok all of these to figure out what should be merged, changed, etc. It's in my TODO in the next couple of weeks so shouldn't be a long wait :+1: But as a heads up, there are duplicates and almost conflicts between the above.
I wouldn't call it a bug but more of an improvement in logic
I have to disagree, because the test is literally f**ked up. I mean, any kind of error during initialization leaves the test stuck without any useful feedback (not even useless feedback, it doesn't give any feedback at all) 😄 Any kind of automation around the test resource fails because of this, and it's quite annoying. Knowing the root cause, we have added extra checks on other resources, but the K6 resource's status isn't usable.
Hello! Any update about this topic?
Hi @JorTurFer, apologies for such a delay. Yes, actually, it's a good time to make this addition for the next release, given past and future work, but I'll have to ask for an update of your PR. Will comment over there.
Is this related to the case where it gets stuck like this:
time="2024-02-12T14:55:22Z" level=debug msg="Runner successfully initialized!"
time="2024-02-12T14:55:22Z" level=debug msg="Parsing CLI flags..."
time="2024-02-12T14:55:22Z" level=debug msg="Consolidating config layers..."
time="2024-02-12T14:55:22Z" level=debug msg="Parsing thresholds and validating config..."
time="2024-02-12T14:55:22Z" level=debug msg="Initializing the execution scheduler..."
time="2024-02-12T14:55:22Z" level=debug msg="Starting 2 outputs..." component=output-manager
time="2024-02-12T14:55:22Z" level=debug msg=Starting... output=InfluxDBv1
Init [ 0% ] Starting outputs
default [ 0% ]
and just does nothing after that?
My script works locally, but when I try to run it in CircleCI, this is as far as it gets.
Hi @JorTurFer, apologies for such a delay. Yes, actually, it's a good time to make this addition for the next release, given past and future work, but I'll have to ask for an update of your PR. Will comment over there.
Sure, I'll rebase it this week and update the conflicts 😄
This appears to be fixed now, with PRs #291 and #401. Thanks @JorTurFer and @irumaru!
Some additional notes on the expected behaviour when the initializer fails:
- With the `cleanup: "post"` option, k6-operator deletes resources pretty fast, so in order to observe the failure reliably, it is good to have a proper monitoring solution to watch logs and job / pod creation (see the sketch below).
- In cloud output mode, the test run will never get created in GCk6, so it won't appear in the UI. IOW, one has to check logs and metrics in their cluster to troubleshoot the error.
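For reference, the `cleanup` option mentioned above lives in the K6 spec; a minimal sketch (resource names are placeholders):

```yaml
# With cleanup: "post", the operator removes jobs and pods shortly after the
# run ends, so a failed initializer can be gone before you get to inspect it.
apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: k6-sample
spec:
  parallelism: 2
  cleanup: "post"         # resources are deleted soon after the test run
  script:
    configMap:
      name: stress-test
      file: test.js
```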
Nice! Sorry for going missing, my last few weeks have been terrible :( Happy to see that it's solved 😄 Thanks!