k6-operator K6 is stuck on stage: `initialization` if the init job fails

Brief summary

The operator creates the init job successfully, but if the pod fails for any reason, the operator doesn't notice and the K6 resource stays stuck in the `initialization` stage until you remove it manually.
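
To make the missing check concrete, here is a minimal sketch in Go of how an init job's outcome could be derived from its status conditions. It is not the operator's actual code: `JobCondition`, `JobStatus`, `InitOutcome` and `classifyInitJob` are hypothetical local stand-ins so the example runs without a cluster. The only thing taken from Kubernetes is the convention that a Job reports a condition of type `Failed` or `Complete` with status `True` once it reaches a terminal state.

```go
package main

import "fmt"

// JobCondition and JobStatus are hypothetical stand-ins that mirror only the
// parts of a Kubernetes Job status this sketch needs.
type JobCondition struct {
	Type   string // e.g. "Complete" or "Failed"
	Status string // "True", "False" or "Unknown"
}

type JobStatus struct {
	Conditions []JobCondition
}

// InitOutcome is a hypothetical three-way result for the init job.
type InitOutcome string

const (
	InitPending   InitOutcome = "pending"
	InitSucceeded InitOutcome = "succeeded"
	InitFailed    InitOutcome = "failed"
)

// classifyInitJob returns InitFailed or InitSucceeded as soon as the job
// carries a terminal condition with status "True", and InitPending otherwise.
func classifyInitJob(s JobStatus) InitOutcome {
	for _, c := range s.Conditions {
		if c.Status != "True" {
			continue
		}
		switch c.Type {
		case "Failed":
			return InitFailed
		case "Complete":
			return InitSucceeded
		}
	}
	return InitPending
}

func main() {
	failed := JobStatus{Conditions: []JobCondition{{Type: "Failed", Status: "True"}}}
	running := JobStatus{}

	fmt.Println(classifyInitJob(failed))
	fmt.Println(classifyInitJob(running))
}
```

With a check like this, a reconcile loop could move the K6 resource out of `initialization` when the outcome is `failed` instead of waiting forever for a success that never comes.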

I'm willing to fix it (or at least to try it xD)

k6-operator version or image

latest (sha256:79df77fea27ab5820ce3f25167268d5094be2fc10d182283fce9921e3786fed1)

K6 YAML

Any manifest that produces an error in the init job

Other environment details (if applicable)

No response

Steps to reproduce the problem

Deploy a K6 manifest that makes the init pod fail, for example one that references a file that doesn't exist.

Expected behaviour

The stage of the K6 resource changes to reflect the failure

Actual behaviour

The stage of the K6 resource doesn't change; it stays at `initialization`

Sep 20 '23 13:09 JorTurFer

Hi @JorTurFer! Thanks for working on this. I wouldn't call it a bug but more of an improvement in logic, TBH :smile:

And this issue came up in several contexts recently! So linking key issues / PRs:

#276
https://github.com/grafana/k6-operator/issues/260
https://github.com/grafana/k6-operator/issues/222
#283

Quite a lot. I need to grok all of these to figure out what should be merged, changed, etc. It's on my TODO list for the next couple of weeks, so it shouldn't be a long wait :+1: But as a heads up, there are duplicates and near-conflicts among the above.

Sep 25 '23 14:09 yorugac

> I wouldn't call it a bug but more of an improvement in logic

I have to disagree because the test is literally f**ked up. I mean, any kind of error during initialization leaves the test stuck without any useful feedback (nor useless feedback, it doesn't give any feedback at all) 😄 Any kind of automation built on the test resource fails because of this, and it's quite annoying. Knowing the root cause, we have added more checks on other resources, but the K6 resource's status isn't usable.

Sep 25 '23 14:09 JorTurFer

Hello! Any update on this?

Feb 07 '24 22:02 JorTurFer

Hi @JorTurFer, apologies for such a delay. Yes, actually, it's a good time to make this addition for the next release, given past and future work, but I'll have to ask for an update of your PR. Will comment over there.

Feb 08 '24 12:02 yorugac

Is this related to the case where it gets stuck like this:

time="2024-02-12T14:55:22Z" level=debug msg="Runner successfully initialized!"
time="2024-02-12T14:55:22Z" level=debug msg="Parsing CLI flags..."
time="2024-02-12T14:55:22Z" level=debug msg="Consolidating config layers..."
time="2024-02-12T14:55:22Z" level=debug msg="Parsing thresholds and validating config..."
time="2024-02-12T14:55:22Z" level=debug msg="Initializing the execution scheduler..."
time="2024-02-12T14:55:22Z" level=debug msg="Starting 2 outputs..." component=output-manager
time="2024-02-12T14:55:22Z" level=debug msg=Starting... output=InfluxDBv1

Init      [   0% ] Starting outputs
default   [   0% ]

and just does nothing after that?

My script works locally, but when I try to run it in CircleCI this is as far as I get.

Feb 12 '24 14:02 alifemove

> Hi @JorTurFer, apologies for such a delay. Yes, actually, it's a good time to make this addition for the next release, given past and future work, but I'll have to ask for an update of your PR. Will comment over there.

Sure, I'll rebase it this week and resolve the conflicts 😄

Feb 12 '24 14:02 JorTurFer

This appears to be fixed now, with PRs #291 and #401. Thanks @JorTurFer and @irumaru!

Some additional notes on the expected behaviour when the initializer fails:

- With the `cleanup: "post"` option, k6-operator deletes the resources quickly. To observe the failure reliably, it helps to have a proper monitoring solution that watches logs and job / pod creation.
- In cloud output mode, the test run never gets created in GCk6, so it won't appear in the UI. IOW, one has to check logs and metrics on their own cluster to troubleshoot the error.

May 23 '24 15:05 yorugac

Nice! Sorry for going missing, my last few weeks have been terrible :( Happy to see that it's solved 😄 Thanks!

May 26 '24 10:05 JorTurFer