k6-operator
Make cloud output test runs resilient to operator restarts
A test run with cloud output is not resilient to an external restart of the operator's pod. This is mainly because the controller does not persist its full state while a cloud output test run is executing. When an external actor restarts the operator, the controller flow may be broken for any test run. For a test run with cloud output specifically, this can leave the test run started but never finalized.
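To make the failure mode concrete, here is a minimal Go sketch. It assumes the controller keeps the test run ID only in process memory. The `controller` type, its `testRunID` field and the `restart` helper are hypothetical and do not mirror k6-operator's actual code. A pod restart is modelled simply as constructing a fresh controller.

```go
package main

import "fmt"

// controller is a hypothetical, simplified stand-in for the operator's
// reconciler. Its state lives only in process memory.
type controller struct {
	testRunID string // set when the cloud test run is created
}

// startCloudRun records the test run ID returned by the cloud (value assumed).
func (c *controller) startCloudRun(id string) {
	c.testRunID = id
}

// canFinalize reports whether the controller still knows which cloud test run to finish.
func (c *controller) canFinalize() bool {
	return c.testRunID != ""
}

// restart models an external restart of the operator's pod:
// the process is replaced, so all in-memory state is gone.
func restart() *controller {
	return &controller{}
}

func main() {
	c := &controller{}
	c.startCloudRun("12345")
	fmt.Println("before restart, can finalize:", c.canFinalize())

	c = restart()
	fmt.Println("after restart, can finalize:", c.canFinalize())
}
```

Running it prints `true` before the simulated restart and `false` after it. The test run ID was never stored anywhere that outlives the process.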
More precisely, `FinishJobs` has always finalized by timeout, regardless of the state of the runner pods, since https://github.com/grafana/k6-operator/pull/86/commits/f08da61c27776c2fe89b325566751be5026ff059. However, if the operator's pod is restarted, the test run ID is lost and the test cannot be finalized. The full solution for such cases is to store the test run ID independently of the pod lifecycle, i.e. externally. Additionally, `FinishJobs` relies on the `cloud.InspectOutput.TotalDuration` field, which would also be lost on a restart.