k6-operator

Make cloud output test runs resilient to operator's restarts

Open yorugac opened this issue 2 years ago • 0 comments

A test run with cloud output is not resilient to an external restart of the operator's pod. This is mainly because the controller does not store its full state for cloud output execution. When the operator is restarted by an external actor, the controller's flow may break for any test run; for a test run with cloud output specifically, the test run may be started but never finalized.

More precisely, since https://github.com/grafana/k6-operator/pull/86/commits/f08da61c27776c2fe89b325566751be5026ff059, FinishJobs is set to always finalize by timeout, regardless of the state of the runner pods. But if the operator's pod is restarted, the test run ID is lost and the test can no longer be finalized. The full solution for such cases is to store the test run ID independently of the pod lifecycle, i.e. externally. Additionally, FinishJobs relies on the cloud.InspectOutput.TotalDuration field, which would also be lost on a restart.

yorugac · May 13 '22 14:05