
Handle k6 exit codes

Open b0nete opened this issue 3 years ago • 13 comments

Hi, I'm running load tests in my Kubernetes cluster, but I have a problem when a test fails.

I need each test to be executed only once: whether it succeeds or fails, it should not run again. Currently, if a test passes it is not executed again, but if a threshold fails, a starter container is automatically created and it launches another pod that tries to run the test again.

I'm leaving my config files here. I tried setting abortOnFail on the thresholds and using an abortTest() function, but the problem persists. I think this is k6-operator behaviour; maybe you can help me.

This is my test file.

apiVersion: v1
kind: ConfigMap
metadata:
  name: k6-test
  namespace: k6-operator-system
data:
  test.js: |
    import http from 'k6/http';
    import { Rate } from 'k6/metrics';
    import { check, sleep, abortTest } from 'k6'; // note: the 'k6' module has no abortTest export

    const failRate = new Rate('failed_requests');

    export let options = {
      stages: [
        { target: 1, duration: '1s' },
        { target: 0, duration: '1s' },
      ],
      thresholds: {
        failed_requests: [{threshold: 'rate<=0', abortOnFail: true}],
        http_req_duration: [{threshold: 'p(95)<1', abortOnFail: true}],
      },
    };

    export default function () {
      const result = http.get('http://test/login/');
      // the label says 200 but the condition tests for 500, so this check
      // fails whenever the service actually returns 200
      check(result, {
        'http response status code is 200': result.status === 500,
      });
      failRate.add(result.status !== 200); // any non-200 response breaches the failed_requests threshold
      sleep(1);
      abortTest(); // 'k6' exports no abortTest (see import above): this is undefined and throws at runtime
    }

And this is my k6 definition.

apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: k6-sample
  namespace: k6-operator-system
spec:
  parallelism: 1
  script:
    configMap:
      name: k6-test
      file: test.js
  arguments: --out influxdb=http://influxdb.influxdb:8086/test
  scuttle:
    enabled: "false"

I hope you can help me, thanks!

b0nete avatar Sep 27 '21 16:09 b0nete

So, I think this is because k6 exits with a non-zero exit code, and so the k6-operator will keep retrying until it succeeds.

We could probably add that to the CRD as an option, e.g. a "restart: never" setting, and have k6-operator interpret that.
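
A hypothetical sketch of what such an option could look like on the K6 resource (no such field exists in the CRD today; the field name is made up for illustration):

apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: k6-sample
spec:
  parallelism: 1
  # hypothetical field: never recreate a failed runner pod
  restart: never
  script:
    configMap:
      name: k6-test
      file: test.js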

knechtionscoding avatar Sep 29 '21 13:09 knechtionscoding

@b0nete thanks for opening the issue!

I agree with @KnechtionsCoding that this happens because of the non-zero exit of k6 run. The number of completions for the k8s job is 1 by default, so the operator expects at least one successful exit. Another curious thing is that I don't actually observe multiple test runs when I try this scenario: the 1st runner fails with a non-zero exit, then a 2nd runner is created and gets stuck in the "paused" state. This likely happens because the 1st starter finished successfully and the operator doesn't have any additional logic for this case: no 2nd starter is created, and the 2nd runner waits indefinitely to be started.
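
For reference, these are the k8s Job defaults at play here (an illustration in plain Job terms, not the operator's actual manifest):

apiVersion: batch/v1
kind: Job
spec:
  completions: 1   # default: the Job needs one pod to exit successfully
  backoffLimit: 6  # default: failed pods are recreated up to 6 times before the Job fails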

IMO, this shouldn't be the default behavior: if thresholds fail, it is a reason for someone to look into the SUT and the script and figure out what to do with that. So k6-operator shouldn't be restarting any pods on failing thresholds :thinking:

yorugac avatar Dec 20 '21 13:12 yorugac

Looking at https://github.com/grafana/k6/blob/master/errext/exitcodes/codes.go:

| k6 error | exit code | meaning in k6-operator context | restart the runner? | is startup-only error? |
|---|---|---|---|---|
| CloudTestRunFailed | 97 | this error should never happen in k6-operator | no | - |
| CloudFailedToGetProgress | 98 | this error should never happen in k6-operator | no | - |
| ThresholdsHaveFailed | 99 | regular error, action is to be determined by user | no | - |
| SetupTimeout | 100 | regular error, likely the script or configuration needs to be reviewed | no | - |
| TeardownTimeout | 101 | regular error, likely the script or configuration needs to be reviewed | no | - |
| GenericTimeout | 102 | regular error, likely the script or configuration needs to be reviewed | no | - |
| GenericEngine | 103 | something going wrong in k6 setup and must be investigated | no | |
| InvalidConfig | 104 | regular error, test config should be reviewed | no | - |
| ExternalAbort | 105 | os.Interrupt, SIGINT or SIGTERM are regular errors but everything else should never happen in k6-operator | yes* | no |
| CannotStartRESTAPI | 106 | runner cannot be started without working REST | yes | yes |
| ScriptException | 107 | regular error, script must be reviewed | no | - |
| ScriptAborted | 108 | regular error, script must be reviewed | no | - |
  • ~~unless there is a point in restarting on SIGINT and SIGTERM specifically?~~ Other cases of ExternalAbort happen in k6 cloud execution, which is not used in the operator. During k6 run, ExternalAbort implies interrupts: SIGINTs and SIGTERMs.

EDIT 17 Feb: updated the table with Simme's input and additional info.
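
To make the restart column concrete, here is a purely hypothetical CRD shape (no such field exists; the exit codes are taken from the table above):

spec:
  # hypothetical field: only these k6 exit codes would trigger a reschedule
  restartOnExitCodes:
    - 105  # ExternalAbort (yes*, see footnote)
    - 106  # CannotStartRESTAPI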

yorugac avatar Dec 21 '21 11:12 yorugac

  • CannotStartRESTAPI should probably lead to a reschedule, as this is likely caused by networking issues on the cluster node.
  • ExternalAbort is also (most) likely to happen due to timing/scheduling issues, e.g. pod eviction policies being triggered, and there is a pretty high chance that rescheduling the job would resolve it.

Do note that I use the term reschedule rather than restart, though. Restarting the exact same pod would likely lead to another failure, but allowing k8s to destroy the pod and reschedule it (preferably even to another node) might not.
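
In k8s Job terms, that distinction maps onto the pod template's restartPolicy (a sketch, not operator source):

apiVersion: batch/v1
kind: Job
spec:
  backoffLimit: 2            # cap on replacement pods before the Job is marked failed
  template:
    spec:
      # Never = don't restart the failed container in place; the Job controller
      # creates a fresh pod, which the scheduler may place on a different node
      restartPolicy: Never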

simskij avatar Dec 26 '21 23:12 simskij

  • CannotStartRESTAPI should probably lead to a reschedule, as this is likely caused by networking issues on the cluster node.

Good point! There should be a limit to the number of such restarts, though.

yorugac avatar Jan 05 '22 11:01 yorugac

In PR #86, the backoff limit for runner jobs was set to 0: that disables all restarts, no matter the exit code. It's a partial solution to this issue. The cases where a restart is warranted (as noted in the comments above) should be solved separately.
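
In Job terms, the fix amounts to this (a sketch of the effect; see the PR for the actual change):

apiVersion: batch/v1
kind: Job
spec:
  backoffLimit: 0  # never recreate a failed runner pod, regardless of exit code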

yorugac avatar Feb 18 '22 16:02 yorugac

Any progress on this? It blocks usage of the operator for me, unfortunately. As a workaround, I'm thinking I could patch the job after the operator creates it.

jsravn avatar Mar 30 '22 15:03 jsravn

Hi @jsravn, as described in the last comment before yours, this was partially fixed in https://github.com/grafana/k6-operator/pull/86/commits/0cdcc9d75f9b5cc1bf6bf53f8775c7102fc0e69a as part of PR #86. I expected that PR to be merged by now, but it's being delayed due to other issues :disappointed:

I'll pull out this specific commit with the backoff change tomorrow so that it can be merged into the main branch independently from #86. Please watch for updates :slightly_smiling_face:

yorugac avatar Mar 31 '22 08:03 yorugac

Was this merged up? @yorugac

mhaddon avatar Apr 28 '22 23:04 mhaddon

@mhaddon yes, the fix is in main: https://github.com/grafana/k6-operator/commit/278035580ffaa523b1a62f02e801fe7e35c7c5ab. So the latest image from the main branch contains it.

yorugac avatar Apr 29 '22 07:04 yorugac

What image is that? Because I tried v0.0.7rc4 (https://github.com/grafana/k6-operator/tree/v0.0.7rc4/config/default) and it doesn't have it.

ghcr.io/grafana/operator:latest

Or do I build it myself?

mhaddon avatar Apr 29 '22 10:04 mhaddon

No, you don't need to build it; it's available with the commit hash as the tag: ghcr.io/grafana/operator:278035580ffaa523b1a62f02e801fe7e35c7c5ab. You can find all the images built for the operator on this page: https://github.com/grafana/k6-operator/pkgs/container/operator
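
To use it, point the operator Deployment at that tag, e.g. (a sketch; the container name and manifest layout depend on how you installed the operator):

spec:
  template:
    spec:
      containers:
        - name: manager  # assumed container name, check your deployment
          image: ghcr.io/grafana/operator:278035580ffaa523b1a62f02e801fe7e35c7c5ab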

yorugac avatar Apr 29 '22 12:04 yorugac

Connected issue in k6: https://github.com/grafana/k6/issues/2804

yorugac avatar Mar 17 '23 15:03 yorugac