
Handle k6 exit codes

Open b0nete opened this issue 3 years ago • 13 comments

Hi, I'm running load tests in my Kubernetes cluster, but I have a problem when a test fails.

I need each test to be executed only once: whether it succeeds or fails, it should not run again. Currently, if a test passes it is not executed again, but if a threshold fails, a starter container is automatically created and it launches another pod that tries to run the test again.

I'm leaving my config files here. I tried setting abortOnFail on the thresholds and using an abortTest() function, but the problem persists. I think this is k6-operator behaviour; maybe you can help me.

This is my test file.

apiVersion: v1
kind: ConfigMap
metadata:
  name: k6-test
  namespace: k6-operator-system
data:
  test.js: |
    import http from 'k6/http';
    import { Rate } from 'k6/metrics';
    import { check, sleep, abortTest } from 'k6'; // note: the 'k6' module has no abortTest export

    const failRate = new Rate('failed_requests');

    export let options = {
      stages: [
        { target: 1, duration: '1s' },
        { target: 0, duration: '1s' },
      ],
      thresholds: {
        failed_requests: [{threshold: 'rate<=0', abortOnFail: true}],
        http_req_duration: [{threshold: 'p(95)<1', abortOnFail: true}],
      },
    };

    export default function () {
      const result = http.get('http://test/login/');
      // the label says 200 but the condition tests for 500, so this check
      // fails whenever the service actually returns 200
      check(result, {
        'http response status code is 200': result.status === 500,
      });
      failRate.add(result.status !== 200); // any non-200 response breaches the failed_requests threshold
      sleep(1);
      abortTest(); // 'k6' exports no abortTest (see import above): this is undefined and throws at runtime
    }

And this is my k6 definition.

apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: k6-sample
  namespace: k6-operator-system
spec:
  parallelism: 1
  script:
    configMap:
      name: k6-test
      file: test.js
  arguments: --out influxdb=http://influxdb.influxdb:8086/test
  scuttle:
    enabled: "false"

I hope you can help me, thanks!

b0nete avatar Sep 27 '21 16:09 b0nete

So, I think this is because k6 exits with a non-zero exit code, and so the k6-operator will keep retrying until it succeeds.

We could probably add that to the CRD as an option, e.g. a "restart: never" setting, and have k6-operator interpret that.
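
A hypothetical sketch of what such an option could look like on the K6 resource (no such field exists in the CRD today; the field name is made up for illustration):

apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: k6-sample
spec:
  parallelism: 1
  # hypothetical field: never recreate a failed runner pod
  restart: never
  script:
    configMap:
      name: k6-test
      file: test.js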

knechtionscoding avatar Sep 29 '21 13:09 knechtionscoding

@b0nete thanks for opening the issue!

I agree with @KnechtionsCoding that this happens because of the non-zero exit of k6 run. The number of completions for the k8s job is 1 by default, so the operator expects at least one successful exit. Another curious thing is that I don't actually observe multiple test runs when I try this scenario: the 1st runner fails with a non-zero exit, then a 2nd runner is created and gets stuck in the "paused" state. This likely happens because the 1st starter finished successfully and the operator doesn't have any additional logic for this case: no 2nd starter is created, and the 2nd runner waits indefinitely to be started.
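
For reference, these are the k8s Job defaults at play here (an illustration in plain Job terms, not the operator's actual manifest):

apiVersion: batch/v1
kind: Job
spec:
  completions: 1   # default: the Job needs one pod to exit successfully
  backoffLimit: 6  # default: failed pods are recreated up to 6 times before the Job fails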

IMO, this shouldn't be the default behavior: if thresholds fail, it is a reason for someone to look into the SUT and the script and figure out what to do with that. So k6-operator shouldn't be restarting any pods on failing thresholds :thinking:

yorugac avatar Dec 20 '21 13:12 yorugac

Looking at https://github.com/grafana/k6/blob/master/errext/exitcodes/codes.go:

| k6 error | exit code | meaning in k6-operator context | restart the runner? | is startup-only error? |
|---|---|---|---|---|
| CloudTestRunFailed | 97 | this error should never happen in k6-operator | no | - |
| CloudFailedToGetProgress | 98 | this error should never happen in k6-operator | no | - |
| ThresholdsHaveFailed | 99 | regular error, action is to be determined by user | no | - |
| SetupTimeout | 100 | regular error, likely the script or configuration needs to be reviewed | no | - |
| TeardownTimeout | 101 | regular error, likely the script or configuration needs to be reviewed | no | - |
| GenericTimeout | 102 | regular error, likely the script or configuration needs to be reviewed | no | - |
| GenericEngine | 103 | something going wrong in k6 setup and must be investigated | no | |
| InvalidConfig | 104 | regular error, test config should be reviewed | no | - |
| ExternalAbort | 105 | os.Interrupt, SIGINT or SIGTERM are regular errors but everything else should never happen in k6-operator | yes* | no |
| CannotStartRESTAPI | 106 | runner cannot be started without working REST | yes | yes |
| ScriptException | 107 | regular error, script must be reviewed | no | - |
| ScriptAborted | 108 | regular error, script must be reviewed | no | - |
  • ~~unless there is a point in restarting on SIGINT and SIGTERM specifically?~~ Other cases of ExternalAbort happen in k6 cloud execution, which is not used in the operator. During k6 run, ExternalAbort implies interrupts: SIGINTs and SIGTERMs.

EDIT 17 Feb: updated the table with Simme's input and additional info.
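
To make the restart column concrete, here is a purely hypothetical CRD shape (no such field exists; the exit codes are taken from the table above):

spec:
  # hypothetical field: only these k6 exit codes would trigger a reschedule
  restartOnExitCodes:
    - 105  # ExternalAbort (yes*, see footnote)
    - 106  # CannotStartRESTAPI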

yorugac avatar Dec 21 '21 11:12 yorugac

  • CannotStartRESTAPI should probably lead to a reschedule, as this is likely caused by networking issues on the cluster node.
  • ExternalAbort is also (most) likely to happen due to timing/scheduling issues, e.g. pod eviction policies being triggered, and there is a pretty high chance that rescheduling the job would resolve it.

Do note that I use the term reschedule rather than restart, though. Restarting the exact same pod would likely lead to another failure, but allowing k8s to destroy the pod and reschedule it (preferably even to another node) might not.
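
In k8s Job terms, that distinction maps onto the pod template's restartPolicy (a sketch, not operator source):

apiVersion: batch/v1
kind: Job
spec:
  backoffLimit: 2            # cap on replacement pods before the Job is marked failed
  template:
    spec:
      # Never = don't restart the failed container in place; the Job controller
      # creates a fresh pod, which the scheduler may place on a different node
      restartPolicy: Never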

simskij avatar Dec 26 '21 23:12 simskij

  • CannotStartRESTAPI should probably lead to a reschedule, as this is likely caused by networking issues on the cluster node.

Good point! There should be a limit to the number of such restarts, though.

yorugac avatar Jan 05 '22 11:01 yorugac

In PR #86, the backoff limit for runner jobs was set to 0: that disables all restarts, no matter the exit code. It's a partial solution to this issue. The cases where a restart is warranted (as noted in the comments above) should be solved separately.
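
In Job terms, the fix amounts to this (a sketch of the effect; see the PR for the actual change):

apiVersion: batch/v1
kind: Job
spec:
  backoffLimit: 0  # never recreate a failed runner pod, regardless of exit code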

yorugac avatar Feb 18 '22 16:02 yorugac

Any progress on this? It blocks usage of the operator for me, unfortunately. As a workaround, I'm thinking I could patch the job after the operator creates it.

jsravn avatar Mar 30 '22 15:03 jsravn

Hi @jsravn, as described in the last comment before yours, this was partially fixed in https://github.com/grafana/k6-operator/pull/86/commits/0cdcc9d75f9b5cc1bf6bf53f8775c7102fc0e69a as part of PR #86. I expected that PR to be merged by now, but it's being delayed due to other issues :disappointed:

I'll pull out this specific commit with the backoff change tomorrow so that it can be merged into the main branch independently from #86. Please watch for updates :slightly_smiling_face:

yorugac avatar Mar 31 '22 08:03 yorugac

Was this merged up? @yorugac

mhaddon avatar Apr 28 '22 23:04 mhaddon

@mhaddon yes, the fix is in main: https://github.com/grafana/k6-operator/commit/278035580ffaa523b1a62f02e801fe7e35c7c5ab. So the latest image from the main branch contains it.

yorugac avatar Apr 29 '22 07:04 yorugac

What image is that? Because I tried v0.0.7rc4 (https://github.com/grafana/k6-operator/tree/v0.0.7rc4/config/default) and it doesn't have it.

ghcr.io/grafana/operator:latest

Or do I build it myself?

mhaddon avatar Apr 29 '22 10:04 mhaddon

No, you don't need to build it; it's available with the commit hash as the tag: ghcr.io/grafana/operator:278035580ffaa523b1a62f02e801fe7e35c7c5ab. You can find all the images built for the operator on this page: https://github.com/grafana/k6-operator/pkgs/container/operator
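
To use it, point the operator Deployment at that tag, e.g. (a sketch; the container name and manifest layout depend on how you installed the operator):

spec:
  template:
    spec:
      containers:
        - name: manager  # assumed container name, check your deployment
          image: ghcr.io/grafana/operator:278035580ffaa523b1a62f02e801fe7e35c7c5ab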

yorugac avatar Apr 29 '22 12:04 yorugac

Connected issue in k6: https://github.com/grafana/k6/issues/2804

yorugac avatar Mar 17 '23 15:03 yorugac