k6
grpc: unstable spike test results
Brief summary
I'm trying to run a spike test against the gRPC API of my service:
- I want to find the exact test parameters (VUs, iterations) that make it fail;
- the test should be around 10 seconds long.

The problem is that the scenario I created isn't stable: sometimes it causes the service to fail and sometimes it doesn't. For that reason, I can't determine precise crash-load parameters. I've read a bunch of k6 docs and articles and tried different scenarios, but didn't succeed.
```javascript
import grpc from 'k6/net/grpc';
import { check } from 'k6';
import { Rate, Counter } from 'k6/metrics';

const error_counter = new Counter('error_counter');
const error_rate = new Rate('error_rate');
const req_counter = new Counter('req_counter');

export const options = {
  // scenarios: {
  //   contacts: {
  //     executor: 'constant-vus',
  //     vus: 50,
  //     duration: '10s',
  //   },
  // },
  scenarios: {
    contacts: {
      executor: 'constant-arrival-rate',
      rate: 200,
      timeUnit: '1s',
      duration: '10s',
      preAllocatedVUs: 1000,
    },
  },
};

const client = new grpc.Client();
client.load(['definitions'], 'hello.proto');

export default () => {
  client.connect('localhost:9091', {
    plaintext: true,
  });

  const req = { greeting: 'Bert' };
  const params = {
    tags: { name: 'mytag' },
    metadata: { 'client-id': 'load-tester' },
  };

  for (let i = 0; i < 1000; i++) {
    const response = client.invoke('Prices/FetchPriceByArticleId', req, params);
    const success = check(response, {
      'status is OK': (r) => r && r.status === grpc.StatusOK,
    });
    req_counter.add(1);
    if (success) {
      error_rate.add(0);
    } else {
      error_rate.add(1);
      error_counter.add(1);
    }
  }

  client.close();
};
```
hello.proto
```proto
syntax = "proto2";

package hello;

service HelloService {
  rpc SayHello(HelloRequest) returns (HelloResponse);
  rpc LotsOfReplies(HelloRequest) returns (stream HelloResponse);
  rpc LotsOfGreetings(stream HelloRequest) returns (HelloResponse);
  rpc BidiHello(stream HelloRequest) returns (stream HelloResponse);
}

message HelloRequest {
  optional string greeting = 1;
}

message HelloResponse {
  required string reply = 1;
}
```
k6 version
k6 v0.37.0 ((devel), go1.17.8, darwin/amd64)
OS
macOS 12.1
Docker version and image (if applicable)
No response
Steps to reproduce the problem
Run the script.
Expected behaviour
The service fails consistently under the same test parameters.
Actual behaviour
The service fails only sometimes under the same test parameters.
I am not sure I can help you here, since this seems more likely to be caused by the system that you are testing and not by k6, but here are some things you can try to make the test more consistent:
- Get rid of the `for` loop in the default function, make only one request per iteration, and just use a higher `rate` in the arrival-rate config. Right now every iteration will sequentially make requests, i.e. every request after the first one will wait for the previous one to complete or fail. That means that the system-under-test somewhat determines the request pacing - if it starts responding more slowly, you'll make fewer requests, the so-called coordinated omission problem. See this for more details: https://k6.io/docs/using-k6/scenarios/arrival-rate/
- `connect()` only once per VU, which you can do by using the `k6/execution` API like this:

  ```javascript
  import exec from 'k6/execution';

  // ...

  export default function () {
    if (exec.vu.iterationInScenario == 0) {
      client.connect('localhost:9091', { plaintext: true });
    }
    // ...
    // you probably don't even need to explicitly disconnect for such a short test
  }
  ```

- make the test somewhat longer
- export the detailed metric measurements to an external output (e.g. JSON or a CSV file) and see if anything strange is going on with k6 or your system
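Putting the first two suggestions together, a minimal sketch of a reworked script might look like this (the address, rate, and RPC name are just carried over from the original script as placeholders):

```javascript
import grpc from 'k6/net/grpc';
import exec from 'k6/execution';
import { check } from 'k6';

export const options = {
  scenarios: {
    contacts: {
      executor: 'constant-arrival-rate',
      rate: 2000, // one request per iteration now, so raise the rate instead of looping
      timeUnit: '1s',
      duration: '10s',
      preAllocatedVUs: 1000,
    },
  },
};

const client = new grpc.Client();
client.load(['definitions'], 'hello.proto');

export default () => {
  // Connect only on the VU's first iteration and reuse the connection afterwards.
  if (exec.vu.iterationInScenario === 0) {
    client.connect('localhost:9091', { plaintext: true });
  }

  const response = client.invoke('Prices/FetchPriceByArticleId', { greeting: 'Bert' });
  check(response, {
    'status is OK': (r) => r && r.status === grpc.StatusOK,
  });
};
```

With an open (arrival-rate) model like this, k6 starts iterations on schedule regardless of how fast the service responds, so slow responses no longer reduce the offered load.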
So far, there isn't enough data here to classify this as a k6 bug; it seems more likely to be a problem with how the test is written and how the tested system behaves. However, I'll keep the issue open for a while and will re-evaluate if you have any more details that will allow us to reproduce this problem on our side.
Thanks for the advice! What do you mean by making the test longer - the duration, or something else? If I want to find the number of requests per second that crashes my service given a particular amount of RAM, what duration should I specify for that?
I agree, 10s is pretty short, and your data may have outliers. Here are some things you could try to isolate the issue:
- Add think time with `sleep()` if that matches the load your application might get in production. With everything else the same, do you get more consistent results?
- Monitor CPU and memory utilization on the machine you're running k6 from, so that you can see if there is a correlation between that and the service crashing. Make sure CPU and memory utilization are no higher than 80% for the duration of the test.
- Lower the arrival rate and increase the duration (say, to 5m at 10 rps initially). Increase the rps until the service crashes. Try to get a baseline for what the service CAN handle consistently.
- Use a constant VUs executor where you set VUs instead of rps. Does the service respond more consistently to VU levels than rps?
- Add a ramp-up and ramp-down to see if your results are different when the load changes more gradually.
- Use stages to run some lower constant load (that you've determined the service can handle) before spiking. Sometimes spiking the load from 0 (a cold start) can lead to mixed results.
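The last two suggestions could be combined in a single `ramping-arrival-rate` scenario, for example (the durations and target rates below are placeholder values to adjust against whatever baseline you establish):

```javascript
export const options = {
  scenarios: {
    spike: {
      executor: 'ramping-arrival-rate',
      startRate: 0,
      timeUnit: '1s',
      preAllocatedVUs: 500,
      stages: [
        { duration: '1m', target: 50 },   // ramp up to a rate the service can handle
        { duration: '2m', target: 50 },   // hold the baseline to avoid a cold start
        { duration: '30s', target: 500 }, // spike
        { duration: '1m', target: 0 },    // ramp down
      ],
    },
  },
};
```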
Good luck!
Hello @kostiamol

For capacity tests or breakpoint scenarios, it is recommended to do a gradual increase to better assess where your fail point and breaking point are. Some more info here: https://github.com/grafana/k6-workshop/blob/main/Modules/Load%20Testing.md#breakpoint-load

While you are gradually ramping up your load, pay close attention to the tested instance and to the k6 machine. You want to identify which one is the bottleneck and make sure it is not the k6 generator. You just need a constant increase in throughput until you identify those points. I would recommend using something like this:

```javascript
scenarios: {
  contacts: {
    executor: 'ramping-vus',
    startVUs: 0,
    stages: [
      { duration: '5m', target: 1000 },
    ],
    gracefulRampDown: '0s',
  },
},
```

Let us know for any questions.

Gracias, Leandro
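A related sizing note: with arrival-rate executors, `preAllocatedVUs` has to cover the expected concurrency, which you can estimate with Little's law (concurrency ≈ arrival rate × response time). A small sketch, where the latency and safety factor are assumed example values, not measurements from this service:

```javascript
// Estimate how many VUs an arrival-rate scenario needs:
// concurrency ≈ arrival rate (req/s) × expected response time (s),
// padded by a safety factor to absorb latency spikes.
function requiredVUs(ratePerSec, expectedLatencySec, safetyFactor) {
  return Math.ceil(ratePerSec * expectedLatencySec * safetyFactor);
}

// e.g. 200 rps with ~500 ms responses and a 2x safety margin
console.log(requiredVUs(200, 0.5, 2)); // 200
```

If k6 runs out of pre-allocated VUs mid-test it drops iterations and logs a warning, so it is usually cheaper to over-allocate a little than to skew the offered load.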
Thanks for your answers guys! You definitely patched my view of the situation.
For now, the last issue I'm struggling with is that the service runs as a k8s pod and I'm testing it locally through port forwarding. When the load increases, I get errors from k6 like:
```
ERRO[0062] GoError: context deadline exceeded
    running at reflect.methodValueCall (native)
    contacts at file:///Users/kostiamol/projects/local/script.js:62:41(14)
    at native  executor=constant-arrival-rate scenario=contacts source=stacktrace
```
and k8s answers with:
```
E0412 15:11:51.330587   13410 portforward.go:306] error accepting connection on port 9091: accept tcp4 127.0.0.1:9091: accept: too many open files
```
I think that the error you are getting is essentially a timeout from the tested application. What are you using to measure that you are reaching the breakpoint of your service? I am not an expert on Go or k8s, but both seem to be reporting issues: k6 is timing out, and k8s is unable to accept a new connection.
As a recommendation, make sure you are monitoring the tested element. Make sure you have defined what your indicators are for performance degradation and for the breaking point of the system.
Hope that helps :)
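One more thing worth checking for the `too many open files` error: each connection through the port-forward consumes one file descriptor in the `kubectl` process, so the per-process descriptor limit on your Mac is likely the first thing to hit. A quick check, assuming a macOS/Linux shell (the suggested value is just an example):

```shell
# Show the current per-process limit on open file descriptors
ulimit -n

# To raise the soft limit for the current shell session before starting
# "kubectl port-forward" and k6, run something like:
#   ulimit -S -n 4096
```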