
Monitor runners in case of insufficient resources

Open yorugac opened this issue 1 year ago • 5 comments

Feature Description

When one of the runners does not have sufficient resources allocated for the test, it goes into an OOM state (insufficient memory for the VUs; other types of errors can arise from the same cause as well). The operator does not monitor this condition in any way, which results in an infinite wait loop while waiting for the pods to bootstrap.

The operator should monitor this case and then abort the test.

Suggested Solution (optional)

Based on initial experiments, there are two loops that can become infinite in such cases: at stage = "created" and at stage = "started".

Note that this case needs to be handled differently for test runs in different modes. A rough illustration of what such a check could look like is sketched below.
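
For illustration only, here is a minimal sketch of how such a resource-failure check could look, based on standard Kubernetes pod status fields. This is hypothetical code, not the operator's implementation: the package and function names are made up, and the wiring into the existing loops is omitted.

```go
package runnerwatch

import (
	corev1 "k8s.io/api/core/v1"
)

// runnerOutOfResources reports whether a runner pod appears to be stuck
// because of insufficient resources: either a container was OOM-killed,
// or the pod cannot be scheduled (e.g. its requests are larger than any
// node can satisfy).
func runnerOutOfResources(pod *corev1.Pod) bool {
	// An OOM kill is reported in the container's terminated state.
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Terminated != nil && cs.State.Terminated.Reason == "OOMKilled" {
			return true
		}
		if cs.LastTerminationState.Terminated != nil &&
			cs.LastTerminationState.Terminated.Reason == "OOMKilled" {
			return true
		}
	}
	// A pod that cannot be scheduled keeps a PodScheduled=False condition,
	// typically with reason "Unschedulable".
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodScheduled &&
			cond.Status == corev1.ConditionFalse &&
			cond.Reason == corev1.PodReasonUnschedulable {
			return true
		}
	}
	return false
}
```

A check along these lines could be run from both the "created" and "started" loops, and a positive result would translate into aborting the test run instead of waiting indefinitely.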

Already existing or connected issues / PRs (optional)

Potentially connected issue: https://github.com/grafana/k6-operator/issues/222

yorugac avatar Jul 20 '23 08:07 yorugac

Some of the same considerations I mentioned about setup() and teardown() in https://github.com/grafana/k6-operator/issues/223#issuecomment-1643722499 may also apply here :thinking: Though maybe not entirely, since for the best UX, I imagine it would be best to rely on both k6 and k8s for error handling :thinking:

na-- avatar Jul 20 '23 11:07 na--

I believe we're encountering this infinite loop issue in version v0.0.10rc3.

We have hard limits set for K8S namespaces (CPU / memory / max number of pods). If a test setup violates the aforementioned limits, it results in an infinite loop. For instance, if someone sets parallelism to a number that exceeds the max-number-of-pods policy, the scheduled runner pods end up in an infinite "running" state loop.
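
For illustration only (a hypothetical sketch, not existing operator code): one possible way to fail fast in this scenario would be a pre-flight check comparing the requested parallelism against the remaining pod quota in the namespace, using the standard ResourceQuota status fields. The package and function names here are made up.

```go
package runnerwatch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// quotaAllowsRunners returns an error if the namespace's ResourceQuotas do not
// leave room for `parallelism` additional pods.
func quotaAllowsRunners(quotas []corev1.ResourceQuota, parallelism int64) error {
	for _, q := range quotas {
		hard, ok := q.Status.Hard[corev1.ResourcePods]
		if !ok {
			continue // this quota does not limit the pod count
		}
		used := q.Status.Used[corev1.ResourcePods]
		if free := hard.Value() - used.Value(); free < parallelism {
			return fmt.Errorf("quota %q leaves room for only %d more pods, but parallelism is %d",
				q.Name, free, parallelism)
		}
	}
	return nil
}
```

A check like this would not replace monitoring the runner pods themselves, but it could abort the test run early for the parallelism-vs-quota case described above.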

freevatar avatar Jul 27 '23 23:07 freevatar

@freevatar thanks! Your case is a "perfect" example of this problem. One thing I'd like to clarify: since you pointed out the version, did you not encounter this problem in previous versions, like v0.0.10rc2, etc.?

yorugac avatar Jul 28 '23 11:07 yorugac

@na-- I missed your comments :facepalm: thank you! But yes, this particular case is more about "Kubernetes-level" UX than about k6. Either way, it's on my TODO list to go through your distributed updates in the k6 repo - I'll comment then :+1:

yorugac avatar Jul 28 '23 11:07 yorugac

@yorugac

did you not encounter this problem in previous versions, like v0.0.10rc2, etc.?

Sorry for the confusion, what I meant is that we tested the latest available version as well.

freevatar avatar Jul 28 '23 12:07 freevatar