[Core] Metric unintentional_worker_failures_total is not accurate

Open • amir-f opened this issue 2 years ago • 1 comment
What happened + What you expected to happen

We run Ray on Kubernetes using the KubeRay project. We have a sanity test that runs a simple job via the job submission API. The workload succeeds, but the metric unintentional_worker_failures_total is incremented anyway.

That metric should not be incremented in this case. Its definition reads: "Number of worker failures that are not intentional."

I asked about it on the Slack channel and was told to file an issue.

Versions / Dependencies

Ray 2.6.1

Reproduction script

# workload.py

import ray


@ray.remote
def workload(val: str) -> str:
    return f"got {val}"


if __name__ == "__main__":
    ray.init()
    assert ray.get(workload.remote("foo")) == "got foo"
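
# submit.py

from ray.job_submission import JobSubmissionClient

# ray_head_address is the address of the Ray head's job server (dashboard),
# for example "http://127.0.0.1:8265" on a local cluster.
client = JobSubmissionClient(ray_head_address)
job_id = client.submit_job(entrypoint="python workload.py")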
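
To confirm the increment, one option is to read the counter from the cluster's Prometheus metrics output before and after submitting the job. The following is only a minimal sketch, not part of the original report: it assumes the metric is exported in Prometheus text format under the name ray_unintentional_worker_failures_total (Ray's usual ray_ prefix), and the sample text, the label names, and the helper sum_metric are hypothetical.

```python
import re


def sum_metric(exposition_text: str, metric_name: str) -> float:
    """Sum every sample of metric_name in Prometheus text-format output."""
    # A sample line looks like: name{label="value",...} 1.0
    # The name must be followed by labels or whitespace, so longer metric
    # names that merely share this prefix are not counted.
    pattern = re.compile(r"^" + re.escape(metric_name) + r"(\{.*\})?\s+(\S+)")
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = pattern.match(line)
        if match:
            total += float(match.group(2))
    return total


# Hypothetical scrape output, used only to illustrate the parsing.
# The label names and values are made up.
SAMPLE_BEFORE = """\
# TYPE ray_unintentional_worker_failures_total counter
ray_unintentional_worker_failures_total{NodeAddress="10.0.0.1"} 0.0
ray_unintentional_worker_failures_total{NodeAddress="10.0.0.2"} 0.0
"""

SAMPLE_AFTER = """\
# TYPE ray_unintentional_worker_failures_total counter
ray_unintentional_worker_failures_total{NodeAddress="10.0.0.1"} 1.0
ray_unintentional_worker_failures_total{NodeAddress="10.0.0.2"} 0.0
"""

METRIC = "ray_unintentional_worker_failures_total"
before = sum_metric(SAMPLE_BEFORE, METRIC)
after = sum_metric(SAMPLE_AFTER, METRIC)
print(f"counter increased by {after - before}")
```

In a real check, the two text blobs would come from scraping the metrics endpoint of each Ray node just before and just after the job runs. The endpoint address depends on how the cluster is configured.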

Issue Severity

Low: It annoys or frustrates me.

amir-f, Sep 05 '23 18:09

@anyscalesam This is unrelated to KubeRay and instead belongs to observability.

kevin85421, Feb 21 '24 03:02