kuberay

fix: fix too short timeout causing cascading failures

Open • morotti opened this issue 2 months ago • 5 comments

Why are these changes needed?

Hello,

The 2-second timeout on liveness probes is way too short. It is causing cascading failures when the container is busy and cannot reply immediately. This is especially bad if you have CPU limits configured on the Ray pods, which restrict how much CPU the container can use.

In addition to that, TCP takes 2-3 seconds to detect a lost packet and retry. You should NEVER have a timeout below 5 seconds in any production software.

Checks

  • [ ] I've made sure the tests are passing.
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Manual tests
    • [X] This PR is not tested in CI (hopefully your test suite will run once the PR is opened ? :) )
    • [X] This PR has been tested in production

morotti • Oct 17 '25 16:10

Screenshot from production, this is your ray software failing in production ;)

You can see the Ray pod has been up for a few hours and is failing checks sporadically because the timeout is too short.

image

morotti • Oct 17 '25 16:10

I've encountered similar issues in the past from my testing, see https://github.com/ray-project/kuberay/issues/2355

We also increased the exec probe timeout for the Head pod to 5s, so I am also open to increasing it to 5s in worker pods. See https://github.com/ray-project/kuberay/pull/2353

andrewsykim • Oct 17 '25 17:10

Longer term, we really need to remove the dependency on exec probes. I believe that once we are using HTTP probes, we can use shorter timeouts with significantly better reliability. There's a PR for using HTTP probes (https://github.com/ray-project/kuberay/pull/2360); however, it's blocked on Ray unifying its health check endpoints: https://github.com/ray-project/ray/issues/56204

andrewsykim • Oct 17 '25 17:10

@andrewsykim you mentioned that this may be merely an example file and the settings may be coming from somewhere else? I had a look but I can't find where the setting comes from.

Do you think you can find the source and update Ray? This is a fairly critical bug.

morotti • Oct 23 '25 10:10

Hi @morotti, for the KubeRay Operator I think it should be here: https://github.com/ray-project/kuberay/blob/530318b450939d6df033cefa89878a28eef85cba/ray-operator/controllers/ray/utils/constant.go#L212-L219

But you can also override readinessProbe or livenessProbe in the YAML.

win5923 • Oct 29 '25 17:10
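
For readers who want to try the override win5923 mentions, here is a minimal sketch of what it might look like on a worker group in a RayCluster manifest. It is not taken from the PR. The cluster name, group name, image tag, health-check command, port, and every probe value other than the 5s timeout discussed in this thread are assumptions for illustration; check them against the defaults your KubeRay and Ray versions actually use before applying anything.

```yaml
# Sketch only: overrides the worker liveness probe with a 5s timeout.
# Names, image tag, command, and port below are illustrative assumptions.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster            # hypothetical name
spec:
  workerGroupSpecs:
    - groupName: example-workers   # hypothetical name
      replicas: 1
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.46.0   # example tag, substitute your own
              livenessProbe:
                exec:
                  command:
                    - bash
                    - "-c"
                    # Assumed health-check command; the inner wget timeout (4s)
                    # is kept below the probe timeout so wget gives up first.
                    - "wget --tries 1 -T 4 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success"
                initialDelaySeconds: 30   # illustrative
                periodSeconds: 10         # illustrative
                timeoutSeconds: 5         # the value proposed in this thread
                failureThreshold: 10      # illustrative
```

The same shape applies to readinessProbe. Setting the probe explicitly in the pod template is a per-cluster workaround; changing the operator default would still require a change to the constants linked above.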