fix: fix too short timeout causing cascading failures
Why are these changes needed?
Hello,
The 2-second timeout on liveness probes is way too short. It causes cascading failures when the container is busy and cannot reply immediately. This is especially bad if you have CPU limits configured on the Ray pods, which restrict how much CPU the container can use.
In addition to that, TCP takes 2-3 seconds to detect a lost packet and retry. You should NEVER have a timeout below 5 seconds in any production software.
Checks
- [ ] I've made sure the tests are passing.
- Testing Strategy
   - [ ] Unit tests
   - [ ] Manual tests
   - [X] This PR is not tested in CI (hopefully your test suite will run once the PR is opened? :) )
   - [X] This PR has been tested in production
Screenshot from production, this is your ray software failing in production ;)
You can see the Ray pod has been up for a few hours and is failing checks sporadically because the timeout is too short.
I've encountered similar issues in the past from my testing, see https://github.com/ray-project/kuberay/issues/2355
We also increased the exec probe timeout for the Head pod to 5s (https://github.com/ray-project/kuberay/pull/2353), so I am also open to increasing it to 5s for worker pods.
Longer term, we really need to remove the dependency on exec probes. I believe that once we are using HTTP probes, we can use shorter timeouts with significantly better reliability. There's a PR for using HTTP probes (https://github.com/ray-project/kuberay/pull/2360); however, it's blocked on Ray unifying its health check endpoints: https://github.com/ray-project/ray/issues/56204
@andrewsykim you mentioned that this may be merely an example file and the settings may be coming from somewhere else? I had a look but I can't find where the setting comes from.
Do you think you can find the source and update Ray? This is a fairly critical bug.
Hi @morotti, for the KubeRay operator I think it should be here: https://github.com/ray-project/kuberay/blob/530318b450939d6df033cefa89878a28eef85cba/ray-operator/controllers/ray/utils/constant.go#L212-L219
But you can also override readinessProbe or livenessProbe in the YAML.
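For reference, a rough sketch of what such an override might look like on a worker group in a RayCluster manifest. The probe fields are the standard Kubernetes ones; the cluster, group, and container names, the image, and the probe command are placeholders I made up for illustration, not what the operator actually generates, so copy the real command from the injected probe on a running pod before using this.

```yaml
# Sketch only: override the liveness probe timeout on Ray worker pods.
# Names, image, and the exec command below are hypothetical placeholders.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster            # placeholder name
spec:
  workerGroupSpecs:
    - groupName: example-workers   # placeholder name
      replicas: 1
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker     # placeholder name
              image: rayproject/ray:latest   # placeholder; pin a real version
              livenessProbe:
                exec:
                  # Placeholder: replace with the health check command
                  # the operator injects by default.
                  command: ["bash", "-c", "echo replace-with-real-health-check"]
                timeoutSeconds: 5      # raised from the 2s discussed above
                periodSeconds: 5       # assumed value, tune as needed
                failureThreshold: 120  # assumed value, tune as needed
```

Since this replaces the whole probe rather than just the timeout, the command and thresholds have to be kept in sync with the operator defaults by hand, which is why fixing the default in constant.go is still preferable.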