[Test][Autoscaler] Deflake unexpected dead actors in tests by using more resources

Open rueian opened this issue 5 months ago • 2 comments
Why are these changes needed?

Use more resources to deflake the test. With this change, the flaky test passes 200 times without a failure on my Mac.

Related issue number

Closes #3701

Checks

  • [x] I've made sure the tests are passing.
  • Testing Strategy
    • [ ] Unit tests
    • [x] Manual tests
    • [ ] This PR is not tested :(

rueian • Jun 01 '25 23:06

Why did you conclude that the flakiness is due to resource issues? The resource configuration is unexpectedly high compared to the configuration before #3707.

This makes me feel that these PRs are hotfixes rather than fixes for the real root cause.

kevin85421 • Jun 02 '25 04:06

Why did you conclude that the flakiness is due to resource issues? The resource configuration is unexpectedly high compared to the configuration before #3707.

This makes me feel that these PRs are hotfixes rather than fixes for the real root cause.

https://github.com/ray-project/kuberay/pull/3707 has already resolved the unexpected actor exit issue on my MacBook. However, the problem still occurs on the Buildkite CI runners. I’m wondering if this could be due to the runners having lower performance. What we could try now is increasing the resource requirement once again for the CI runners.

Yes, you are right. If this still doesn't fix the flakiness on the CI runners, we will need to find another way.

rueian • Jun 02 '25 05:06