kuberay icon indicating copy to clipboard operation
kuberay copied to clipboard

[Feature][APIServer] add retry for http client

Open machichima opened this issue 7 months ago • 9 comments

  • Enable retry in executeRequest in pkg/http/client.go
  • Add unit test to ensure the retry function works as expected

Why are these changes needed?

Currently, the apiserver http client does not support retry. This PR enables the retry feature in http client.

Related issue number

Part of #3344

Checks

  • [ ] I've made sure the tests are passing.
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Manual tests
    • [ ] This PR is not tested :(

machichima avatar May 06 '25 01:05 machichima

Hi @dentiny , PTAL Thanks!

machichima avatar May 06 '25 15:05 machichima

let me know if you want a discussion, or you feel the PR is ready to review :)

dentiny avatar May 20 '25 23:05 dentiny

Hi @dentiny , I've modified to retry only on specific http status errors (list below) and added backoff.

  • 408 - Request Timeout
  • 429 - Too Many Requests
  • 502 - Bad Gateway
  • 503 - Service Unavailable
  • 504 - Gateway Timeout

For the overall timeout, I think it is now set when specifying http.Client before sending to NewKuberayAPIServerClient, do we also need to take care of that in NewKuberayAPIServerClient?

https://github.com/ray-project/kuberay/blob/5f593bc20b8d7f58f2f7b92d38bcef4b568ec321/apiserver/test/e2e/types.go#L95

Thanks!

machichima avatar May 21 '25 14:05 machichima

Hi @dentiny

I found that my original retry logic is incorrect. The max attempt times should be maxRetry + 1. Originally, I use < in the following for loop, so the attempt time is maxRetry. I modified as follow:

https://github.com/ray-project/kuberay/blob/7e0d6ad77528843ae0921cc193fbd1781c18b219/apiserver/pkg/http/client.go#L664

machichima avatar May 24 '25 04:05 machichima

Hi @dentiny

I found that my original retry logic is incorrect. The max attempt times should be maxRetry + 1. Originally, I use < in the following for loop, so the attempt time is maxRetry. I modified as follow:

https://github.com/ray-project/kuberay/blob/7e0d6ad77528843ae0921cc193fbd1781c18b219/apiserver/pkg/http/client.go#L664

Thanks! Then can we add a unit test with max retry count 0, to make sure functor does execute?

dentiny avatar May 24 '25 19:05 dentiny

@dentiny I modified the TestAPIServerClientRetry to let it validate the callCount is expectCallCount and add the test case for maxRetry = 0

Thanks!

machichima avatar May 25 '25 04:05 machichima

Hi @machichima , I didn't see new tests added for "maxRetry = 0" case (no new commits either). Did I mis-understand?

dentiny avatar May 25 '25 11:05 dentiny

Hi @machichima , I didn't see new tests added for "maxRetry = 0" case (no new commits either). Did I mis-understand?

I am so sorry... I forgot to push the commit. Just pushed

Thanks!

machichima avatar May 25 '25 11:05 machichima

cc @kevin85421 for final review

The failing CI for autoscaler seems to be flaky test, while this PR only modify apiserver part

machichima avatar May 27 '25 04:05 machichima

Hi @kevin85421 / @rueian, do you think this PR is good to go? I want to raise it because @kenmcheng 's working on a similar feature for v2: https://github.com/ray-project/kuberay/pull/3869/files, seems some of the util functions could be extracted and reused. Thanks!

dentiny avatar Jul 17 '25 17:07 dentiny

@kevin85421 Thank you! Just opened a doc issue here: https://github.com/ray-project/kuberay/issues/3883

machichima avatar Jul 17 '25 21:07 machichima