kuberay icon indicating copy to clipboard operation
kuberay copied to clipboard

Adding serve support for Api server

Open blublinsky opened this issue 1 year ago • 15 comments
trafficstars

Why are these changes needed?

Extending API server to support serve

Related issue number

Checks

  • [x ] I've made sure the tests are passing.
  • Testing Strategy
    • [ x] Unit tests
    • [ x] Manual tests
    • [ ] This PR is not tested :(

blublinsky avatar Dec 11 '23 12:12 blublinsky

@tedhtchang, @z103cb please take a look

cc: @kevin85421

blublinsky avatar Dec 11 '23 12:12 blublinsky

@kevin85421 its ready for you

blublinsky avatar Dec 12 '23 07:12 blublinsky

The CI fails consistently. Would you mind taking a look? Thanks!

kevin85421 avatar Dec 12 '23 21:12 kevin85421

This test sometimes PASSED locally.

pytest --log-level=info -v -s tests/test_sample_rayservice_yamls.py::TestRayService::test_zero_downtime_rollout

Sometimes failed with

Failed to connect to rayservice-sample-serve-svc.default.svc.cluster.local port 8000: Connection refused
command terminated with exit code 7
....
INFO     framework.utils:utils.py:375 Execute command (check_output): kubectl exec curl -n default -- curl -X POST -H 'Content-Type: application/json' rayservice-sample-serve-svc.default.svc.cluster.local:8000/fruit/ -d '["MANGO", 2]'
INFO     framework.utils:utils.py:382 Exception: b''
====================================================

or

....
INFO     test_sample_rayservice_yamls:test_sample_rayservice_yamls.py:149 Ray service transitioned to status Running.
INFO     test_sample_rayservice_yamls:test_sample_rayservice_yamls.py:154 Ray service has moved to cluster "rayservice-sample-raycluster-xtvwr"
INFO     framework.utils:utils.py:375 Execute command (check_output): kubectl exec curl -n default -- curl -X POST -H 'Content-Type: application/json' rayservice-sample-serve-svc.default.svc.cluster.local:8000/fruit/ -d '["MANGO", 2]'
INFO     framework.utils:utils.py:379 Output: b'6'
============================================================================== short test summary info ==============================================================================
FAILED tests/test_sample_rayservice_yamls.py::TestRayService::test_zero_downtime_rollout - AssertionError

tedhtchang avatar Dec 13 '23 08:12 tedhtchang

ANd I do not think this PR changes anything there

blublinsky avatar Dec 13 '23 08:12 blublinsky

Hmm, I'm not referring to the test-rayservice-sample-yamls-latest-release. More than one CI check has failed. You may need to rebase the branch or open a new PR instead.

kevin85421 avatar Dec 14 '23 01:12 kevin85421

@kevin85421 we are all set

blublinsky avatar Dec 14 '23 14:12 blublinsky

Sorry @kevin85421 , I completely disagree with your statements, see point-by-point replies below

Perhaps the KubeRay API server should construct its own HTTP client instead of using the existing one in KubeRay. I do not think it makes sense to maintain 2 copies of identical code The dashboard HTTP client in KubeRay is not a user-facing API, so no one promises its backward compatibility. Constructing a new one in the KubeRay API server can improve the stability. It is not user-facing in the API server either. API Server uses these APIs to communicate with the Ray cluster and then convert them to the internal representation defined by its own APIs Reducing the dependencies between KubeRay API server and KubeRay can avoid a lot of issues for users and developers. For example, the KubeRay API server may want to support multiple versions of KubeRay and Ray. It is hard to achieve if there are a lot of dependencies between them. Theoretically, it might be right, but practically will lead to a huge duplication of the code. The API server does depend on the KubeRay operator The fact that these functions are not used by KubeRay means they are not tested by end-to-end tests and not utilized by users. Therefore, it is difficult to ascertain whether they work or not. That's why I was testing them. Besides, they are used by KubeRay operator Using the dashboard HTTP client in KubeRay for communication with Ray doesn't seem to make much sense to me. The KubeRay API server can directly talk to Ray. It does talk to Ray directly using the dashboard HTTP client as a code to do this

blublinsky avatar Dec 14 '23 18:12 blublinsky

@kevin85421, so where do we go from here. We disagree, should we ask for the third opinion?

blublinsky avatar Dec 15 '23 19:12 blublinsky

@kevin85421 we need to come to some kind of resolution here. Leaving it hanging is not a solution @architkulkarni ? Also, Ray job support is using the same class

blublinsky avatar Dec 20 '23 08:12 blublinsky

@kevin85421 it has been over a month. Any chance you can get to this?

blublinsky avatar Jan 26 '24 20:01 blublinsky

I will remove all functions in dashboard_httpclient.go that are not used by the KubeRay operator itself before the release of KubeRay v1.1.0. I don't want to have any unnecessary dependencies between KubeRay and KubeRay API server. In addition, I may also consider to move the API server to a separate repository.

We have 3 possible options to clean up dashboard_httpclient.go

  • I revert #1639.
  • I open a PR for a dashboard client for API server.
  • You can open a PR for a dashboard client for API server.

kevin85421 avatar Feb 01 '24 06:02 kevin85421

I will remove all functions in dashboard_httpclient.go that are not used by the KubeRay operator itself before the release of KubeRay v1.1.0. I don't want to have any unnecessary dependencies between KubeRay and KubeRay API server. In addition, I may also consider to move the API server to a separate repository.

There are 2 methods added there:

  1. DeployApplication(context.Context, string) error which was copied from driver, thus making it simpler and cleaner trivial 10 lines of code
  2. DeleteServeApplications(context.Context) error, which is 15 trivial lines of code

The biggest change is updating the return definition to align it with the definition of the serve implementation

Do whatever you have to do. This is taking way too long

blublinsky avatar Feb 01 '24 09:02 blublinsky

@kevin85421 moved dashboard_httpclient.go API server specific version and supporting files to the API server. Can we, please, finally merge it

blublinsky avatar Mar 01 '24 11:03 blublinsky

@kevin85421 the only change to the ray operator is in pod.go to capitalize initLivenessAndReadinessProbe to make it visible

blublinsky avatar Mar 01 '24 12:03 blublinsky