kuberay
[RayJob][Feature] Add lightweight job submitter in KubeRay image
Why are these changes needed?
As noted in https://github.com/ray-project/kuberay/issues/2537, when a user creates a RayJob CR, KubeRay currently starts a separate container from the same image as the RayCluster to submit the Ray job. If that container is scheduled on a node where the image is not preloaded, startup is slow because the image is usually large and has to be downloaded first.
This PR adds a lightweight submitter (45MB) to the KubeRay image. It mimics the behavior of ray job submit (submit, then tail logs), and the KubeRay image is usually smaller than the image used in the RayCluster. Users can try it via the submitterPodTemplate in their RayJob CR.
Example RayJob CR yaml:
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  rayClusterSpec:
    ...
  submitterPodTemplate:
    spec:
      restartPolicy: Never
      containers:
        - name: my-custom-rayjob-submitter-pod
          image: kuberay/submitter:nightly
          args: ["--runtime-env-json", '{"pip":["requests==2.26.0","pendulum==2.1.2"],"env_vars":{"counter_name":"test_counter"}}', "--", "python", "/home/ray/samples/sample_code.py"]
In addition, this submitter does not fail when the job has already been submitted, so it also resolves https://github.com/ray-project/kuberay/issues/2154.
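The "submit, then tail logs, and tolerate a duplicate submission" behavior described above could look roughly like the following Go sketch. It is not the code in this PR: the `JobClient` interface, the `ErrAlreadyExists` sentinel, the `fakeClient` stand-in, and the job ID `rayjob-sample-abc` are all hypothetical names introduced for illustration, and a real submitter would talk to the Ray dashboard rather than an in-memory map.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAlreadyExists is a hypothetical sentinel for "this submission ID is taken".
var ErrAlreadyExists = errors.New("job already exists")

// JobClient is a hypothetical interface standing in for a Ray dashboard client.
type JobClient interface {
	Submit(id, entrypoint string) error
	TailLogs(id string) ([]string, error)
}

// SubmitAndTail submits the job, treats a duplicate submission as success,
// and then tails the logs in either case.
func SubmitAndTail(c JobClient, id, entrypoint string) (bool, []string, error) {
	submitted := false
	err := c.Submit(id, entrypoint)
	switch {
	case err == nil:
		submitted = true
	case errors.Is(err, ErrAlreadyExists):
		// The job was submitted earlier (e.g. by a restarted submitter Pod);
		// skip resubmission and go straight to tailing logs.
	default:
		return false, nil, fmt.Errorf("submit %q: %w", id, err)
	}
	logs, err := c.TailLogs(id)
	if err != nil {
		return submitted, nil, fmt.Errorf("tail logs for %q: %w", id, err)
	}
	return submitted, logs, nil
}

// fakeClient is an in-memory stand-in used only to exercise the sketch.
type fakeClient struct{ jobs map[string]string }

func (f *fakeClient) Submit(id, entrypoint string) error {
	if _, ok := f.jobs[id]; ok {
		return ErrAlreadyExists
	}
	f.jobs[id] = entrypoint
	return nil
}

func (f *fakeClient) TailLogs(id string) ([]string, error) {
	ep, ok := f.jobs[id]
	if !ok {
		return nil, errors.New("job not found")
	}
	return []string{"running: " + ep}, nil
}

func main() {
	c := &fakeClient{jobs: map[string]string{}}
	// The second attempt simulates a retried submitter: it must not fail.
	for attempt := 1; attempt <= 2; attempt++ {
		submitted, logs, err := SubmitAndTail(c, "rayjob-sample-abc", "python /home/ray/samples/sample_code.py")
		fmt.Println(attempt, submitted, logs, err)
	}
}
```

The key design choice is that only the "already exists" error is swallowed; any other submission error still surfaces, so a genuinely failed submission is not mistaken for a duplicate.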
Related issue number
https://github.com/ray-project/kuberay/issues/2537
Checks
- [x] I've made sure the tests are passing.
- Testing Strategy
- [x] Unit tests
- [x] Manual tests
Are we planning to merge this PR and https://github.com/ray-project/kuberay/pull/2579?
I would suggest holding on this until we hear feedback about https://github.com/ray-project/kuberay/pull/2579 and whether it addresses concerns with RayJob
Just to note: although both PRs solve the duplicate submission issue, this lightweight submitter can further shorten startup time because its image is smaller.
Makes sense, but I'm concerned about kuberay operator image becoming a dependency at the cluster / job level. If we think this is worth doing, we should probably create a new image
Hi @kevin85421, I have used a new GitHub action job to build a dedicated image for the submitter but the job requires credentials which I believe are only available after merging the PR.
Do you have any suggested way to test the GitHub action job before merging the PR? Or probably we just merge it first?
IMO I don't think we need this with https://github.com/ray-project/kuberay/pull/2579 merged. Or at least we can revisit after v1.3 based on user feedback
IMO I don't think we need this with #2579 merged. Or at least we can revisit after v1.3 based on user feedback
The lightweight job submitter still has its own benefits (e.g., much faster image pulling), but I agree that we can revisit this based on the feedback from v1.3 to determine if the image pulling overhead of the K8s Job Submitter is problematic. If users always run the submitter on a K8s node that caches the Ray image, the lightweight submitter may not be necessary.
@kevin85421 any news about this feature? It would be very useful in an AWS Fargate environment, because images are always pulled whenever a new Ray Pod is created.
@kevin85421 We have run into a similar problem recently, and it triggers controller-runtime's exponential backoff retry. Because pulling the submitter image takes a long time, GetRayjobInfo keeps returning 404 and the request is constantly re-queued. After several retries, the retry interval reaches 5 minutes, so even once the submitter is ready, the state of the RayJob CR cannot advance (it is still waiting for the next retry). I think this hurts the efficiency of the job state machine, so I hope this can be given higher priority.
Closing this since @owenowenisme has done this in https://github.com/ray-project/kuberay/pull/3943