[Umbrella] Autoscaler improvements

Open kevin85421 opened this issue 1 year ago • 14 comments

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

Description

TODO:

  • Define the scope of Autoscaler V2 beta
  • List the important issues that need to be solved (V1 & V2).

This umbrella issue covers two topics:

  • Autoscaler V2 towards beta
  • Autoscaler stability improvements (V1 + V2)

Reliability

Top priority:

  • Autoscaler should not terminate worker Pods with running Actors or Tasks
  • Autoscaler should not crash because of CR spec
  • A Job should be able to finish.
  • [x] https://github.com/ray-project/ray/pull/48481
  • [x] https://github.com/ray-project/ray/issues/46172
  • [x] https://github.com/ray-project/kuberay/issues/2612
  • [x] https://github.com/ray-project/ray/issues/48950
  • [x] https://github.com/ray-project/ray/issues/40212
  • [x] https://github.com/ray-project/kuberay/issues/2385
  • [x] https://github.com/ray-project/ray/pull/48909
  • [x] https://github.com/ray-project/ray/pull/49150
  • [x] https://github.com/ray-project/ray/pull/48513
  • [x] https://github.com/ray-project/ray/pull/48519
  • [x] https://github.com/ray-project/ray/pull/48623
  • [x] https://github.com/ray-project/ray/pull/48541
  • [x] https://github.com/ray-project/ray/issues/50783
  • [x] https://github.com/ray-project/ray/issues/51321
  • [x] https://github.com/ray-project/ray/issues/50868
  • [x] https://github.com/ray-project/ray/issues/52264
  • [x] https://github.com/ray-project/ray/pull/52200
  • [x] https://github.com/ray-project/ray/pull/52409
  • [x] https://github.com/ray-project/ray/pull/52769
  • [ ] https://github.com/ray-project/ray/issues/35873 (GCS FT)
  • [ ] https://github.com/ray-project/ray/issues/51585
  • [ ] https://github.com/ray-project/ray/issues/50259
  • [ ] https://github.com/ray-project/ray/issues/45775
  • [ ] https://github.com/ray-project/ray/issues/45373
  • [ ] https://github.com/ray-project/ray/issues/40911
  • [ ] https://github.com/ray-project/ray/issues/36926
  • [ ] https://github.com/ray-project/ray/issues/35560

Usability

  • [x] https://github.com/ray-project/ray/pull/48813
  • [x] https://github.com/ray-project/ray/issues/49501
  • [ ] https://github.com/ray-project/ray/issues/49200
  • [ ] https://github.com/ray-project/ray/issues/47248
  • [ ] https://github.com/ray-project/ray/issues/36836

Testing

  • [ ] https://github.com/ray-project/kuberay/issues/2173

Observability / Debuggability

  • [x] https://github.com/ray-project/ray/pull/48905
  • [x] https://github.com/ray-project/ray/issues/37959
  • [x] https://github.com/ray-project/ray/issues/37856 (@ryanaoleary)
  • [x] https://github.com/ray-project/ray/pull/51192
  • [x] https://github.com/ray-project/ray/issues/52361
  • [ ] De-noise autoscaler logs. Currently, the autoscaler loops with Fetched pod data, outputting the state of the RayCluster even with no changes to the resources requested or allocated. This can make it fairly difficult to debug autoscaler logs. It'd be useful to provide the option to output relevant logs only on Autoscaler updates. (issue to be created)
    • Not urgent. We can add an env var to configure it.
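
A minimal sketch of the de-noising idea above, assuming the autoscaler can hand a JSON-serializable snapshot of the cluster state to a logging helper on each loop iteration. The env var name `AUTOSCALER_LOG_ONLY_ON_CHANGE` and the `ChangeOnlyLogger` class are hypothetical and are not part of Ray or KubeRay; this only illustrates logging on state changes behind a toggle.

```python
import hashlib
import json
import logging
import os

logger = logging.getLogger("autoscaler_sketch")

# Hypothetical env var name; not an existing Ray or KubeRay setting.
LOG_ONLY_ON_CHANGE = os.environ.get("AUTOSCALER_LOG_ONLY_ON_CHANGE", "1") == "1"


class ChangeOnlyLogger:
    """Logs a state snapshot only when it differs from the previous one."""

    def __init__(self, only_on_change=LOG_ONLY_ON_CHANGE):
        self.only_on_change = only_on_change
        self._last_digest = None

    def log_state(self, state):
        """Log `state` if it changed (or if de-noising is off). Returns True if logged."""
        # Serialize with sorted keys so dict ordering does not look like a change.
        payload = json.dumps(state, sort_keys=True).encode()
        digest = hashlib.sha256(payload).hexdigest()
        changed = digest != self._last_digest
        self._last_digest = digest
        if changed or not self.only_on_change:
            logger.info("Cluster state: %s", state)
            return True
        return False
```

With the toggle on, repeated iterations that fetch identical pod data produce a single log line, and a new line appears only when the requested or allocated resources change.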

Refactor

  • [x] https://github.com/ray-project/ray/pull/48919
  • [x] https://github.com/ray-project/ray/pull/48840
  • [x] https://github.com/ray-project/ray/pull/48566
  • [x] https://github.com/ray-project/ray/pull/49581
  • [x] https://github.com/ray-project/ray/pull/49812
  • [ ] Update doc for v2 if v2 is used by default

Backlogs

  • [ ] https://github.com/ray-project/kuberay/issues/2666 (relies on KubeRay v1.3)
  • [ ] https://github.com/ray-project/ray/issues/39735
    • install_ray is only used when disable_node_updaters is false, so KubeRay doesn't use this.
  • [ ] https://github.com/ray-project/ray/issues/39987
  • [ ] https://github.com/ray-project/ray/issues/37838

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

kevin85421 avatar Dec 04 '24 01:12 kevin85421

/assign @ryanaoleary

kevin85421 avatar Dec 04 '24 01:12 kevin85421

TODO: I'll leave a comment outlining the V2 beta scope and remaining issues to solve for V1 & V2.

ryanaoleary avatar Dec 04 '24 19:12 ryanaoleary

@ryanaoleary thanks! You can compile a list of issues, and we can schedule a meeting to go through them one by one.

kevin85421 avatar Dec 04 '24 19:12 kevin85421

The issues that I think we ought to complete before considering Autoscaler v2 in Beta can be broken down into observability improvements and reliability bug-fixes.

Observability:

  • Provide an API to persist worker logs after autoscaler scale-down. This is covered in the export API proposal (it may fall outside the scope of KubeRay v1.3)
  • De-noise autoscaler logs. Currently, the autoscaler loops with Fetched pod data, outputting the state of the RayCluster even with no changes to the resources requested or allocated. This can make it fairly difficult to debug autoscaler logs. It'd be useful to provide the option to output relevant logs only on Autoscaler updates. (issue to be created)
  • Direct formatting to output ray status: https://github.com/ray-project/ray/issues/37856
  • Refactor request resources in autoscaler logs: https://github.com/ray-project/ray/issues/37959
  • Generate node IDs: https://github.com/ray-project/ray/issues/19086
  • Provide an API to expose autoscaler state: https://github.com/ray-project/ray/issues/37838

Reliability:

  • Maintain correctness of cluster state version: https://github.com/ray-project/ray/issues/35873
  • Reserve custom accelerators: https://github.com/ray-project/ray/issues/43079
  • Add further v2 autoscaler tests in Ray Core and e2e v2 Autoscaler tests using the latest Ray image in KubeRay.
  • https://github.com/ray-project/ray/issues/39735

Several of the completed issues mentioned in the issue description fix the main reliability issues found within the v1 Autoscaler, and from my manual testing with multiple CPU and GPU worker-groups I've seen consistent behavior. Additionally, new features in the v2 autoscaler, such as configuring idle node timeouts by node type, will give users more fine-grained control over their workloads and reduce the number of autoscaling errors we were previously seeing. We should also make reliable testing for CPUs, GPUs, and custom accelerators a requirement before considering v2 beta.
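
To illustrate what per-node-type idle timeouts mean in practice, here is a small sketch under stated assumptions. The `Node` dataclass, the `nodes_to_terminate` function, and the default timeout value are all hypothetical and do not reflect the actual v2 autoscaler code; the sketch only shows how a timeout keyed by node type changes which idle nodes become scale-down candidates.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    node_type: str       # e.g. the worker group name
    idle_seconds: float  # how long the node has had no running work


def nodes_to_terminate(nodes, idle_timeout_by_type, default_timeout_s=60.0):
    """Return IDs of nodes idle for at least their node type's timeout."""
    candidates = []
    for node in nodes:
        # Fall back to a cluster-wide default when a type has no override.
        timeout = idle_timeout_by_type.get(node.node_type, default_timeout_s)
        if node.idle_seconds >= timeout:
            candidates.append(node.node_id)
    return candidates


nodes = [
    Node("cpu-1", "cpu-group", idle_seconds=120),
    Node("gpu-1", "gpu-group", idle_seconds=120),
]
# GPU nodes get a longer timeout, so only the CPU node is a candidate.
print(nodes_to_terminate(nodes, {"cpu-group": 60, "gpu-group": 300}))
```

A longer timeout for expensive-to-start node types (GPU or custom accelerator groups, for example) avoids churn, while short timeouts on cheap CPU groups keep costs down.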

ryanaoleary avatar Dec 10 '24 08:12 ryanaoleary

TODO: confirm the following two issues.

  • https://github.com/ray-project/ray/issues/19086
  • https://github.com/ray-project/ray/issues/43079

kevin85421 avatar Dec 11 '24 21:12 kevin85421

https://github.com/ray-project/kuberay/issues/2999#issuecomment-2676146157

https://github.com/ray-project/kuberay/issues/2999#issuecomment-2649335659

kevin85421 avatar Feb 25 '25 23:02 kevin85421

Tracking issue for the e2e upgrade tests: https://github.com/ray-project/kuberay/issues/2561

ryanaoleary avatar Feb 26 '25 07:02 ryanaoleary

V2 Autoscaler issues and bug fixes:

  • https://github.com/ray-project/ray/issues/50259
  • https://github.com/ray-project/ray/issues/50868

ryanaoleary avatar Feb 26 '25 07:02 ryanaoleary

Add one more issue for autoscaler v2 https://github.com/ray-project/ray/issues/51321.

rueian avatar Mar 13 '25 01:03 rueian

Hey folks @kevin85421, @rueian and @ryanaoleary, great work here. I know I am late to the party, but I'm very interested in joining the team to help. Let me know if there is any task I can pick up. I have experience in k8s and observability.

Thanks @nadongjun for pointing me to this.

If you have any document, RFC, or design doc I can read, that would be great.

bhks avatar Apr 11 '25 01:04 bhks

@bhks Thank you for reaching out! You can check the user guide and the design doc for more details. PRs are welcome. I suggest starting with Autoscaler V2 on KubeRay first. This is a high priority for me at the moment, so related PRs will be reviewed faster. In addition, starting with small PRs makes it easier for them to be merged.

kevin85421 avatar Apr 11 '25 04:04 kevin85421

Thank you @kevin85421. Do you have any task in mind that I can start with?

bhks avatar Apr 11 '25 05:04 bhks

Add a new one https://github.com/ray-project/ray/issues/52361. I will open a PR soon.

rueian avatar Apr 16 '25 07:04 rueian

The scope for v1.4.0 has already been completed. Updating the tag from v1.4.0 to v1.5.0.

kevin85421 avatar May 31 '25 00:05 kevin85421