Add Fake TPU e2e Autoscaling Test Cases

Open ryanaoleary opened this issue 6 months ago • 2 comments

Why are these changes needed?

This PR adds a fake TPU test case, similar to the existing fake GPU autoscaling test case, that uses detached actors to verify that single-host and multi-host TPU autoscaling behave as expected. The behaviors tested include:

  • (1) Creating a detached actor that requests resources: {"TPU": 4} will scale up a Ray TPU worker
  • (2) For a multi-host worker group, the number of workers created should equal replicas * numOfHosts
  • (3) Terminating detached actors scheduled on a multi-host worker group replica will cause the entire replica to be scaled down
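
The behaviors above can be sketched in Python roughly as follows. This is a minimal illustration and not the test code added in this PR: the helper names `expected_worker_pods`, `create_detached_tpu_actor`, and `terminate_detached_actor` are hypothetical, and the Ray helpers assume the caller is already connected to a Ray cluster whose worker group advertises a custom `TPU` resource.

```python
def expected_worker_pods(replicas: int, num_of_hosts: int) -> int:
    """Behavior (2): a multi-host worker group should create
    replicas * numOfHosts worker Pods."""
    if replicas < 0 or num_of_hosts < 1:
        raise ValueError("replicas must be >= 0 and num_of_hosts must be >= 1")
    return replicas * num_of_hosts


def create_detached_tpu_actor(name: str, num_chips: int = 4):
    """Behavior (1): a detached actor requesting {"TPU": num_chips}
    should make the autoscaler add a TPU worker.

    Assumption: ray.init(...) has already been called by the caller.
    """
    import ray  # imported lazily so the arithmetic helper works without Ray

    @ray.remote(resources={"TPU": num_chips})
    class TPUActor:
        def ready(self) -> bool:
            return True

    # Detached actors must be named so they can be looked up and killed later.
    actor = TPUActor.options(name=name, lifetime="detached").remote()
    # Blocks until the actor has been scheduled, i.e. until scale-up finished.
    ray.get(actor.ready.remote())
    return actor


def terminate_detached_actor(name: str) -> None:
    """Behavior (3): killing the detached actors on a multi-host replica
    should let the autoscaler remove the whole replica.

    Assumption: called from a session in the same Ray namespace as the
    one that created the actor.
    """
    import ray

    ray.kill(ray.get_actor(name))
```

For example, a worker group with `replicas: 2` and `numOfHosts: 4` would be expected to produce `expected_worker_pods(2, 4)` worker Pods, that is, 8.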

Edit: Removed the test for idle nodes being scaled down, since it requires a much higher timeout value, and scale-down of multi-host replicas is still covered by (3).

Related issue number

Checks

  • [x] I've made sure the tests are passing.
  • Testing Strategy
    • [x] Unit tests
    • [x] Manual tests
    • [ ] This PR is not tested :(

ryanaoleary, Jul 31 '24 02:07