[core][autoscaler][v1] drop object_store_memory from ResourceDemandScheduler._update_node_resources_from_runtime
Why are these changes needed?
As mentioned in the issue https://github.com/ray-project/ray/issues/53027, the autoscaler chooses the wrong node type when scaling up for the second time:
Say we have the following available_node_types:
```yaml
available_node_types:
  ray.worker.4090.standard:
    min_workers: 0
    max_workers: 5
    resources: {"CPU": 16, "GPU": 1, "memory": 30107260928, "gram": 24}
    node_config: {}
  ray.worker.4090.highmem:
    min_workers: 0
    max_workers: 5
    resources: {"CPU": 16, "GPU": 1, "memory": 62277025792, "gram": 24}
    node_config: {}
```
And our cluster already has a fully busy ray.worker.4090.standard node. Then, if we submit a new task whose resource request matches ray.worker.4090.standard, the current autoscaler unexpectedly launches a new ray.worker.4090.highmem node instead of another ray.worker.4090.standard node.
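For illustration, a minimal sketch of such a request (a hypothetical `train_step` task; the decorator values simply mirror the `ray.worker.4090.standard` definition above, and the real workload in the issue may differ):

```python
import ray

# Hypothetical task sized to occupy one ray.worker.4090.standard node
# (values copied from the node type definition above; illustrative only).
@ray.remote(num_cpus=16, num_gpus=1, memory=30107260928, resources={"gram": 24})
def train_step():
    ...

# With the existing standard node fully busy, this pending request should
# scale up another standard node, but the autoscaler picks highmem instead.
ref = train_step.remote()
```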
Root Cause
The root cause is that when there is already a ray.worker.4090.standard node in the cluster, the autoscaler will try to fetch "object_store_memory" from the node and merge it into the node resources definition. https://github.com/ray-project/ray/blob/3fd6015d8f925d5dc7f61e234fa9f4cf1781c578/python/ray/autoscaler/_private/resource_demand_scheduler.py#L376-L379
Then the merged resources definition is used to compute a score for choosing a node type when scaling up. However, "object_store_memory" becomes a resource that is never demanded on ray.worker.4090.standard, which lowers its utilization score, makes it less preferable to the autoscaler, and results in choosing ray.worker.4090.highmem instead.
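For context, a minimal sketch of what happens at the linked lines, assuming the merge behavior described in the offline discussion below (a paraphrase with a guessed key set, not the actual implementation):

```python
# Paraphrased sketch (not the real code) of
# _update_node_resources_from_runtime: resource values reported by a
# running instance are copied into the node type's resource definition.
def update_node_resources_from_runtime(declared: dict, runtime_resources: dict) -> dict:
    merged = dict(declared)
    for key in ("memory", "object_store_memory"):  # key set is a guess
        if key in runtime_resources:
            merged[key] = runtime_resources[key]
    return merged
```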
Solution
Since "object_store_memory" for task/actor scheduling was already deprecated almost 3 years ago in https://github.com/ray-project/ray/pull/26252, I think we can just stop fetching "object_store_memory" from the node.
This PR removes "object_store_memory" from https://github.com/ray-project/ray/blob/3fd6015d8f925d5dc7f61e234fa9f4cf1781c578/python/ray/autoscaler/_private/resource_demand_scheduler.py#L376-L379
and adds a test case based on the case provided in the issue https://github.com/ray-project/ray/issues/53027.
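In terms of the paraphrased sketch above, the change boils down to no longer copying that key from the running node (again a paraphrase, not the actual diff):

```python
# After this PR (paraphrased): object_store_memory reported by the running
# node is no longer merged in, so it cannot drag the utilization score down.
def update_node_resources_from_runtime(declared: dict, runtime_resources: dict) -> dict:
    merged = dict(declared)
    for key in ("memory",):  # "object_store_memory" removed by this PR
        if key in runtime_resources:
            merged[key] = runtime_resources[key]
    return merged
```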
Related issue number
Closes https://github.com/ray-project/ray/issues/53027
Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(
cc @kevin85421 for review.
This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.
You can always ask for help on our discussion forum or Ray's public slack channel.
If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Not stale
Offline discussion: `num_matching_resource_types` is useless. @rueian will open a follow-up PR to remove it.
Offline discussion:
Issue statement
- If a node type has running instances, the instance's `memory` and `object_store_memory` will be added to `node_types[node_type]["resources"]`, unless the user has explicitly specified them in the `ray start` command. https://github.com/ray-project/ray/blob/3fd6015d8f925d5dc7f61e234fa9f4cf1781c578/python/ray/autoscaler/_private/resource_demand_scheduler.py#L376-L379
- If a node type doesn't have any running instances, `memory` and `object_store_memory` will not be taken into consideration when computing scores for node types to determine which one is best to scale up.
- The function `_resource_based_utilization_scorer` returns four values:
  - (1) `gpu_ok`: avoid scheduling non-GPU workloads to GPU nodes.
  - (2) `num_matching_resource_types`: the number of resource types that are specified in both the node type and the resource demands. For example, if a node has `{A: 10, B: 20, C: 30, memory: 1000, object_store_memory: 1000}` and the resource demands are `{A: 1, B: 1, C: 1}`, the value of `num_matching_resource_types` will be 3.
  - (3) `min(util_by_resources)`: the minimum resource utilization across all resource types.
    - In this issue, `ray.worker.4090.standard` has a running instance, and the resource `object_store_memory` has the lowest utilization, 0, because Ray tasks currently do not support scheduling with `object_store_memory`. For `ray.worker.4090.highmem`, because it doesn't have any running instances, the score only takes `"CPU": 16, "GPU": 1, "memory": 62277025792, "gram": 24` into consideration, and its minimum utilization is not zero, so `ray.worker.4090.highmem` will be selected to scale up (see the worked comparison sketched after this list).
  - (4) `float(sum(util_by_resources)) / len(util_by_resources)`: the average utilization across all resource types.

  ```python
  # Prioritize avoiding gpu nodes for non-gpu workloads first,
  # then prioritize matching multiple resource types,
  # then prioritize using all resources,
  # then prioritize overall balance of multiple resources.
  return (
      gpu_ok,
      num_matching_resource_types,
      min(util_by_resources),  # util_by_resources should be non empty
      float(sum(util_by_resources)) / len(util_by_resources),
  )
  ```
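To make the comparison concrete, here is a minimal, self-contained sketch that reproduces the scoring effect with the numbers from this issue. Utilization is assumed to be demand divided by capacity, the `object_store_memory` value is made up, and `score` is a rough stand-in for the real scorer, which handles more cases:

```python
# Rough reproduction of the scoring comparison described above; not the
# actual autoscaler code.
demand = {"CPU": 16, "GPU": 1, "memory": 30107260928, "gram": 24}

# The running standard node has object_store_memory merged in from the
# runtime (the exact value is illustrative).
standard = {"CPU": 16, "GPU": 1, "memory": 30107260928, "gram": 24,
            "object_store_memory": 9_000_000_000}
# The highmem type has no running instance, so only its declared
# resources are considered.
highmem = {"CPU": 16, "GPU": 1, "memory": 62277025792, "gram": 24}

def score(node_resources, demand):
    gpu_ok = demand.get("GPU", 0) > 0 or node_resources.get("GPU", 0) == 0
    num_matching = sum(1 for r in node_resources if r in demand)
    util = [demand.get(r, 0) / cap for r, cap in node_resources.items()]
    return (gpu_ok, num_matching, min(util), sum(util) / len(util))

print(score(standard, demand))  # (True, 4, 0.0, 0.8)    -> min util dragged to 0
print(score(highmem, demand))   # (True, 4, ~0.48, ~0.87) -> larger tuple wins
# Tuple comparison therefore prefers highmem, which is the bug this PR fixes.
```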
To add, the motivation of `min(util_by_resources)` is to avoid launching a node with resources that the task does not need. For example, if a task asks for 2 GPUs and the autoscaler has to choose between the following node types to scale up:
A: [GPU: 6]
B: [GPU: 2, TPU: 1]
Node type A should still be selected.
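A quick worked check of that example, under the same simplified demand-over-capacity utilization assumption:

```python
# Worked numbers for the GPU/TPU example above (illustrative only).
demand = {"GPU": 2}
node_a = {"GPU": 6}
node_b = {"GPU": 2, "TPU": 1}

def min_util(node_resources, demand):
    return min(demand.get(r, 0) / cap for r, cap in node_resources.items())

print(min_util(node_a, demand))  # 2/6 ≈ 0.33
print(min_util(node_b, demand))  # min(2/2, 0/1) = 0.0 -- the idle TPU hurts B
# Higher minimum utilization wins, so node type A is preferred.
```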