
[core][autoscaler][v1] drop object_store_memory from ResourceDemandScheduler._update_node_resources_from_runtime

Open rueian opened this issue 6 months ago • 3 comments

Why are these changes needed?

As described in issue https://github.com/ray-project/ray/issues/53027, the autoscaler chooses the wrong node type when scaling up a second time.

Say we have the following available_node_types:

available_node_types:
  ray.worker.4090.standard:
    min_workers: 0
    max_workers: 5
    resources: {"CPU": 16, "GPU": 1, "memory": 30107260928, "gram": 24}
    node_config: {}

  ray.worker.4090.highmem:
    min_workers: 0
    max_workers: 5
    resources: {"CPU": 16, "GPU": 1, "memory": 62277025792, "gram": 24}
    node_config: {}

Suppose the cluster already has a fully busy ray.worker.4090.standard node. If we then submit a new task whose resource requirements match ray.worker.4090.standard, the current autoscaler unexpectedly launches a new ray.worker.4090.highmem node instead of a new ray.worker.4090.standard node.

Root Cause

The root cause is that when there is already a ray.worker.4090.standard node in the cluster, the autoscaler will try to fetch "object_store_memory" from the node and merge it into the node resources definition. https://github.com/ray-project/ray/blob/3fd6015d8f925d5dc7f61e234fa9f4cf1781c578/python/ray/autoscaler/_private/resource_demand_scheduler.py#L376-L379

The merged resources definition is then used to compute the score for choosing a node type when scaling up. However, "object_store_memory" is now an unused resource on ray.worker.4090.standard, which makes that node type less preferable to the autoscaler and results in ray.worker.4090.highmem being chosen instead.

Solution

Since "object_store_memory" for task/actor scheduling was already deprecated almost 3 years ago in https://github.com/ray-project/ray/pull/26252, I think we can just stop fetching "object_store_memory" from the node.

This PR removes "object_store_memory" from https://github.com/ray-project/ray/blob/3fd6015d8f925d5dc7f61e234fa9f4cf1781c578/python/ray/autoscaler/_private/resource_demand_scheduler.py#L376-L379

and adds a test case based on the case provided in the issue https://github.com/ray-project/ray/issues/53027.
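
To make the change concrete, here is a minimal sketch of the merge step, not the actual code in resource_demand_scheduler.py. The helper name merge_runtime_resources, the exact key lists, and the object_store_memory value are assumptions for illustration only. It assumes the merge copies selected keys reported by a running node into the declared resources of its node type.

```python
from typing import Dict, Iterable

Resources = Dict[str, float]


def merge_runtime_resources(
    declared: Resources, runtime: Resources, keys: Iterable[str]
) -> Resources:
    """Copy the selected keys from a running node into the declared resources.

    Hypothetical helper: it only illustrates the idea of the merge step.
    """
    merged = dict(declared)
    for key in keys:
        if key in runtime:
            merged[key] = runtime[key]
    return merged


# Declared resources of ray.worker.4090.standard, taken from the config above.
declared = {"CPU": 16, "GPU": 1, "memory": 30107260928, "gram": 24}

# What a running node might report. The object_store_memory value is made up.
runtime = {"memory": 30107260928, "object_store_memory": 8_000_000_000}

# Before this PR (assumed key list): object_store_memory leaks into the definition.
before = merge_runtime_resources(declared, runtime, ["memory", "object_store_memory"])

# After this PR (assumed key list): the definition stays as declared.
after = merge_runtime_resources(declared, runtime, ["memory"])

print(sorted(before))
print(sorted(after))
```

With object_store_memory dropped from the key list, a node type with running instances is scored on the same set of resources as a node type without any.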

Related issue number

Closes https://github.com/ray-project/ray/issues/53027

Checks

  • [x] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

rueian avatar May 23 '25 21:05 rueian

cc @kevin85421 for review.

rueian avatar May 24 '25 04:05 rueian

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions[bot] avatar Jun 17 '25 00:06 github-actions[bot]

Not stale

rueian avatar Jun 17 '25 00:06 rueian

Offline discussion: num_matching_resource_types is useless. @rueian will open a follow up PR to remove it.

kevin85421 avatar Jul 14 '25 20:07 kevin85421

Offline discussion:

Issue statement

  • If a node type has running instances, the instance's memory and object_store_memory will be added to node_types[node_type]["resources"], unless the user has explicitly specified them in the ray start command. https://github.com/ray-project/ray/blob/3fd6015d8f925d5dc7f61e234fa9f4cf1781c578/python/ray/autoscaler/_private/resource_demand_scheduler.py#L376-L379

  • If a node type doesn't have any running instances, memory and object_store_memory will not be taken into consideration when computing scores for node types to determine which one is best to scale up.

  • score computation:

    • The function _resource_based_utilization_scorer returns four values:
      • (1) gpu_ok: avoid scheduling non-GPU workloads to GPU nodes
      • (2) num_matching_resource_types: The number of resource types that are specified in both the node type and the resource demands. For example, if a node has {A: 10, B: 20, C: 30, memory: 1000, object_store_memory: 1000} and the resource demands are {A: 1, B: 1, C: 1}, the value of num_matching_resource_types will be 3.
      • (3) min(util_by_resources): the minimum resource utilization of all resource types.
        • In this issue, ray.worker.4090.standard has a running instance, and the resource object_store_memory has the lowest utilization, 0, because Ray tasks currently do not support scheduling with object_store_memory. For ray.worker.4090.highmem, because it doesn't have any running instances, the score will only take "CPU": 16, "GPU": 1, "memory": 62277025792, "gram": 24 into consideration, and the utilization is not zero, so ray.worker.4090.highmem will be selected to scale up.
      # Prioritize avoiding gpu nodes for non-gpu workloads first,
      # then prioritize matching multiple resource types,
      # then prioritize using all resources,
      # then prioritize overall balance of multiple resources.
      return (
          gpu_ok,
          num_matching_resource_types,
          min(util_by_resources),
          # util_by_resources should be non empty
          float(sum(util_by_resources)) / len(util_by_resources),
      )
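
      The ranking described above can be reproduced with a simplified sketch. This is not the real _resource_based_utilization_scorer: it handles a single demand bundle, uses the plain used/total fraction as the utilization, and the demand and the object_store_memory value are hypothetical.

```python
from typing import Dict, Optional, Tuple

Resources = Dict[str, float]
Score = Tuple[bool, int, float, float]


def simple_score(node: Resources, demand: Resources) -> Optional[Score]:
    """Simplified stand-in for the four-value score (assumption, not Ray's code)."""
    # The demand must fit on the node at all.
    if any(node.get(k, 0) < v for k, v in demand.items()):
        return None
    utils = []
    num_matching = 0
    for name, total in node.items():
        if total < 1:
            continue  # avoid dividing by zero
        used = demand.get(name, 0)
        if used > 0:
            num_matching += 1
        utils.append(used / total)
    if not utils:
        return None
    # Avoid GPU nodes for workloads that do not ask for a GPU.
    gpu_ok = not (node.get("GPU", 0) > 0 and "GPU" not in demand)
    return (gpu_ok, num_matching, min(utils), sum(utils) / len(utils))


standard = {"CPU": 16, "GPU": 1, "memory": 30107260928, "gram": 24}
highmem = {"CPU": 16, "GPU": 1, "memory": 62277025792, "gram": 24}
# Hypothetical demand that exactly matches the standard node type.
demand = dict(standard)

# A running standard node adds object_store_memory (made-up value) to its type.
standard_merged = {**standard, "object_store_memory": 8_000_000_000}

# Without the merged key, standard ranks higher (tuples compare element-wise).
assert simple_score(standard, demand) > simple_score(highmem, demand)
# With the merged key, min(utils) drops to 0 and highmem ranks higher.
assert simple_score(standard_merged, demand) < simple_score(highmem, demand)
```

      In this sketch the first two tuple elements tie for both node types, so the decision falls to min(util_by_resources), which is 0 for the merged standard definition.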
      

kevin85421 avatar Jul 14 '25 23:07 kevin85421

To add: the motivation for min(util_by_resources) is to avoid launching a node with resources that the task does not need. For example, suppose a task asks for 2 GPUs and the autoscaler must choose between the following node types to scale up:

A: [GPU: 6]
B: [GPU: 2, TPU: 1]

Node type A should still be selected.
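
A small sketch of this example, using the plain used/total fraction as the utilization (a simplification; the real scorer may weight utilization differently). The helper name min_and_mean_util is hypothetical. It assumes gpu_ok and num_matching_resource_types tie for both node types, so only the last two score terms are compared.

```python
from typing import Dict, Tuple

Resources = Dict[str, float]


def min_and_mean_util(node: Resources, demand: Resources) -> Tuple[float, float]:
    """Return (min utilization, mean utilization) over the node's resources."""
    utils = [demand.get(name, 0) / total for name, total in node.items() if total >= 1]
    return (min(utils), sum(utils) / len(utils))


node_a = {"GPU": 6}
node_b = {"GPU": 2, "TPU": 1}
demand = {"GPU": 2}

score_a = min_and_mean_util(node_a, demand)
score_b = min_and_mean_util(node_b, demand)

# The unused TPU gives B a minimum utilization of 0, so A ranks higher.
assert score_a > score_b
# On mean utilization alone, B would rank higher than A.
assert score_b[1] > score_a[1]
```

The min term is what keeps A ahead here: B's idle TPU pulls its minimum utilization to zero even though its GPUs would be fully used.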

rueian avatar Jul 15 '25 01:07 rueian