[Core] Placement group creation hangs when creating pgs in parallel
What happened + What you expected to happen
When we try to create multiple placement groups that compete for the same resources, placement group creation hangs.
Versions / Dependencies
master
Reproduction script
import ray
from ray.util.placement_group import (
    placement_group,
    placement_group_table,
    remove_placement_group
)

ray.init(num_cpus=12, num_gpus=1, resources={'guppy-mm2': 1})

@ray.remote(num_gpus=1, resources={'guppy-mm2': 0.01})
def task1(input):
    return 1

@ray.remote(num_gpus=1, resources={'guppy-mm2': 0.01})
def task2(input):
    return 2

@ray.remote
def manage_tasks(input):
    pg = placement_group([{'guppy-mm2': 0.01, 'CPU': 1, 'GPU': 1}], strategy="STRICT_PACK")
    ray.get(pg.ready())
    print(f"jjyao pg is ready {input}")
    pg_strategy = ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy(placement_group=pg)
    future1 = task1.options(scheduling_strategy=pg_strategy).remote(input)
    output = ray.get(task2.options(scheduling_strategy=pg_strategy).remote(future1))
    remove_placement_group(pg)
    print(f"jjyao pg is removed")
    return output

outputs = ray.get([manage_tasks.remote(i) for i in range(100)])
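Assuming the hang is specific to many placement groups competing at once, a serialized driver loop like the sketch below (my rewrite of the last line above, not part of the original report) should not get stuck, since at most one placement group is pending at any moment:

# Sketch: create the placement groups one at a time instead of 100 in parallel.
outputs = []
for i in range(100):
    outputs.append(ray.get(manage_tasks.remote(i)))

While the parallel version is stuck, the autoscaler reports zero resource usage but 13+ pending placement groups: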
======== Autoscaler status: 2022-08-30 04:03:17.970564 ========
Node status
---------------------------------------------------------------
Healthy:
1 node_3b6d2c891c0a10fe7ef0c48be89197e540f0401c1ebf30cd3b34b334
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/12.0 CPU
0.0/1.0 GPU
0.0/1.0 guppy-mm2
0.00/17.739 GiB memory
0.00/8.870 GiB object_store_memory
Demands:
{'CPU': 1.0, 'guppy-mm2': 0.01, 'GPU': 1.0} * 1 (STRICT_PACK): 13+ pending placement groups
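The zero usage can also be cross-checked from a second driver attached to the same cluster. The calls below are standard Ray APIs; attaching via address="auto" and the expected values in the comments are my assumptions:

import ray

ray.init(address="auto")  # attach to the already-running cluster
print(ray.cluster_resources())    # totals, e.g. CPU: 12.0, GPU: 1.0, guppy-mm2: 1.0
print(ray.available_resources())  # stays essentially equal to the totals while the PGs hang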
Issue Severity
No response
- bundles:
  - bundle_id:
      bundle_index: 0
      placement_group_id: 0bf31545adf469355ada81a31e5501000000
    node_id: ''
    unit_resources:
      CPU: 1.0
      GPU: 1.0
      guppy-mm2: 0.01
  is_detached: false
  name: ''
  placement_group_id: 0bf31545adf469355ada81a31e5501000000
  state: PENDING
  stats:
    creation_request_received_ns: '1661883444256703479'
    end_to_end_creation_latency_us: '0'
    highest_retry_delay_ms: 0.0
    scheduling_attempt: 1
    scheduling_latency_us: '0'
    scheduling_started_time_ns: '1661883444257119646'
    scheduling_state: INFEASIBLE
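For reference, a dump similar to the one above can be pulled from a driver with ray.util.placement_group.placement_group_table(). This is a sketch of mine: the PyYAML dependency is my addition, and the Python API returns a dict whose shape differs somewhat from the state dump shown above.

import yaml
import ray
from ray.util.placement_group import placement_group_table

ray.init(address="auto")  # assumption: attach to the already-running cluster
# With no argument, placement_group_table() returns info for all placement groups.
print(yaml.safe_dump(placement_group_table()))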
The table says the scheduling state is "infeasible", which doesn't make sense. We should fix this immediately.
@clay4444 do you have some bandwidth to handle this? I remember you implemented infeasible pg handling before: https://github.com/ray-project/ray/pull/16188
I will take a look at this issue in a few days.
Are there any updates or potential timelines for this issue?
I am currently busy with some urgent things, so it may take a while to locate this problem. If this issue has a high impact, I can raise its priority and look at it sooner.
Thank you for your response. As described in the initial issue, I am able to work around the problem by merging sequential tasks together and using node_strategy. That said, my patch is not ideal and would benefit a lot from working parallel placement groups.
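A minimal sketch of that kind of workaround (the merged_task name and the inlined bodies are mine; the node-strategy part is omitted): merging task1 and task2 into a single remote task means each pipeline issues one direct resource request and needs no per-pipeline placement group.

@ray.remote(num_gpus=1, resources={'guppy-mm2': 0.01})
def merged_task(input):
    # formerly task1
    intermediate = 1
    # formerly task2, which consumed task1's output (unused in this toy example)
    return 2

outputs = ray.get([merged_task.remote(i) for i in range(100)])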
OK, thanks for your patch.
I am looking at this now
cc @scv119 (since I need to prioritize other interrupts)
cc @rkooo567 @scv119 Then I'll take a look at this issue before this weekend. If there is no result by next week, you can assign it to someone else.
The problem is here. After this change it is solved. I will post a PR with supplementary test cases in the next couple of days.