[Core] Placement group creation hangs when creating PGs in parallel

Open jjyao opened this issue 2 years ago • 6 comments

What happened + What you expected to happen

When we try to create multiple placement groups that compete for the same resources, placement group creation hangs.

Versions / Dependencies

master

Reproduction script

import ray
from ray.util.placement_group import (
    placement_group,
    placement_group_table,
    remove_placement_group,
)
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(num_cpus=12, num_gpus=1, resources={'guppy-mm2': 1})

@ray.remote(num_gpus=1, resources={'guppy-mm2': 0.01})
def task1(input):
    return 1

@ray.remote(num_gpus=1, resources={'guppy-mm2': 0.01})
def task2(input):
    return 2

# Each call creates its own single-bundle placement group, runs two GPU tasks
# inside it, then removes the group; many of these compete for the single GPU.
@ray.remote
def manage_tasks(input):
    pg = placement_group([{'guppy-mm2': 0.01, 'CPU': 1, 'GPU': 1}], strategy="STRICT_PACK")
    ray.get(pg.ready())
    print(f"jjyao pg is ready {input}")
    pg_strategy = PlacementGroupSchedulingStrategy(placement_group=pg)

    future1 = task1.options(scheduling_strategy=pg_strategy).remote(input)
    output = ray.get(task2.options(scheduling_strategy=pg_strategy).remote(future1))

    remove_placement_group(pg)
    print(f"jjyao pg is removed")
    return output

# Launch 100 manage_tasks in parallel; placement group creation hangs.
outputs = ray.get([manage_tasks.remote(i) for i in range(100)])

Autoscaler status output while the script is hanging:

======== Autoscaler status: 2022-08-30 04:03:17.970564 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_3b6d2c891c0a10fe7ef0c48be89197e540f0401c1ebf30cd3b34b334
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/12.0 CPU
 0.0/1.0 GPU
 0.0/1.0 guppy-mm2
 0.00/17.739 GiB memory
 0.00/8.870 GiB object_store_memory

Demands:
 {'CPU': 1.0, 'guppy-mm2': 0.01, 'GPU': 1.0} * 1 (STRICT_PACK): 13+ pending placement groups

Issue Severity

No response

jjyao commented Aug 30 '22

-   bundles:
    -   bundle_id:
            bundle_index: 0
            placement_group_id: 0bf31545adf469355ada81a31e5501000000
        node_id: ''
        unit_resources:
            CPU: 1.0
            GPU: 1.0
            guppy-mm2: 0.01
    is_detached: false
    name: ''
    placement_group_id: 0bf31545adf469355ada81a31e5501000000
    state: PENDING
    stats:
        creation_request_received_ns: '1661883444256703479'
        end_to_end_creation_latency_us: '0'
        highest_retry_delay_ms: 0.0
        scheduling_attempt: 1
        scheduling_latency_us: '0'
        scheduling_started_time_ns: '1661883444257119646'
        scheduling_state: INFEASIBLE

It says the scheduling state is "INFEASIBLE", which doesn't make sense: the node has 12 CPUs, 1 GPU, and 1 guppy-mm2, so the {CPU: 1, GPU: 1, guppy-mm2: 0.01} bundle is clearly feasible. We should fix it immediately.
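For reference, the table above is the kind of output returned by placement_group_table, which the reproduction script already imports. A minimal sketch of inspecting a placement group's state this way (the pending_pg name is illustrative; the bundle shape is copied from the repro script):

import ray
from ray.util.placement_group import placement_group, placement_group_table

ray.init(num_cpus=12, num_gpus=1, resources={'guppy-mm2': 1})

# Same single-bundle shape as in the reproduction script.
pending_pg = placement_group([{'guppy-mm2': 0.01, 'CPU': 1, 'GPU': 1}], strategy="STRICT_PACK")

# Passing the handle returns that group's table entry as a dict, including
# the 'state' and 'stats' fields (e.g. stats['scheduling_state']) shown in
# the dump above.
print(placement_group_table(pending_pg))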

@clay4444 do you have some bandwidth to handle this? I remember you implemented infeasible placement group handling before: https://github.com/ray-project/ray/pull/16188

rkooo567 commented Aug 30 '22

I will take a look at this issue in a few days.

larrylian commented Sep 12 '22

Are there any updates or potential timelines for this issue?

garg02 commented Sep 21 '22

> Are there any updates or potential timelines for this issue?

I am currently busy with some urgent work, so it may take a while to track down this problem. If this issue has a high impact, I can raise its priority.

larrylian commented Sep 21 '22

Thank you for your response. As described in the initial issue, I am able to work around the problem by merging sequential tasks together and using node_strategy. That said, my workaround is not ideal and would benefit a lot from parallel placement groups.
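For reference, a minimal sketch of that kind of workaround, assuming the resource shape from the reproduction script above. merged_task is a hypothetical name for a single task that does the work of task1 and task2 back to back, so no placement group is needed at all (the node_strategy part of the workaround is omitted here):

import ray

ray.init(num_cpus=12, num_gpus=1, resources={'guppy-mm2': 1})

# Hypothetical merged task: runs what task1 and task2 did, one after the other,
# so only one GPU task is in flight per input and no placement group is required.
@ray.remote(num_gpus=1, resources={'guppy-mm2': 0.01})
def merged_task(input):
    intermediate = 1  # formerly task1
    return 2          # formerly task2, consuming `intermediate`

# The 100 calls simply queue on the single GPU instead of hanging on
# placement group creation.
outputs = ray.get([merged_task.remote(i) for i in range(100)])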

garg02 commented Sep 21 '22

> Thank you for your response. As described in the initial issue, I am able to work around the problem by merging sequential tasks together and using node_strategy. That said, my workaround is not ideal and would benefit a lot from parallel placement groups.

OK, thank you for the workaround.

larrylian commented Sep 21 '22

I am looking at this now

rkooo567 commented Oct 12 '22

cc @scv119 (since I need to prioritize other interrupts)

rkooo567 commented Oct 13 '22

cc @rkooo567 @scv119 Then I'll take a look at this issue before this weekend. If there is no result by next week, you can assign it to someone else.

larrylian commented Oct 13 '22

[screenshot] The problem is here; after this change it is solved. I will open a PR with supplementary test cases in the next two days.

larrylian commented Oct 14 '22