[RFC] split resource pool

Open yyDing1 opened this issue 1 month ago • 2 comments

Motivation

In certain scenarios, we need to initialize multiple instances from a single resource pool. The approaches in https://github.com/volcengine/verl/pull/4226 and https://github.com/volcengine/verl/pull/4233 may introduce some issues. A SubRayResourcePool may be a safer implementation, because it requires no modification to RayResourcePool itself.

Proposed API change

class RayResourcePool:
    ...
    def split(self, split_size: int | list[int]) -> list["SubRayResourcePool"]:
        ...

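class SubRayResourcePool(RayResourcePool):
    """A logical slice of the bundles held by a RayResourcePool."""

    def __init__(
        self,
        start_bundle_idx: int,
        subgroup_world_size: int,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        # Offset of the first bundle in this slice.
        self.start_bundle_idx = start_bundle_idx
        # Number of bundles in this slice (its length).
        self.subgroup_world_size = subgroup_world_size

    @property
    def world_size(self) -> int:
        # Report the size of the slice rather than the size of the full pool.
        return self.subgroup_world_size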
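
As a rough illustration of how `split` might turn `split_size` into slices, here is a minimal sketch in plain Python, with no Ray dependency. The helper name `plan_split` is hypothetical and not part of verl. It assumes `torch.split`-like semantics, which the proposal does not spell out: an `int` is the size of every sub-pool, and a `list[int]` gives each sub-pool's size explicitly.

```python
from __future__ import annotations


def plan_split(world_size: int, split_size: int | list[int]) -> list[tuple[int, int]]:
    """Return (start_bundle_idx, subgroup_world_size) pairs covering the pool.

    Hypothetical helper: assumes an int means "size of each sub-pool" and
    a list means "explicit size of every sub-pool".
    """
    if isinstance(split_size, int):
        if split_size <= 0 or world_size % split_size != 0:
            raise ValueError("split_size must be positive and divide world_size")
        sizes = [split_size] * (world_size // split_size)
    else:
        sizes = list(split_size)
        if any(s <= 0 for s in sizes) or sum(sizes) != world_size:
            raise ValueError("split sizes must be positive and sum to world_size")

    slices = []
    start = 0
    for size in sizes:
        # Each slice starts where the previous one ended.
        slices.append((start, size))
        start += size
    return slices
```

Under these assumptions, each pair could be passed to `SubRayResourcePool` as `start_bundle_idx` and `subgroup_world_size`, leaving the parent pool's own state untouched.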

Expected Behavior:

The resource pool maintains the placement group and bundle information. Conceptually, it looks like:

resource_pool
    pgs = [pgs_1, pgs_2, ..., pgs_n]
    pgs_1 = [bundle_1, bundle_2, ... bundle_8]
    pgs_2 = [bundle_1, bundle_2, ... bundle_8]
    ...
    pgs_n = [bundle_1, bundle_2, ... bundle_8]

To support logical subdivision of the resource pool, I will introduce two additional fields in each sub-pool to track the slice of bundles it uses: subresource_pool.start_bundle_idx and subresource_pool.subgroup_world_size. Together they act as an offset and a length.
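
To make the offset-and-length idea concrete, here is a minimal sketch of how such a slice could be resolved to individual bundles. The function name `bundles_for_slice` is hypothetical. It assumes the offset is a flat, zero-based index running across all placement groups, and that every placement group holds 8 bundles as in the layout above. Neither assumption is confirmed by the proposal.

```python
def bundles_for_slice(
    start_bundle_idx: int,
    subgroup_world_size: int,
    bundles_per_pg: int = 8,  # assumed from the conceptual layout
) -> list[tuple[int, int]]:
    """Map a slice to (placement_group_index, bundle_index) pairs.

    Hypothetical helper: treats the offset as a flat zero-based index
    over all bundles of all placement groups.
    """
    end = start_bundle_idx + subgroup_world_size
    # divmod splits the flat index into (which group, which bundle in it).
    return [divmod(i, bundles_per_pg) for i in range(start_bundle_idx, end)]


# A slice of 4 bundles starting at flat index 6 spans two placement groups.
print(bundles_for_slice(6, 4))
```

With these assumptions, a sub-pool never copies or mutates the parent's placement groups. It only narrows the range of bundles it looks at.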

yyDing1 avatar Nov 24 '25 08:11 yyDing1

@zw0610 @vermouth1992

wuxibin89 avatar Nov 24 '25 09:11 wuxibin89

Makes sense to me

vermouth1992 avatar Nov 24 '25 11:11 vermouth1992