
[RFC] Proportional Scaling for Existing Workloads with Granular Scale-in Control

Open Mag-FelixFelicis opened this issue 1 month ago • 2 comments

Introduction

Hello RBG community! We are from the openFuyao community and are actively working on cloud-native LLM inference acceleration. We are pleased to submit an initial proposal introducing a new Custom Resource Definition (CRD): ResourceScalingGroup. ResourceScalingGroup is designed to manage existing resources within a Kubernetes cluster, binding heterogeneous resources into unified logical groups. It enables scaling operations at the group level, enhancing scaling flexibility while providing granular scale-in capabilities that allow precise deletion of specified resource groups according to custom policies.

Motivation

In real-world production environments, clusters typically host numerous independently deployed services, particularly large language model inference services. Migrating these services to new frameworks for redeployment incurs significant service restart costs. Additionally, in the current RBG architecture, when a RoleBasedGroup contains gateway-role, router-role, prefill-role, and decode-role, the prefill-role and decode-role cannot be independently scaled according to fixed ratios. Research from the Mooncake paper demonstrates that different P/D (prefill/decode) instance ratios significantly impact inference service performance.

This presents clear optimization directions:

  • Support binding existing cluster resources into unified logical groups, enabling scaling operations at the group level
  • Provide granular scale-in capabilities for resource groups, given the high caution required for scale-in operations in production environments

This proposal introduces a new CRD definition to achieve unified management of cluster resources and precise scale-in operations, effectively reducing service deployment and migration costs while enhancing system stability. This proposal primarily targets scenarios where prefill and decode (P/D) instances are deployed in a fixed ratio and require scaling operations based on resource utilization and service traffic.

Goals

  • Zero-Migration Integration: Bind existing Kubernetes resources (Deployments, StatefulSets, Instances, etc.) to unified logical groups without service restarts or reconfiguration
  • Granular Scale-in Control: Provide precise control over the termination sequence of resource groups during scale-in operations to ensure service continuity
  • Configuration Consistency: Maintain configuration synchronization between original bound resources and their scaled replicas

Non-Goals

  • Does not cover the implementation of scaling decision logic (e.g., autoscaling policies)
  • Does not involve modification of RBG interfaces or architectural refactoring


Proposal

Architecture

This section provides an overall architecture view of the ResourceScalingGroup controller.

[Architecture diagram: ResourceScalingGroup controller integrated into the rbgs-controller-manager]

Architecture Description

The ResourceScalingGroup controller will be integrated into the rbgs-controller-manager, maintaining compatibility with existing functionality. This controller will manage existing resources within the cluster, automatically synchronizing updates to scaled replicas when original resources are modified. During scale-in operations, it supports precise selection of specific resource groups for deletion.
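The synchronization behavior described above can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions: `syncGroups` is a hypothetical helper, and specs are modeled as plain strings rather than real Kubernetes objects, which an actual controller would patch via the API server.

```go
package main

import "fmt"

// syncGroups propagates an update on the originally bound resource
// (group 0) to every scaled replica group, keeping their specs in sync.
// Hypothetical sketch: real controller code would patch unstructured
// Kubernetes objects, not strings.
func syncGroups(updatedSpec string, groups map[string]string) {
	for id := range groups {
		if id == "group0" {
			continue // group 0 holds the original resource
		}
		groups[id] = updatedSpec
	}
}

func main() {
	groups := map[string]string{
		"group0": "image:v2", // just updated by the user
		"group1": "image:v1", // stale scaled replica
	}
	syncGroups(groups["group0"], groups)
	fmt.Println(groups["group1"]) // prints "image:v2"
}
```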

Feature Overview

The current RoleBasedGroupScalingAdapter only supports scaling operations for individual roles and cannot manage existing resources in the cluster as unified groups. Additionally, RoleBasedGroupScalingAdapter implements scaling by modifying the replicas parameter, which lacks precise scale-in capabilities.

This proposal extends functionality to enable unified scaling of multiple existing resources.

  • The controller watches managed resources, treating the originally bound resources as group 0.

  • Scaling is achieved by updating the replicas field, and each scaled replica belongs to a specific group. Users can perform scale-in operations on selected resource groups by configuring the sortedGroupList field based on the principle of minimizing service impact.

  • When any resource within a group is updated, corresponding resources in other groups are automatically synchronized.
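The scale-in rule in the bullets above can be sketched as a small selection function. Note this is an illustrative sketch, not the proposed implementation: `pickGroupsToDelete` and its signature are assumptions, but the rule it encodes (delete groups in order from index 0 of `sortedGroupList` when desired replicas fall below available replicas) follows the proposal.

```go
package main

import "fmt"

// pickGroupsToDelete returns the group IDs to remove during a scale-in.
// When desired >= available (scale-out or no-op), the sorted list has
// no effect and nothing is deleted. Hypothetical helper for illustration.
func pickGroupsToDelete(sortedGroupList []string, desired, available int) []string {
	if desired >= available {
		return nil
	}
	n := available - desired
	if n > len(sortedGroupList) {
		n = len(sortedGroupList)
	}
	return sortedGroupList[:n]
}

func main() {
	// Three groups exist; the user scales from 3 replicas down to 1,
	// preferring to delete group2 first, then group1.
	victims := pickGroupsToDelete([]string{"group2", "group1"}, 1, 3)
	fmt.Println(victims) // prints "[group2 group1]"
}
```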

Example

apiVersion: resourcescalinggroup.com/v1
kind: ResourceScalingGroup
metadata:
  labels:
    app.kubernetes.io/name: rsg
  name: rsg
spec:
  replicas: 1
  scalingGroupTask:
    sortedGroupList:
      - group0
  group:
    groupID: group0
    targetResources:
      - name: prefill
        kind: deployment
        version: apps/v1
        namespace: default
      - name: decode
        kind: deployment
        version: apps/v1
        namespace: default

Fields:

  • replicas: The number of resource groups to manage. In the example above, the ResourceScalingGroup instance contains a single group, and each group contains two resources: deployment-prefill and deployment-decode
  • scalingGroupTask.sortedGroupList: The list of resource groups designated for precise scale-in. When replicas < status.availableReplicas, the controller deletes resource groups in sequence starting from index 0 of this list. When replicas > status.availableReplicas, this field has no effect
  • groupID: Identifies the group to which the originally bound resources belong (group 0). The IDs of scaled replica groups are maintained automatically by the controller and are not exposed in the spec
  • targetResources: The list of existing resources to be bound and managed

Mag-FelixFelicis avatar Nov 21 '25 04:11 Mag-FelixFelicis

Hi @Mag-FelixFelicis

Thanks a lot for this detailed proposal and for sharing the context from the openFuyao community. The problem you’re trying to solve — more fine-grained scale-in control for P/D style deployments — is very aligned with what we’ve been thinking about for RBG as well.

We’re also actively exploring how RBG should better support existing LLM inference services in production, including:

  • managing already-deployed resources, and
  • providing safer and more controllable scale-in behavior.

We’d love to have a more open discussion to understand your use cases and design ideas in more depth, and to share some of our current thoughts on the RBG side.

If you’re interested, we can set up an online sync (not sure whether Dingding works for you) or use IM for more detailed technical discussions. Could you leave your email address so we can reach out and find a time that works for both of us?

cheyang avatar Nov 21 '25 04:11 cheyang

Thank you very much for your thoughtful response and for the kind invitation to collaborate! Please feel free to reach out via email at [email protected]. Looking forward to contributing to and learning from the community.

Mag-FelixFelicis avatar Nov 21 '25 08:11 Mag-FelixFelicis