volcano feat: Add crossquota scheduler plugin for multi-resource quota management on GPU nodes

What type of PR is this?

[x] Feature
[ ] Bug fix
[ ] Documentation update
[ ] Performance improvement
[ ] Refactoring
[ ] Test addition
[ ] Other

What this PR does / why we need it:

This PR introduces a new Volcano scheduler plugin, crossquota, that enforces resource quotas on non-GPU tasks running on GPU nodes. It goes beyond CPU-only quota management and supports multiple resource types, including CPU, memory, ephemeral storage, hugepages, and custom resources.

Key Features:

- Multi-Resource Support: Unlike the previous CPU-only quota plugin, this plugin supports quotas for CPU, memory, ephemeral storage, hugepages, and any custom scalar resources
- Flexible Configuration: Supports both absolute resource quotas (e.g., "4" cores, "8Gi" memory) and percentage-based quotas (e.g., 50% of node allocatable)
- Node-Specific Overrides: Allows per-node quota configuration through Kubernetes annotations, which take priority over the global plugin configuration
- GPU Node Detection: Automatically identifies GPU nodes based on configurable resource patterns with regex support
- Real-time Resource Tracking: Monitors and enforces resource usage on GPU nodes in real time during scheduling
- Backward Compatibility: Defaults to CPU-only quotas for existing deployments
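
To illustrate the GPU node detection described above, here is a minimal, self-contained sketch. It is not the plugin's actual code: the `gpuMatcher` type, its functions, and the choice to anchor each pattern to the full resource name are assumptions made for illustration, and quantities are simplified to plain integers.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// gpuMatcher holds resource-name patterns that are compiled once,
// for example when the plugin is initialized. (Hypothetical type.)
type gpuMatcher struct {
	patterns []*regexp.Regexp
}

// newGPUMatcher compiles a comma-separated list of patterns such as
// "nvidia.com/gpu,amd.com/gpu". Anchoring each pattern to the whole
// resource name is an assumption of this sketch.
func newGPUMatcher(csv string) (*gpuMatcher, error) {
	m := &gpuMatcher{}
	for _, p := range strings.Split(csv, ",") {
		p = strings.TrimSpace(p)
		if p == "" {
			continue
		}
		re, err := regexp.Compile("^(?:" + p + ")$")
		if err != nil {
			return nil, fmt.Errorf("invalid GPU resource pattern %q: %w", p, err)
		}
		m.patterns = append(m.patterns, re)
	}
	return m, nil
}

// isGPUNode reports whether the node exposes a non-zero quantity of any
// resource whose name matches one of the compiled patterns.
func (m *gpuMatcher) isGPUNode(allocatable map[string]int64) bool {
	for name, qty := range allocatable {
		if qty <= 0 {
			continue
		}
		for _, re := range m.patterns {
			if re.MatchString(name) {
				return true
			}
		}
	}
	return false
}

func main() {
	m, err := newGPUMatcher("nvidia.com/gpu,amd.com/gpu")
	if err != nil {
		panic(err)
	}
	fmt.Println(m.isGPUNode(map[string]int64{"cpu": 64, "nvidia.com/gpu": 8})) // true
	fmt.Println(m.isGPUNode(map[string]int64{"cpu": 64}))                      // false
}
```

Compiling the patterns up front keeps the per-node check to a handful of regex matches, in line with the single-compilation behavior noted under Performance Considerations.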

Why we need it:

- Resource Fairness: Ensures fair resource allocation between GPU and non-GPU workloads on shared GPU nodes
- Cost Optimization: Prevents non-GPU workloads from consuming excessive resources on expensive GPU infrastructure
- Performance Isolation: Maintains predictable performance for GPU workloads by limiting resource contention
- Operational Flexibility: Provides both global and node-specific quota management for different deployment scenarios

Configuration Priority (highest to lowest):

1. Node annotation `volcano.sh/crossquota-<resource-name>` (absolute quota)
2. Node annotation `volcano.sh/crossquota-percentage-<resource-name>` (percentage quota)
3. Plugin argument `quota.<resource-name>` (absolute quota)
4. Plugin argument `quota-percentage.<resource-name>` (percentage quota)
5. Full allocatable resource (no quota)
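
The priority order above can be sketched as a simple ordered lookup. This is an illustrative sketch rather than the plugin's implementation: `resolveQuota` is a hypothetical helper, values are plain integers in the resource's base unit (a real plugin would parse Kubernetes quantities such as "8Gi"), and falling through to the next level when a value is invalid is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// resolveQuota returns the effective quota for one resource on one node,
// together with a label naming the source that supplied it.
// Hypothetical helper: invalid values are skipped (assumption).
func resolveQuota(resource string, allocatable int64, annotations, args map[string]string) (int64, string) {
	candidates := []struct {
		source     map[string]string
		key        string
		percentage bool
		label      string
	}{
		{annotations, "volcano.sh/crossquota-" + resource, false, "node annotation (absolute)"},
		{annotations, "volcano.sh/crossquota-percentage-" + resource, true, "node annotation (percentage)"},
		{args, "quota." + resource, false, "plugin argument (absolute)"},
		{args, "quota-percentage." + resource, true, "plugin argument (percentage)"},
	}
	for _, c := range candidates {
		raw, ok := c.source[c.key]
		if !ok {
			continue
		}
		if c.percentage {
			p, err := strconv.ParseFloat(raw, 64)
			if err != nil || p < 0 || p > 100 {
				continue // invalid percentage: try the next level
			}
			return int64(float64(allocatable) * p / 100), c.label
		}
		v, err := strconv.ParseInt(raw, 10, 64)
		if err != nil || v < 0 {
			continue // invalid absolute value: try the next level
		}
		return v, c.label
	}
	// No quota configured: the full allocatable amount is available.
	return allocatable, "full allocatable"
}

func main() {
	annotations := map[string]string{"volcano.sh/crossquota-cpu": "6"}
	args := map[string]string{"quota.cpu": "4", "quota-percentage.ephemeral-storage": "50"}

	fmt.Println(resolveQuota("cpu", 64, annotations, args))                 // 6 node annotation (absolute)
	fmt.Println(resolveQuota("ephemeral-storage", 1000, annotations, args)) // 500 plugin argument (percentage)
	fmt.Println(resolveQuota("memory", 128, annotations, args))             // 128 full allocatable
}
```

In this sketch the node annotation for CPU wins over the plugin argument, and a resource with no configuration at all keeps its full allocatable amount.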

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Implementation Details:

- The plugin implements the framework.Plugin interface with OnSessionOpen and OnSessionClose lifecycle management
- Resource tracking is done through cloned NodeInfo objects whose allocatable resources are adjusted according to the quota calculations
- Predicate functions ensure quota compliance during task scheduling
- Event handlers maintain real-time resource usage tracking for allocated/deallocated tasks
- Error handling covers invalid configurations and resource parsing failures
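
The interplay between the predicate and the event handlers can be sketched as follows. This is a simplified model under stated assumptions, not the plugin's code: the `tracker`, `nodeQuota`, and `resources` types are hypothetical stand-ins for the cloned NodeInfo objects, quantities are plain integers, and callers are assumed to pass only non-GPU tasks.

```go
package main

import "fmt"

// resources maps a resource name to a quantity in its base unit.
type resources map[string]int64

// nodeQuota holds the quota and the current non-GPU usage for one GPU node.
type nodeQuota struct {
	quota resources
	used  resources
}

// tracker keeps per-node state; nodes absent from the map are not GPU nodes.
type tracker struct {
	nodes map[string]*nodeQuota
}

// predicate rejects a task if placing it would exceed any quota on the node.
func (t *tracker) predicate(node string, req resources) error {
	nq, ok := t.nodes[node]
	if !ok {
		return nil // not a GPU node: no quota applies
	}
	for name, limit := range nq.quota {
		if nq.used[name]+req[name] > limit {
			return fmt.Errorf("node %s: %s quota exceeded (used %d + requested %d > quota %d)",
				node, name, nq.used[name], req[name], limit)
		}
	}
	return nil
}

// onAllocate records a task's usage once it has been placed on the node.
func (t *tracker) onAllocate(node string, req resources) {
	nq, ok := t.nodes[node]
	if !ok {
		return
	}
	if nq.used == nil {
		nq.used = resources{}
	}
	for name, qty := range req {
		nq.used[name] += qty
	}
}

// onDeallocate releases a task's usage, never dropping below zero.
func (t *tracker) onDeallocate(node string, req resources) {
	nq, ok := t.nodes[node]
	if !ok || nq.used == nil {
		return
	}
	for name, qty := range req {
		nq.used[name] -= qty
		if nq.used[name] < 0 {
			nq.used[name] = 0
		}
	}
}

func main() {
	t := &tracker{nodes: map[string]*nodeQuota{
		"gpu-node-1": {quota: resources{"cpu": 4000, "memory": 8 << 30}, used: resources{}},
	}}
	task := resources{"cpu": 3000, "memory": 2 << 30}

	fmt.Println(t.predicate("gpu-node-1", task)) // <nil>
	t.onAllocate("gpu-node-1", task)
	fmt.Println(t.predicate("gpu-node-1", resources{"cpu": 2000})) // rejected: 3000 + 2000 > 4000
	t.onDeallocate("gpu-node-1", task)
	fmt.Println(t.predicate("gpu-node-1", resources{"cpu": 2000})) // <nil>
}
```

Because the check is a direct map lookup plus one comparison per quota resource, the predicate stays cheap even when it runs for every candidate node.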

Testing:

- Unit test coverage: 79.7%, with test cases for all major functionality
- Tests cover plugin initialization, argument parsing, GPU node detection, CPU pod identification, quota calculation, predicate logic, and event handling
- Mock implementations of the api.Devices interface support GPU sharing scenarios
- Helper functions provide consistent test data construction

Performance Considerations:

- Minimal overhead during scheduling, as quota calculations are cached per node
- Efficient resource tracking through direct map lookups
- Regex compilation is done once during plugin initialization

Does this PR introduce a user-facing change?

feat: Add crossquota scheduler plugin for multi-resource quota management on GPU nodes

- Support for CPU, memory, ephemeral storage, hugepages, and custom resource quotas
- Flexible configuration with both absolute and percentage-based quotas
- Node-specific quota overrides through Kubernetes annotations
- Automatic GPU node detection with regex pattern support
- Real-time resource tracking and enforcement during scheduling
- Backward compatibility with existing CPU quota configurations
  
Configuration example:

```yaml
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: crossquota
    arguments:
      gpu-resource-names: "nvidia.com/gpu,amd.com/gpu"
      quota-resources: "cpu,memory,ephemeral-storage"
      quota.cpu: "4"
      quota.memory: "8Gi"
      quota-percentage.ephemeral-storage: "50"
```

Node annotation example:

```yaml
metadata:
  annotations:
    volcano.sh/crossquota-cpu: "6"
    volcano.sh/crossquota-memory: "12Gi"
    volcano.sh/crossquota-percentage-ephemeral-storage: "75"
```

Aug 18 '25 08:08 kingeasternsun

Is this meant to keep CPU workloads from occupying GPU node resources and leaving GPU tasks unschedulable? If so, https://github.com/volcano-sh/volcano/pull/4391 and https://github.com/volcano-sh/volcano/pull/4454 are another way to do it.

Aug 19 '25 01:08 Monokaix

> Is this meant to keep CPU workloads from occupying GPU node resources and leaving GPU tasks unschedulable? If so, #4391 and #4454 are another way to do it.

The ResourceFit strategy falls short of customer requirements, as it only provides a soft preference to steer CPU workloads away from GPU nodes. In contrast, crossquota introduces a hard isolation mechanism: CPU workloads are permitted to run on GPU nodes, but their CPU usage is strictly capped by a predefined quota.

Aug 19 '25 01:08 kingeasternsun

Wait for this https://github.com/volcano-sh/volcano/pull/4454

Sep 09 '25 03:09 kingeasternsun

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: (none listed)

Once this PR has been reviewed and has the lgtm label, please assign jessestutler for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

Nov 04 '25 01:11 volcano-sh-bot