Enhance MIG Support Detection for NVIDIA GPUs introduced in R580

Open RuixiangMa opened this issue 1 month ago • 1 comment

Summary

This PR improves the Multi-Instance GPU (MIG) capability detection logic in the NVIDIA GPU Operator. It expands the list of GPU products recognized as MIG-capable and replaces the previous check with structured pattern matching grouped by architecture.

Changes Made

1. Enhanced MIG Detection Logic (controllers/state_manager.go)

  • Refactored the hasMIGCapableGPU function to use a dedicated helper function isMIGCapableGPUProduct
  • Expanded MIG detection from 3 models to the Hopper, Ampere, Blackwell, and RTX PRO products listed in section 2
  • Implemented structured pattern matching with clear architectural categorization
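
The refactor described above can be sketched roughly as follows. This is a minimal illustration, not the code from controllers/state_manager.go: the function signatures, the `gpuProductLabelKey` constant, and the short pattern list in the helper are assumptions made for the example.

```go
package main

import "strings"

// gpuProductLabelKey is an assumption for this sketch: the product label
// published by GPU Feature Discovery. The key the operator actually reads
// is not shown in this PR description.
const gpuProductLabelKey = "nvidia.com/gpu.product"

// isMIGCapableGPUProduct reports whether a product name looks MIG-capable.
// The pattern list here is a placeholder subset, for illustration only.
func isMIGCapableGPUProduct(product string) bool {
	p := strings.ToLower(product)
	for _, pattern := range []string{"a100", "a30", "h100"} {
		if strings.Contains(p, pattern) {
			return true
		}
	}
	return false
}

// hasMIGCapableGPU (hypothetical signature) no longer inspects product
// names itself; it only looks up the label and delegates to the helper.
func hasMIGCapableGPU(labels map[string]string) bool {
	product, ok := labels[gpuProductLabelKey]
	if !ok {
		return false
	}
	return isMIGCapableGPUProduct(product)
}
```

Keeping the name matching in one helper means new products only touch the helper, and the helper can be unit-tested without building node label maps.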

2. Comprehensive GPU Architecture Support

The updated detection now supports:

Hopper Architecture (Data Center)

  • H100, H800, H200, H20, GH200

Ampere Architecture

  • A100, A800, A30

Blackwell Architecture (Next Generation)

  • GB200, B200, GB300, B300

Professional Workstation GPUs

  • RTX PRO 6000
  • RTX PRO 5000
  • Dual format support: both the "rtx-pro-6000" and "rtx pro 6000" naming conventions are matched
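
One way to express the categorized matching listed above is a table of lowercase name fragments grouped by architecture. The sketch below is an assumption about structure, not the PR's actual code: the `migPatternGroup` type, the `matchMIGArch` helper, and the use of plain substring matching are all hypothetical.

```go
package main

import "strings"

// migPatternGroup pairs an architecture name with lowercase fragments
// expected to appear in the product name. Hypothetical type for this sketch.
type migPatternGroup struct {
	arch     string
	patterns []string
}

// migPatternGroups mirrors the product list in this PR description.
var migPatternGroups = []migPatternGroup{
	{arch: "Hopper", patterns: []string{"h100", "h800", "h200", "h20", "gh200"}},
	{arch: "Ampere", patterns: []string{"a100", "a800", "a30"}},
	{arch: "Blackwell", patterns: []string{"gb200", "b200", "gb300", "b300"}},
	{arch: "RTX PRO", patterns: []string{
		"rtx-pro-6000", "rtx pro 6000",
		"rtx-pro-5000", "rtx pro 5000",
	}},
}

// matchMIGArch returns the first architecture group whose pattern occurs
// in the product name, comparing case-insensitively.
func matchMIGArch(product string) (string, bool) {
	p := strings.ToLower(strings.TrimSpace(product))
	if p == "" {
		return "", false
	}
	for _, group := range migPatternGroups {
		for _, pattern := range group.patterns {
			if strings.Contains(p, pattern) {
				return group.arch, true
			}
		}
	}
	return "", false
}

// isMIGCapableGPUProduct reports whether any group matches.
func isMIGCapableGPUProduct(product string) bool {
	_, ok := matchMIGArch(product)
	return ok
}
```

Plain substring matching can over-match: in this sketch "a30" also occurs inside a name such as "rtx-a3000". A stricter implementation might require a delimiter or end of string after the model number.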

Verification and Testing

Test Coverage

  • All supported GPU models across architectures
  • Multiple naming format variations
  • Negative test cases for non-MIG GPUs (T4, V100)
  • Edge cases (empty strings, partial matches)
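
The coverage above lends itself to a table-driven layout. The sketch below shows only the shape of such a table and a small runner; the product strings are illustrative values, and `migCase`, `migCases`, and `runMIGCases` are hypothetical names. In a real `_test.go` file the loop would normally live in a `testing.T` subtest instead of returning strings.

```go
package main

import "fmt"

// migCase is one row of the table: a product name and the expected result.
type migCase struct {
	name    string
	product string
	want    bool
}

// migCases covers positive matches, both RTX PRO naming formats,
// non-MIG GPUs, and the empty-string edge case. Values are illustrative.
var migCases = []migCase{
	{name: "Hopper H100", product: "NVIDIA-H100-80GB-HBM3", want: true},
	{name: "Ampere A30", product: "NVIDIA-A30", want: true},
	{name: "Blackwell GB200", product: "NVIDIA-GB200", want: true},
	{name: "RTX PRO with spaces", product: "NVIDIA RTX PRO 6000", want: true},
	{name: "RTX PRO with hyphens", product: "nvidia-rtx-pro-6000", want: true},
	{name: "non-MIG T4", product: "Tesla-T4", want: false},
	{name: "non-MIG V100", product: "Tesla-V100-SXM2-16GB", want: false},
	{name: "empty string", product: "", want: false},
}

// runMIGCases applies detect to every case and returns one message per
// mismatch, so an empty result means every case passed.
func runMIGCases(detect func(string) bool, cases []migCase) []string {
	var failures []string
	for _, c := range cases {
		if got := detect(c.product); got != c.want {
			failures = append(failures,
				fmt.Sprintf("%s: product %q: got %v, want %v", c.name, c.product, got, c.want))
		}
	}
	return failures
}
```

Passing the detection function as a parameter keeps the table independent of any particular implementation, for example `runMIGCases(isMIGCapableGPUProduct, migCases)`.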

Test Results

Screenshot 2025-11-17 16 47 44

RuixiangMa, Nov 17 '25 08:11

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot], Nov 17 '25 08:11