gpu-operator
Add extended diagnostics support to must-gather.sh using debug container for dmidecode/lspci collection
Description
Adds optional extended diagnostics to must-gather.sh using a lightweight debug container. Enables complete nvidia-bug-report collection including dmidecode and lspci without adding these tools to the driver container (addressing CVE compliance concerns). This feature is opt-in only; when enabled, users are shown a warning about the external debug container and privileged access requirements before collection begins.
Usage
Standard (existing behavior)
./must-gather.sh
Extended diagnostics
ENABLE_EXTENDED_DIAGNOSTICS=true ./must-gather.sh
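The opt-in behaviour could be gated roughly as in the sketch below. This is not the actual must-gather.sh code: the function names `extended_diagnostics_enabled` and `print_extended_diagnostics_warning` are hypothetical, and the sketch assumes that only the exact value `true` enables the feature.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the opt-in gate; function names are illustrative only.

# Succeeds only when ENABLE_EXTENDED_DIAGNOSTICS is exactly "true".
# An unset or empty variable is treated as disabled.
extended_diagnostics_enabled() {
  [ "${ENABLE_EXTENDED_DIAGNOSTICS:-false}" = "true" ]
}

# Prints the warning to stderr so it does not end up in collected output.
print_extended_diagnostics_warning() {
  cat >&2 <<'EOF'
WARNING: extended diagnostics are enabled.
  An external debug container (ghcr.io/nvidia/gpu-operator-debug:latest)
  will be started and requires privileged access to the node.
EOF
}

if extended_diagnostics_enabled; then
  print_extended_diagnostics_warning
  echo "mode=extended"
else
  echo "mode=standard"
fi
```

Defaulting to disabled keeps the standard invocation unchanged, which matches the existing behaviour described above.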
Testing
Environment: K8s v1.28+, Tesla T4 GPU, ghcr.io/nvidia/gpu-operator-debug:latest
- Ran ./must-gather.sh without flags and verified standard nvidia-bug-report collection works unchanged. Confirmed no extended diagnostics section is added in default mode.
- Ran with ENABLE_EXTENDED_DIAGNOSTICS=true and verified the debug container (ghcr.io/nvidia/gpu-operator-debug:latest) attaches successfully via kubectl debug.
- dmidecode output: Confirmed BIOS/system information is captured and appended to the bug report.
- lspci output: Confirmed verbose PCI device information is captured.
- Verified Makefile targets build and push the debug image to ghcr.io/nvidia/gpu-operator-debug:latest (public), and confirmed custom image override works for air-gapped environments.
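For reviewers who want to picture the image override and the debug attach together, here is a minimal sketch. It only prints the `kubectl debug` commands rather than running them. The `DEBUG_IMAGE` variable, the helper names, the node name `example-node`, and the `lspci -vvv` verbosity level are all assumptions for illustration; the PR does not state the actual override variable or the exact command lines.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: resolve the debug image and build node-debug commands.
# DEBUG_IMAGE and the helper names are illustrative, not taken from must-gather.sh.

DEFAULT_DEBUG_IMAGE="ghcr.io/nvidia/gpu-operator-debug:latest"

# Use the override when it is set and non-empty (for example a mirror in an
# air-gapped registry); otherwise fall back to the public image.
resolve_debug_image() {
  echo "${DEBUG_IMAGE:-$DEFAULT_DEBUG_IMAGE}"
}

# Print (do not execute) a kubectl debug command that would run one tool
# in a debug pod on the given node and stream its output back.
build_debug_command() {
  local node="$1" tool="$2"
  echo "kubectl debug node/${node} --image=$(resolve_debug_image) --attach -- ${tool}"
}

build_debug_command "example-node" "dmidecode"
build_debug_command "example-node" "lspci -vvv"
```

Printing the command first makes it easy to confirm which image would be pulled before any privileged pod is created.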
Sample output verification:
$ zcat nvidia-bug-report_*.log.gz | grep -A3 "EXTENDED DIAGNOSTICS"
*** EXTENDED DIAGNOSTICS (from debug container) ***
$ zcat nvidia-bug-report_*.log.gz | grep "NVIDIA Corporation"
65:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
$ zcat nvidia-bug-report_*.log.gz | grep "BIOS Information" -A2
BIOS Information
Vendor: American Megatrends Inc.
Version: 3.3
**File size:** Standard 591KB → Extended 596KB (+5KB diagnostics)
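The manual checks above could be wrapped in a small helper, sketched here under the assumption that the banner text `EXTENDED DIAGNOSTICS` is what marks the extended section. The function name `has_extended_diagnostics` is hypothetical, and the demo runs against a synthetic report so it does not need a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check a gzipped bug report for the extended section.

# Succeeds when the decompressed report contains the extended diagnostics banner.
has_extended_diagnostics() {
  local report="$1"
  gzip -dc "$report" | grep -q "EXTENDED DIAGNOSTICS"
}

# Build a synthetic report in a temporary directory for demonstration.
demo_dir="$(mktemp -d)"
trap 'rm -rf "$demo_dir"' EXIT
printf '%s\n' "*** EXTENDED DIAGNOSTICS (from debug container) ***" \
  | gzip > "${demo_dir}/nvidia-bug-report_demo.log.gz"

if has_extended_diagnostics "${demo_dir}/nvidia-bug-report_demo.log.gz"; then
  echo "extended diagnostics present"
else
  echo "extended diagnostics missing"
fi
```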