gpu-operator

Add extended diagnostics support to must-gather.sh using debug container for dmidecode/lspci collection

Open karthikvetrivel opened this issue 2 weeks ago • 0 comments

Description

Adds optional extended diagnostics to must-gather.sh using a lightweight debug container. Enables complete nvidia-bug-report collection including dmidecode and lspci without adding these tools to the driver container (addressing CVE compliance concerns). This feature is opt-in only; when enabled, users are shown a warning about the external debug container and privileged access requirements before collection begins.

Usage

Standard (existing behavior)

./must-gather.sh

Extended diagnostics

ENABLE_EXTENDED_DIAGNOSTICS=true ./must-gather.sh
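
The opt-in behavior described above could be gated roughly as follows. This is a minimal sketch, not the actual must-gather.sh code: the helper names extended_diagnostics_enabled and print_extended_warning and the warning wording are hypothetical. Only the ENABLE_EXTENDED_DIAGNOSTICS variable comes from the usage shown here.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an opt-in gate; the real must-gather.sh may differ.

# Default to "false" so the standard collection path stays unchanged.
ENABLE_EXTENDED_DIAGNOSTICS="${ENABLE_EXTENDED_DIAGNOSTICS:-false}"

# Succeeds only when the variable is exactly "true".
extended_diagnostics_enabled() {
    [ "${ENABLE_EXTENDED_DIAGNOSTICS}" = "true" ]
}

# Hypothetical helper: the warning text is illustrative only.
print_extended_warning() {
    echo "WARNING: extended diagnostics uses an external debug container" >&2
    echo "WARNING: the debug container requires privileged access" >&2
}

if extended_diagnostics_enabled; then
    # Warn before any collection begins, as the description requires.
    print_extended_warning
    echo "extended diagnostics: enabled"
else
    echo "extended diagnostics: disabled"
fi
```

Comparing against the exact string "true" means any other value falls back to the standard path.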

Testing

Environment: K8s v1.28+, Tesla T4 GPU, ghcr.io/nvidia/gpu-operator-debug:latest

  1. Ran ./must-gather.sh without flags and verified standard nvidia-bug-report collection works unchanged. Confirmed no extended diagnostics section is added in default mode.
  2. Ran with ENABLE_EXTENDED_DIAGNOSTICS=true and verified the debug container (ghcr.io/nvidia/gpu-operator-debug:latest) attaches successfully via kubectl debug.
  3. Confirmed dmidecode output (BIOS/system information) is captured and appended to the bug report.
  4. Confirmed lspci output (verbose PCI device information) is captured.
  5. Verified Makefile targets build and push the debug image to ghcr.io/nvidia/gpu-operator-debug:latest (public), and confirmed custom image override works for air-gapped environments.
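
As an illustration of steps 2 through 5, the sketch below only prints the kubectl debug invocations such a script might issue, without running them. Several details are assumptions rather than facts from this change: the DEBUG_IMAGE override variable, the NODE_NAME placeholder, the use of node-level debugging, and the lspci -vvv verbosity level. The privileged-access settings the real collection needs are not shown.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints commands instead of executing them.

# Assumption: DEBUG_IMAGE is a hypothetical override for air-gapped registries.
DEBUG_IMAGE="${DEBUG_IMAGE:-ghcr.io/nvidia/gpu-operator-debug:latest}"

# debug_cmd NODE TOOL [ARGS...]: print the kubectl debug command for one tool.
debug_cmd() {
    local node="$1"
    shift
    echo "kubectl debug node/${node} --image=${DEBUG_IMAGE} --attach -- $*"
}

# Assumption: NODE_NAME is a placeholder for the GPU node under test.
NODE_NAME="${NODE_NAME:-gpu-node-1}"

debug_cmd "$NODE_NAME" dmidecode
debug_cmd "$NODE_NAME" lspci -vvv
```

Keeping the image in a single variable is one way to let an air-gapped cluster point at a mirrored copy without editing the script.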

Sample output verification:

$ zcat nvidia-bug-report_*.log.gz | grep -A3 "EXTENDED DIAGNOSTICS"
*** EXTENDED DIAGNOSTICS (from debug container) ***
$ zcat nvidia-bug-report_*.log.gz | grep "NVIDIA Corporation"
65:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
$ zcat nvidia-bug-report_*.log.gz | grep "BIOS Information" -A2
BIOS Information
    Vendor: American Megatrends Inc.
    Version: 3.3

File size: Standard 591KB → Extended 596KB (+5KB diagnostics)
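
The manual checks above could be bundled into a small helper. This is a sketch under assumptions: the function name check_report is hypothetical, and it only looks for the three marker strings shown in the sample output.

```shell
#!/usr/bin/env bash
# check_report FILE: hypothetical helper that reports which extended-diagnostics
# markers are present in a gzip-compressed bug report.
# Returns 0 only if all three markers are found.
check_report() {
    local report="$1" marker missing=0
    for marker in "EXTENDED DIAGNOSTICS" "NVIDIA Corporation" "BIOS Information"; do
        # gzip -dc is used instead of zcat for portability across platforms.
        if gzip -dc "$report" | grep -qF "$marker"; then
            echo "found: $marker"
        else
            echo "missing: $marker"
            missing=1
        fi
    done
    return "$missing"
}
```

It could be run as check_report nvidia-bug-report_*.log.gz when the glob matches a single file. A non-zero exit status indicates that at least one marker is absent.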

karthikvetrivel, Dec 09 '25 20:12