DCGM
diag --configfile option is silently ignored if --parameters option is present
I use an aws-platform.yaml config file to pass platform characteristics that never change:
version: AWS-0.1
spec: dcgm-diag-v1
skus:
  - name: NVIDIA H100 80GB HBM3 p5.48xlarge
    id: 2330
    pcie:
      is_allowed: true
      h2d_d2h_single_pinned:
        min_pci_generation: 5.0
        min_pci_width: 16.0
        min_bandwidth: 14.0
        max_latency: 5
      h2d_d2h_single_unpinned:
        min_pci_generation: 5.0
        min_pci_width: 16.0
        min_bandwidth: 14.0
      gpu_nvlinks_expected_up: 18
      nvswitch_nvlinks_expected_up: 6
I also want to customize other parameters, such as test_duration, and use the --parameters option for that:
dcgmi diag --verbose --json --configfile diag-aws.yaml --run long --parameters memtest.test_duration=120
However, it appears that the --configfile option is silently ignored when the --parameters option is present, and nvvs is called in configless mode:
/usr/share/nvidia-validation-suite/nvvs -j -z --specifiedtest long --parameters memtest.test_duration=120 --configless -v --indexes 0,1,2,3,4,5,6,7
This is very counterintuitive and makes quick parameter prototyping hard, because only one of --configfile or --parameters can be used at a time. Passing all system parameters through --parameters does not seem practical.
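As a possible workaround until the two options can be combined, the overrides could be merged into the config before dcgmi is called, so that only --configfile is passed. Below is a minimal Python sketch of that idea. It assumes the config file accepts per-test keys such as memtest.test_duration under a SKU entry, which I have not verified. The helper names parse_value and apply_overrides are hypothetical, and loading and writing the YAML file is left out (the config is shown as an already-parsed dict).

```python
import copy


def parse_value(text):
    """Convert a command-line value into bool, int, or float; otherwise keep the string."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def apply_overrides(config, overrides, sku_index=0):
    """Return a copy of config with 'test.param=value' overrides merged into one SKU.

    Assumption: the config format allows the same test.param names that
    --parameters accepts, nested under the SKU entry.
    """
    merged = copy.deepcopy(config)
    sku = merged["skus"][sku_index]
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = sku
        for name in parents:
            # Create intermediate sections (test or subtest) when missing.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return merged


# Trimmed version of the config above, as it would look after YAML parsing.
config = {
    "version": "AWS-0.1",
    "spec": "dcgm-diag-v1",
    "skus": [{
        "name": "NVIDIA H100 80GB HBM3 p5.48xlarge",
        "id": 2330,
        "pcie": {"is_allowed": True, "gpu_nvlinks_expected_up": 18},
    }],
}

merged = apply_overrides(config, ["memtest.test_duration=120"])
print(merged["skus"][0]["memtest"])  # {'test_duration': 120}
```

The merged dict would then be written to a temporary YAML file and passed with --configfile alone. The original config is left untouched, so the same base file can be reused for different experiments.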