
diag --configfile option is silently ignored if --parameters option is present

Open dmonakhov opened this issue 1 year ago • 0 comments

I use an aws-platform.yaml config file to pass platform characteristics that never change:

version: AWS-0.1
spec: dcgm-diag-v1
skus:
  - name: NVIDIA H100 80GB HBM3 p5.48xlarge
    id: 2330
    pcie:
      is_allowed: true
      h2d_d2h_single_pinned:
        min_pci_generation: 5.0
        min_pci_width: 16.0
        min_bandwidth: 14.0
        max_latency: 5
      h2d_d2h_single_unpinned:
        min_pci_generation: 5.0
        min_pci_width: 16.0
        min_bandwidth: 14.0
      gpu_nvlinks_expected_up: 18
      nvswitch_nvlinks_expected_up: 6

But I also want to customize other parameters, such as test_duration, and use the --parameters option for this:

dcgmi diag --verbose --json --configfile diag-aws.yaml --run long --parameters memtest.test_duration=120

But it turns out that the --configfile option is silently ignored if the --parameters option is present, and nvvs is called in configless mode:

 /usr/share/nvidia-validation-suite/nvvs -j -z --specifiedtest long --parameters memtest.test_duration=120 --configless -v --indexes 0,1,2,3,4,5,6,7 
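The mismatch between the two command lines above can be checked mechanically. The following is only a minimal sketch: the helper name `configfile_was_dropped` is hypothetical, and it relies solely on the flags that appear in this report (`--configfile` on the dcgmi side, `--configless` on the nvvs side).

```python
import shlex


def configfile_was_dropped(dcgmi_cmd: str, nvvs_cmd: str) -> bool:
    """Return True if dcgmi was given --configfile but nvvs ran with --configless.

    Hypothetical helper for illustration; it only compares the two command
    lines and does not call DCGM in any way.
    """
    dcgmi_args = shlex.split(dcgmi_cmd)
    nvvs_args = shlex.split(nvvs_cmd)
    return "--configfile" in dcgmi_args and "--configless" in nvvs_args


# Command lines copied from this report.
dcgmi_cmd = (
    "dcgmi diag --verbose --json --configfile diag-aws.yaml "
    "--run long --parameters memtest.test_duration=120"
)
nvvs_cmd = (
    "/usr/share/nvidia-validation-suite/nvvs -j -z --specifiedtest long "
    "--parameters memtest.test_duration=120 --configless -v "
    "--indexes 0,1,2,3,4,5,6,7"
)

print(configfile_was_dropped(dcgmi_cmd, nvvs_cmd))
```

For the commands in this report the check reports that the config file was dropped.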

This is very counterintuitive and makes quick parameter prototyping hard, because either the config file or --parameters can be used, but not both. And passing all system parameters via --parameters does not seem very practical.
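To give a sense of how unwieldy that would be, here is a rough sketch that flattens part of the config above into a single parameter string. The helper names are hypothetical, and two things are assumptions I have not verified here: that nvvs accepts every one of these dotted names, and that `;` is the separator between parameters.

```python
def flatten(prefix, node):
    """Yield (dotted_name, value) pairs for every leaf under a nested dict."""
    for key, value in node.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(name, value)
        else:
            yield name, value


def format_value(value):
    """Render booleans YAML-style (true/false); leave other values as-is."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def to_parameters(settings):
    """Join all leaves as name=value pairs (separator ';' is an assumption)."""
    return ";".join(f"{n}={format_value(v)}" for n, v in flatten("", settings))


# A subset of the pcie section from the config file above.
platform = {
    "pcie": {
        "is_allowed": True,
        "h2d_d2h_single_pinned": {
            "min_pci_generation": 5.0,
            "min_pci_width": 16.0,
            "min_bandwidth": 14.0,
            "max_latency": 5,
        },
    },
    # The one setting that actually changes between runs.
    "memtest": {"test_duration": 120},
}

print(to_parameters(platform))
```

Even for this subset the resulting string carries six entries, only one of which changes between runs.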

dmonakhov, Jan 26 '24 23:01