gpu-operator
PowerEdge XE9680 H100 Support
Hi, we're maintaining an OpenShift v4.10 cluster and recently provisioned Dell PowerEdge XE9680 servers as GPU nodes. We're running NVIDIA GPU Operator v22.9.1 for now (aware of the EOL). The GPUs are exposed and usable, but we're not seeing the GPU performance we were expecting.
These servers are based on the NVIDIA HGX H100 architecture, and according to the NVIDIA GPU Operator v22.9.2 release notes:
- Added support for the NVIDIA HGX H100 System in the Supported NVIDIA GPUs and Systems table on the Platform Support page.
- Added 525.85.12 as the recommended driver version and 3.1.6 as the recommended DCGM version in the GPU Operator Component Matrix. These updates enable support for the NVIDIA HGX H100 System.
Does that mean upgrading the operator and the driver to these versions could address the reduced performance we're seeing? Could you please elaborate on the improvements in this driver version?
In addition, which benchmarking tools would you recommend to test these GPUs?
Status update: we've upgraded the NVIDIA GPU Operator to v22.9.2 and the NVIDIA GPU Driver to v525.85.12.
Operator v22.9.2 installs Driver v525.60.13 by default; to install v525.85.12, we added the following config to the clusterpolicy.nvidia.com CRD instance:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
...
spec:
  driver:
    image: >-
      nvcr.io/nvidia/driver:525.85.12-rhcos4.10
...
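For reference, the same change can be applied in place with oc patch; this is just a sketch, assuming the ClusterPolicy instance uses the default name gpu-cluster-policy (verify with oc get clusterpolicy):

# Point the driver at the desired image (the instance name here is an assumption)
oc patch clusterpolicy/gpu-cluster-policy --type merge \
  -p '{"spec": {"driver": {"image": "nvcr.io/nvidia/driver:525.85.12-rhcos4.10"}}}'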
After the installation, we restarted the nodes and waited for all the nvidia-gpu-operator pods to reach a Running state.
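For completeness, one way to verify the rollout (a sketch; the namespace and daemonset name below are the operator defaults and may differ in your install):

# Confirm all GPU Operator pods are Running
oc get pods -n nvidia-gpu-operator

# Ask the driver daemonset which driver version the node is actually running
oc exec -n nvidia-gpu-operator ds/nvidia-driver-daemonset -- \
  nvidia-smi --query-gpu=name,driver_version --format=csv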
We used NVIDIA's DeepLearningExamples ConvNets training benchmark to test the performance of the H100 GPU cards.
We used the following Deployment to run the benchmark in parallel on all GPUs in the node:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-benchmark
  namespace: gpu-tests
spec:
  replicas: 8
  selector:
    matchLabels:
      app: gpu-benchmark
  template:
    metadata:
      labels:
        app: gpu-benchmark
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
      containers:
        - name: gpu-benchmark
          image: nvcr.io/nvidia/pytorch:23.10-py3
          # assumes the DeepLearningExamples repo is available under
          # ./DeepLearningExamples inside the container
          command:
            - bash
            - '-c'
            - >
              python
              ./DeepLearningExamples/PyTorch/Classification/ConvNets/multiproc.py
              --nproc_per_node 1
              ./DeepLearningExamples/PyTorch/Classification/ConvNets/launch.py
              --model resnet50 --precision AMP --mode benchmark_training
              --platform DGXA100 --data-backend synthetic --raport-file
              benchmark.json --epochs 1 --prof 100 ./ && sleep infinity
          resources:
            limits:
              cpu: 500m
              memory: 2G
              nvidia.com/gpu: '1'
            requests:
              cpu: 500m
              memory: 2G
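To collect the numbers, one option is to pull the throughput line from each pod's logs; a rough sketch (the grep pattern is just one way to surface the DLLogger output):

# Print the last reported train.total_ips value from every benchmark pod
for pod in $(oc get pods -n gpu-tests -l app=gpu-benchmark -o name); do
  echo "== ${pod}"
  oc logs -n gpu-tests "${pod}" | grep 'train.total_ips' | tail -n 1
done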
The benchmark showed a significant performance improvement! We compared the train.total_ips metric (images processed per second) between the two executions:
- Driver 525.60.13 produced an inconsistent, unstable rate ranging between 1200 and 2600 ips across iterations.
- Driver 525.85.12 produced a consistent, stable rate of ~2600 ips in each iteration, with no drops below 2600 and several peaks above 3000.
It's safe to say the driver upgrade was essential to achieving better and more stable performance. The Driver v525.85.12 documentation contains several references to H100 bug fixes and performance improvements.
We're looking forward to upgrading the NVIDIA GPU Operator to later versions and progressing towards the R535 Driver family.
UPDATE:
We've upgraded to NVIDIA GPU Operator v23.3.2 with GPU Driver v535.104.12 (the recommended version, not the default).
The benchmark showed a train.total_ips average of ~2600 ips in each iteration.