gpu-operator
PowerEdge XE9680 H100 Support
Hi, we're maintaining an OpenShift v4.10 cluster and recently provisioned Dell PowerEdge XE9680 servers as GPU nodes. We're running NVIDIA GPU Operator v22.9.1 for now (aware of the EOL). The GPUs are exposed and usable, but we're not seeing the GPU performance we were expecting.
These servers are based on the NVIDIA HGX H100 architecture, and according to the NVIDIA GPU Operator v22.9.2 release notes:
- Added support for the NVIDIA HGX H100 System in the Supported NVIDIA GPUs and Systems table on the Platform Support page.
- Added 525.85.12 as the recommended driver version and 3.1.6 as the recommended DCGM version in the GPU Operator Component Matrix. These updates enable support for the NVIDIA HGX H100 System.
Does that mean upgrading the operator and the driver to these versions could address the reduced performance we're seeing? Could you please elaborate on the improvements in this driver version?
In addition, which benchmarking tools would you recommend to test these GPUs?
Status update: we've upgraded the NVIDIA GPU Operator to v22.9.2 and the NVIDIA GPU Driver to v525.85.12.
Operator v22.9.2 installs Driver v525.60.13 by default; to install v525.85.12, we added the following config to the clusterpolicy.nvidia.com CRD instance:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
...
spec:
  driver:
    image: >-
      nvcr.io/nvidia/driver:525.85.12-rhcos4.10
...
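For reference, the same change can be applied in place with oc patch; this is just a sketch, assuming the ClusterPolicy instance uses the default name gpu-cluster-policy (verify with oc get clusterpolicy):

# Point the driver at the desired image (the instance name here is an assumption)
oc patch clusterpolicy/gpu-cluster-policy --type merge \
  -p '{"spec": {"driver": {"image": "nvcr.io/nvidia/driver:525.85.12-rhcos4.10"}}}'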
After the installation, we restarted the nodes and waited for all the nvidia-gpu-operator pods to reach a Running state.
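For completeness, one way to verify the rollout (a sketch; the namespace and daemonset name below are the operator defaults and may differ in your install):

# Confirm all GPU Operator pods are Running
oc get pods -n nvidia-gpu-operator

# Ask the driver daemonset which driver version the node is actually running
oc exec -n nvidia-gpu-operator ds/nvidia-driver-daemonset -- \
  nvidia-smi --query-gpu=name,driver_version --format=csv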
We used NVIDIA's DeepLearningExamples ConvNets training benchmark to test the performance of the H100 GPU cards.
We used the following Deployment to run the benchmark in parallel on all GPUs in the node:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-benchmark
  namespace: gpu-tests
spec:
  replicas: 8
  selector:
    matchLabels:
      app: gpu-benchmark
  template:
    metadata:
      labels:
        app: gpu-benchmark
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
      containers:
        - name: gpu-benchmark
          image: nvcr.io/nvidia/pytorch:23.10-py3
          # assumes the DeepLearningExamples repo is available under
          # ./DeepLearningExamples inside the container
          command:
            - bash
            - '-c'
            - >
              python
              ./DeepLearningExamples/PyTorch/Classification/ConvNets/multiproc.py
              --nproc_per_node 1
              ./DeepLearningExamples/PyTorch/Classification/ConvNets/launch.py
              --model resnet50 --precision AMP --mode benchmark_training
              --platform DGXA100 --data-backend synthetic --raport-file
              benchmark.json --epochs 1 --prof 100 ./ && sleep infinity
          resources:
            limits:
              cpu: 500m
              memory: 2G
              nvidia.com/gpu: '1'
            requests:
              cpu: 500m
              memory: 2G
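To collect the numbers, one option is to pull the throughput line from each pod's logs; a rough sketch (the grep pattern is just one way to surface the DLLogger output):

# Print the last reported train.total_ips value from every benchmark pod
for pod in $(oc get pods -n gpu-tests -l app=gpu-benchmark -o name); do
  echo "== ${pod}"
  oc logs -n gpu-tests "${pod}" | grep 'train.total_ips' | tail -n 1
done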
The benchmark showed a significant performance improvement! We compared the train.total_ips metric (images processed per second) between the two executions:
- Driver 525.60.13 produced an inconsistent, unstable rate ranging between 1200 and 2600 ips across iterations.
- Driver 525.85.12 produced a consistent, stable rate of ~2600 ips in each iteration, with no drops below 2600 and several peaks above 3000.
It's safe to say the driver upgrade was essential to achieving better and more stable performance. The Driver v525.85.12 documentation contains several references to H100 bug fixes and performance improvements.
We're looking forward to upgrading the NVIDIA GPU Operator to later versions and progressing towards the R535 Driver family.
UPDATE:
We've upgraded to NVIDIA GPU Operator v23.3.2 with GPU Driver v535.104.12 (the recommended version, not the default).
The benchmark showed a train.total_ips average of ~2600 ips in each iteration.