8xH100 server training time higher than 8xA100 server.
Related to Model/Framework(s): TensorFlow / PyTorch
Describe the bug
While running YOLOX on the servers described below, total training time on the H100 server is higher than on the A100 server. I also ran a standalone test script (ahn1.txt) on both servers, and the A100 server is again faster. Could you please tell me the actual reason for this behavior?
Train command:
time nsys profile -t cuda,nvtx,osrt,cudnn,cublas --stats=true -x true --force-overwrite true python3 ahn1.py --dataset=random/ --num_classes=30 --batch_size=256 --num_epochs=8 --num_gpus=8
A100 profiling log (time to complete: w/ nsys: 7m48.808s, w/o nsys: 3m51.057s): a100_run.log
H100 profiling log (time to complete: w/ nsys: 23m16.4s, w/o nsys: 5m29.673s): h100_run.log
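For reference, below is a minimal sketch of the kind of synthetic multi-GPU benchmark the test script drives. The real ahn1.py is in the attached ahn1.txt; this stand-in only assumes a TF2 MirroredStrategy run over random data with the same CLI flags, and the model and steps-per-epoch here are made up for illustration.

```python
# Hypothetical stand-in for ahn1.py (the actual script is attached as ahn1.txt).
# It mirrors the CLI flags used in the train command and times a multi-GPU
# training loop over synthetic data with tf.distribute.MirroredStrategy.
import argparse
import tensorflow as tf

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", default="random/")
    parser.add_argument("--num_classes", type=int, default=30)
    parser.add_argument("--batch_size", type=int, default=256)
    parser.add_argument("--num_epochs", type=int, default=8)
    parser.add_argument("--num_gpus", type=int, default=8)
    args = parser.parse_args()

    # Data-parallel replication across the requested number of GPUs.
    devices = [f"/gpu:{i}" for i in range(args.num_gpus)]
    strategy = tf.distribute.MirroredStrategy(devices=devices)

    # "random/" dataset: a fixed number of steps of random images/labels
    # (steps_per_epoch is an assumption, not taken from the report).
    steps_per_epoch = 100
    images = tf.random.uniform((args.batch_size, 224, 224, 3))
    labels = tf.random.uniform((args.batch_size,), maxval=args.num_classes,
                               dtype=tf.int32)
    ds = tf.data.Dataset.from_tensors((images, labels)).repeat(steps_per_epoch)

    with strategy.scope():
        # Placeholder backbone; the real test script may use a different model.
        model = tf.keras.applications.ResNet50(weights=None,
                                               classes=args.num_classes)
        model.compile(optimizer="sgd",
                      loss=tf.keras.losses.SparseCategoricalCrossentropy())

    model.fit(ds, epochs=args.num_epochs)

if __name__ == "__main__":
    main()
```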
I have two servers, as shown below.

Server1 (8x H100)
lscpu
nvidia-smi topo -m
Server2 (8x A100)
lscpu
nvidia-smi topo -m
To Reproduce
Steps to reproduce the behavior:
- Launch the container:
docker run -d --gpus all nvcr.io/nvidia/tensorflow:22.04-tf2-py3
- Run the training script under nsys:
time nsys profile -t cuda,nvtx,osrt,cudnn,cublas --stats=true -x true --force-overwrite true python3 ahn1.py --dataset=random/ --num_classes=30 --batch_size=256 --num_epochs=8 --num_gpus=8
Expected behavior
Training on the 8x H100 server should be at least as fast as on the 8x A100 server.
Environment
- Container version: TensorFlow (nvcr.io/nvidia/tensorflow:22.04-tf2-py3) / PyTorch (nvcr.io/nvidia/pytorch:23.04-py3)
- GPUs in the system: 8x A100 / 8x H100
- CUDA driver version: H100 - 12.0 / A100 - 11.6
- CUDNN: H100 - 8.8.1.3-1+cuda12.0 / A100 - 8.4.1.50-1+cuda11.6
- NCCL: H100 - 2.16.2-1+cuda12.0 / A100 - 2.12.12-1+cuda11.6
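As a sanity check (a hypothetical addition, not part of the original report, assuming the TensorFlow container), the versions each container actually exposes at runtime can be confirmed with a short script run inside it:

```python
# Hypothetical version check: confirm the framework/CUDA/cuDNN versions and
# visible GPUs that the benchmark actually sees inside each container.
import tensorflow as tf

print("TensorFlow:", tf.__version__)
build = tf.sysconfig.get_build_info()  # build-time CUDA/cuDNN versions
print("CUDA (build):", build.get("cuda_version"))
print("cuDNN (build):", build.get("cudnn_version"))
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```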