8xH100 server training time higher than 8xA100 server.
Related to Model/Framework(s): TensorFlow / PyTorch
Describe the bug
While running YOLOX on the servers described below, total training time on the H100 server is higher than on the A100 server. I also ran a standalone test script (ahn1.txt) on both servers, and the A100 server is again faster. Could you please tell me the actual reason for this behavior?
Train command:
time nsys profile -t cuda,nvtx,osrt,cudnn,cublas --stats=true -x true --force-overwrite true python3 ahn1.py --dataset=random/ --num_classes=30 --batch_size=256 --num_epochs=8 --num_gpus=8
A100 profiling log (time to complete: w/ nsys: 7m48.808s, w/o nsys: 3m51.057s): a100_run.log
H100 profiling log (time to complete: w/ nsys: 23m16.4s, w/o nsys: 5m29.673s): h100_run.log
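For reference, below is a minimal sketch of the kind of synthetic multi-GPU benchmark the test script drives. The real ahn1.py is in the attached ahn1.txt; this stand-in only assumes a TF2 MirroredStrategy run over random data with the same CLI flags, and the model and steps-per-epoch here are made up for illustration.

```python
# Hypothetical stand-in for ahn1.py (the actual script is attached as ahn1.txt).
# It mirrors the CLI flags used in the train command and times a multi-GPU
# training loop over synthetic data with tf.distribute.MirroredStrategy.
import argparse
import tensorflow as tf

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", default="random/")
    parser.add_argument("--num_classes", type=int, default=30)
    parser.add_argument("--batch_size", type=int, default=256)
    parser.add_argument("--num_epochs", type=int, default=8)
    parser.add_argument("--num_gpus", type=int, default=8)
    args = parser.parse_args()

    # Data-parallel replication across the requested number of GPUs.
    devices = [f"/gpu:{i}" for i in range(args.num_gpus)]
    strategy = tf.distribute.MirroredStrategy(devices=devices)

    # "random/" dataset: a fixed number of steps of random images/labels
    # (steps_per_epoch is an assumption, not taken from the report).
    steps_per_epoch = 100
    images = tf.random.uniform((args.batch_size, 224, 224, 3))
    labels = tf.random.uniform((args.batch_size,), maxval=args.num_classes,
                               dtype=tf.int32)
    ds = tf.data.Dataset.from_tensors((images, labels)).repeat(steps_per_epoch)

    with strategy.scope():
        # Placeholder backbone; the real test script may use a different model.
        model = tf.keras.applications.ResNet50(weights=None,
                                               classes=args.num_classes)
        model.compile(optimizer="sgd",
                      loss=tf.keras.losses.SparseCategoricalCrossentropy())

    model.fit(ds, epochs=args.num_epochs)

if __name__ == "__main__":
    main()
```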
I have two servers, as shown below.

Server1 (8x H100)
lscpu
nvidia-smi topo -m
Server2 (8x A100)
lscpu
nvidia-smi topo -m
To Reproduce
Steps to reproduce the behavior:
- Launch the container:
docker run -d --gpus all nvcr.io/nvidia/tensorflow:22.04-tf2-py3
- Run the training script under nsys:
time nsys profile -t cuda,nvtx,osrt,cudnn,cublas --stats=true -x true --force-overwrite true python3 ahn1.py --dataset=random/ --num_classes=30 --batch_size=256 --num_epochs=8 --num_gpus=8
Expected behavior
Training on the 8x H100 server should be at least as fast as on the 8x A100 server.
Environment
- Container version: TensorFlow (nvcr.io/nvidia/tensorflow:22.04-tf2-py3) / PyTorch (nvcr.io/nvidia/pytorch:23.04-py3)
- GPUs in the system: 8x A100 / 8x H100
- CUDA driver version: H100 - 12.0 / A100 - 11.6
- CUDNN: H100 - 8.8.1.3-1+cuda12.0 / A100 - 8.4.1.50-1+cuda11.6
- NCCL: H100 - 2.16.2-1+cuda12.0 / A100 - 2.12.12-1+cuda11.6
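As a sanity check (a hypothetical addition, not part of the original report, assuming the TensorFlow container), the versions each container actually exposes at runtime can be confirmed with a short script run inside it:

```python
# Hypothetical version check: confirm the framework/CUDA/cuDNN versions and
# visible GPUs that the benchmark actually sees inside each container.
import tensorflow as tf

print("TensorFlow:", tf.__version__)
build = tf.sysconfig.get_build_info()  # build-time CUDA/cuDNN versions
print("CUDA (build):", build.get("cuda_version"))
print("cuDNN (build):", build.get("cudnn_version"))
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```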