Different training speed results for different machine configurations
Related to Model/Framework(s)
Model: UNet (backbone: VGG16), semantic segmentation
Framework: TensorFlow / PyTorch
Describe the bug
I have three machines:
- Machine 1: 4x A100 40GB PCIe, AMD EPYC 7662 64-Core Processor
- Machine 2: 4x A100 80GB PCIe, AMD EPYC 7713 64-Core Processor
- Machine 3: 4x A100 80GB SXM4, AMD EPYC 7713 64-Core Processor
I first reproduced the MLPerf v1.1 training result submitted by NVIDIA to MLCommons. I used the same algorithm with the same hyperparameters on all three machines (the same batch size even on Machine 1, where each GPU has half the memory). The training speed results are as follows (fastest to slowest):
Machine 3 (fastest) > Machine 2 (faster) > Machine 1 (fast)
Now I am using the simple TensorFlow multi-GPU training example (https://www.tensorflow.org/tutorials/distribute/keras#set_up_the_input_pipeline) to train semantic segmentation on the Cityscapes dataset. All parameters are the same across machines here as well; a minimal sketch of the setup is shown below.
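For reference, here is a minimal sketch of that multi-GPU setup, following the MirroredStrategy pattern from the linked tutorial. The toy model, synthetic data, and per-GPU batch size of 8 are placeholders for illustration, not the actual VGG16-UNet/Cityscapes pipeline.

```python
# Minimal multi-GPU training sketch with tf.distribute.MirroredStrategy.
# Placeholder model and synthetic data; the real run uses a VGG16-based UNet
# on Cityscapes.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()        # uses all visible GPUs
print("Replicas in sync:", strategy.num_replicas_in_sync)

GLOBAL_BATCH = 8 * strategy.num_replicas_in_sync   # assumed per-GPU batch of 8

# Synthetic stand-in for the Cityscapes input pipeline.
images = tf.random.uniform((64, 256, 256, 3))
masks = tf.random.uniform((64, 256, 256), maxval=19, dtype=tf.int32)
dataset = (tf.data.Dataset.from_tensor_slices((images, masks))
           .batch(GLOBAL_BATCH)
           .prefetch(tf.data.AUTOTUNE))

with strategy.scope():
    # Placeholder segmentation model instead of the real UNet.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                               input_shape=(256, 256, 3)),
        tf.keras.layers.Conv2D(19, 1, padding="same"),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

model.fit(dataset, epochs=2)
```

MirroredStrategy replicates the model on every visible GPU and splits each global batch across the replicas, so the per-GPU work is identical on all three machines.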
Surprisingly, here I see different training speed behavior.
Machine 2 (fastest) > Machine 1 (faster) > Machine 3 (fast)
But with single-GPU training, the behavior is again different.
Machine 3 (fastest) > Machine 2 (faster) > Machine 1 (fast)
When I checked the TensorFlow profile of my training, I saw that Machine 3 (SXM) has more kernel launch time and host compute time than the other two machines. Why do we see these higher times even though this machine has the best processor and GPUs?
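For reference, a sketch of how the TensorFlow profiler can be enabled is below; the log directory and profiled step range are arbitrary illustrative choices, not my exact settings.

```python
# Sketch of enabling the TensorFlow profiler via the TensorBoard callback.
import tensorflow as tf

tb_cb = tf.keras.callbacks.TensorBoard(
    log_dir="./tb_logs",
    profile_batch=(10, 20),  # profile steps 10-20 to skip warm-up
)
# model.fit(dataset, epochs=2, callbacks=[tb_cb])

# Lower-level alternative around a few training steps:
# tf.profiler.experimental.start("./tb_logs")
# ... run some training steps ...
# tf.profiler.experimental.stop()
```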
Why am I getting different results? What am I missing here to fully leverage the SXM form factor hardware?
TensorFlow version: 2.5.0 (TensorFlow Docker image)
Along with the example mentioned above, I have also tried training a model using the https://github.com/Tramac/awesome-semantic-segmentation-pytorch repo. I still see similar training speeds, with the same ordering as for the TensorFlow model (see the sketch after the ranking below).
Machine 2 (fastest) > Machine 1 (faster) > Machine 3 (fast)
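For completeness, here is a generic single-node multi-GPU DistributedDataParallel sketch to show the comparable PyTorch setup. This is not the actual training script from that repository; the placeholder model, synthetic data, and hyperparameters are illustrative assumptions only.

```python
# Generic PyTorch DDP sketch (NOT the repo's script); placeholder model/data.
# Launch with, e.g.: torchrun --nproc_per_node=4 ddp_sketch.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by torchrun
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Placeholder segmentation head instead of the real VGG16-UNet.
    model = nn.Conv2d(3, 19, kernel_size=1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Synthetic stand-in for a Cityscapes batch (8 images per GPU).
    x = torch.randn(8, 3, 256, 256, device=local_rank)
    y = torch.randint(0, 19, (8, 256, 256), device=local_rank)

    for _ in range(10):
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```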
To Reproduce
Steps to reproduce the behavior:
- Install '...'
- Set "..."
- Launch '...'
Expected behavior
Environment
Please provide at least:
- Container version (e.g. pytorch:19.05-py3):
- GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB):
- CUDA driver version (e.g. 418.67):
Please let me know if more information is needed. Thank you.