
Different speed test results for different machine configurations


Related to Model/Framework(s) (e.g. GNMT/PyTorch or FasterTransformer/All)
Model: UNet (backbone: VGG16), semantic segmentation
Framework: TensorFlow/PyTorch

Describe the bug

I have three machines:

  • Machine 1: 4x A100 40GB PCIe, AMD EPYC 7662 64-Core Processor
  • Machine 2: 4x A100 80GB PCIe, AMD EPYC 7713 64-Core Processor
  • Machine 3: 4x A100 80GB SXM4, AMD EPYC 7713 64-Core Processor
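For reference, this is roughly how the visible configuration can be confirmed on each machine (a minimal sketch of my own check, not part of any repo):

```python
# Minimal sketch to confirm what each machine actually exposes;
# device names and memory sizes are whatever the driver reports.
import torch

print(torch.cuda.device_count(), "GPUs visible")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, f"{props.total_memory / 1024**3:.0f} GB")
```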

I first reproduced the MLPerf v1.1 training result submitted by NVIDIA to MLCommons, using the same algorithm with the same hyperparameters (even the same batch size on Machine 1, where each GPU has half the memory). The training speed results, fastest to slowest, were:

Machine 3 > Machine 2 > Machine 1

Now I am using the simple TensorFlow multi-GPU training example (https://www.tensorflow.org/tutorials/distribute/keras#set_up_the_input_pipeline) to train semantic segmentation on the Cityscapes dataset. All parameters are the same across machines here as well.
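For context, the tutorial's multi-GPU setup boils down to roughly the following (a minimal sketch; the model and batch size below are placeholders, not my actual UNet/Cityscapes pipeline):

```python
# Sketch of the tutorial's multi-GPU setup via tf.distribute.MirroredStrategy;
# the model here is a placeholder, not my actual UNet/VGG16 code.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs by default
print("Number of replicas:", strategy.num_replicas_in_sync)

BATCH_SIZE_PER_REPLICA = 8  # placeholder value
global_batch = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

with strategy.scope():
    # The model must be built and compiled inside the strategy scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(128, 128, 3)),
        tf.keras.layers.Conv2D(3, 1),  # stand-in for a segmentation head
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# train_dataset would be the Cityscapes tf.data pipeline, batched with global_batch:
# model.fit(train_dataset, epochs=...)
```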

Surprisingly, here I see a different training speed ordering:

Machine 2 > Machine 1 > Machine 3

But with single-GPU training, the behavior is different again:

Machine 3 > Machine 2 > Machine 1

When I checked the TensorFlow profile of my training, I saw that Machine 3 (SXM) has higher kernel launch time and host compute time than the other two machines. Why are these times higher even though this machine has the best processor and GPUs?
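For reference, a sketch of how such a profile can be captured via the TensorBoard callback (the log directory and profiled step range are placeholder values, not necessarily what I used):

```python
# Sketch of capturing a TensorFlow profile through the TensorBoard callback;
# log_dir and the profiled batch range are placeholders.
import tensorflow as tf

tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="./tf_logs",      # placeholder path
    profile_batch=(10, 20),   # profile steps 10-20, skipping warm-up steps
)
# model.fit(train_dataset, epochs=..., callbacks=[tb_callback])
```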

Why am I getting different results? What am I missing to fully leverage the SXM form-factor hardware?

TF version: 2.5.0 (TensorFlow Docker image)

Along with the example mentioned above, I have also tried training a model using the https://github.com/Tramac/awesome-semantic-segmentation-pytorch repo. I still see the same training speed ordering as with the TensorFlow model:

Machine 2 > Machine 1 > Machine 3
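For reference, a minimal sketch of the kind of timing loop that can be used to compare per-step throughput across machines (placeholder model and data, not the repo's actual UNet pipeline):

```python
# Minimal sketch for timing training throughput in PyTorch;
# the model and data are placeholders, not the repo's actual pipeline.
import time
import torch

device = torch.device("cuda")
model = torch.nn.Conv2d(3, 3, 3, padding=1).to(device)  # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(8, 3, 512, 512, device=device)  # placeholder batch
y = torch.randn(8, 3, 512, 512, device=device)

for _ in range(10):  # warm-up steps, excluded from timing
    opt.zero_grad()
    torch.nn.functional.mse_loss(model(x), y).backward()
    opt.step()

torch.cuda.synchronize()  # ensure queued kernels finish before timing starts
start = time.time()
steps = 50
for _ in range(steps):
    opt.zero_grad()
    torch.nn.functional.mse_loss(model(x), y).backward()
    opt.step()
torch.cuda.synchronize()
print(f"{steps * x.shape[0] / (time.time() - start):.1f} images/sec")
```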

To Reproduce
Steps to reproduce the behavior:

  1. Install '...'
  2. Set "..."
  3. Launch '...'

Expected behavior
A clear and concise description of what you expected to happen.

Environment
Please provide at least:

  • Container version (e.g. pytorch:19.05-py3):
  • GPUs in the system (e.g. 8x Tesla V100-SXM2-16GB):
  • CUDA driver version (e.g. 418.67):

Please let me know if more information is needed. Thank you.

PurvangL · Sep 02 '22 19:09