
AlexNet training on real data has poor performance on multi-GPU V100

Open stevenyslins opened this issue 6 years ago • 1 comment

Hi all,

I ran tf_cnn_benchmarks.py with the repository checked out to the cnn_tf_v1.13_compatible branch. When the number of GPUs is greater than 4, performance drops significantly and GPU utilization does not exceed 90%.

Does anyone have an idea? Thank you for your help.

Total images/sec results:

1 GPU : 3861.48
2 GPU : 7831.85
3 GPU : 11164.2
4 GPU : 11548.3
5 GPU : 9254.16
6 GPU : 7057.19
7 GPU : 4178.24
8 GPU : 2811.72

Environment:

Host OS:            CentOS 7
Container OS:       Ubuntu 16.04.5
nvidia-driver:      410.79
CPU:                Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
CPU core number:    36
GPU:                Tesla V100-SXM2-32GB *8
Dataset:            ImageNet

Image:              Nvidia GPU Cloud docker image: nvcr.io/nvidia/tensorflow:19.03-py3
Package versions in container:
python:             3.5.2
tensorflow:         1.13.1
CUDA:               10.1.105
cuDNN:              7.5.0
NCCL:               2.4.3

Command:

python tf_cnn_benchmarks.py \
    --model=alexnet \
    --batch_size=512 \
    --num_gpus=8 \
    --variable_update=replicated \
    --all_reduce_spec=nccl \
    --local_parameter_device=gpu \
    --nodistortions \
    --num_batches=300 \
    --data_dir=/dataset/imagenet_train \
    --data_name=imagenet

GPU utilization:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1B:00.0 Off |                    0 |
| N/A   35C    P0    63W / 300W |  31394MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:1C:00.0 Off |                    0 |
| N/A   33C    P0    64W / 300W |  31394MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:3D:00.0 Off |                    0 |
| N/A   35C    P0    65W / 300W |  31394MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:3E:00.0 Off |                    0 |
| N/A   36C    P0    79W / 300W |  31394MiB / 32480MiB |     24%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:B1:00.0 Off |                    0 |
| N/A   33C    P0    72W / 300W |  31394MiB / 32480MiB |     26%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   35C    P0    67W / 300W |  31394MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:DB:00.0 Off |                    0 |
| N/A   35C    P0    65W / 300W |  31394MiB / 32480MiB |     56%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:DC:00.0 Off |                    0 |
| N/A   34C    P0    65W / 300W |  31394MiB / 32480MiB |     82%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    100396      C   python                                     31347MiB |
|    1    100396      C   python                                     31347MiB |
|    2    100396      C   python                                     31347MiB |
|    3    100396      C   python                                     31347MiB |
|    4    100396      C   python                                     31347MiB |
|    5    100396      C   python                                     31347MiB |
|    6    100396      C   python                                     31347MiB |
|    7    100396      C   python                                     31347MiB |
+-----------------------------------------------------------------------------+

stevenyslins avatar Apr 19 '19 11:04 stevenyslins

This is somewhat expected, as we never optimized for AlexNet. Its per-step time is very small, so the overhead of all-reducing the gradients takes up a greater percentage of the step time (although I'm not sure how large the gradients are).
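
For context, here is a rough back-of-envelope sketch of the gradient size, assuming classic AlexNet's ~60M parameters and treating --batch_size=512 as per GPU (both are assumptions, not values taken from this run):

# Rough sketch only; the parameter count and the per-GPU interpretation of
# --batch_size are assumptions, not measurements from this run.
params = 60e6                        # classic AlexNet: roughly 60M parameters
grad_mb = params * 4 / 1e6           # fp32 gradients -> ~240 MB all-reduced each step

single_gpu_rate = 3861.48            # images/sec reported above for 1 GPU
per_gpu_batch = 512
compute_time_ms = per_gpu_batch / single_gpu_rate * 1e3   # ~133 ms of compute per step

print("gradients per step  ~ %.0f MB" % grad_mb)
print("compute per step    ~ %.0f ms" % compute_time_ms)
# With only ~133 ms of compute to hide it behind, synchronizing ~240 MB of
# gradients across 5-8 GPUs every step can easily dominate the step time.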

Since tf_cnn_benchmarks is currently unmaintained and not written using modern TF2 features, the performance issue will likely not be addressed.
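
For reference, the modern TF2 counterpart of --variable_update=replicated is tf.distribute.MirroredStrategy. A minimal sketch (illustrative only; ResNet50 stands in because tf.keras.applications does not ship AlexNet):

import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and
# all-reduces gradients (NCCL by default) at each step.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Any Keras model works here; ResNet50 is used as a stand-in since
    # tf.keras.applications does not include AlexNet.
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

# dataset = ...  # a tf.data pipeline over the ImageNet TFRecords
# model.fit(dataset, epochs=1)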

reedwm avatar Jan 17 '20 03:01 reedwm