AlexNet training on real data performs poorly on multi-GPU V100
Hi all,
I ran tf_cnn_benchmarks.py with the repository checked out to the cnn_tf_v1.13_compatible branch.
When more than 4 GPUs are used, throughput drops sharply and GPU utilization never exceeds 90%.
Does anyone have any ideas? Thank you for your help.
Total images/sec results:
1 GPU : 3861.48
2 GPU : 7831.85
3 GPU : 11164.2
4 GPU : 11548.3
5 GPU : 9254.16
6 GPU : 7057.19
7 GPU : 4178.24
8 GPU : 2811.72
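For scale: linear scaling from the 1-GPU number would give roughly 8 x 3861.48 ≈ 30,892 images/sec at 8 GPUs, but the measured 2811.72 is only about 9% of that, and the 8-GPU run is actually slower than a single GPU. Even at 4 GPUs, 11548.3 is already down to about 75% of linear scaling.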
Environment:
Host OS: CentOS 7
Container OS: Ubuntu 16.04.5
nvidia-driver: 410.79
CPU: Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
CPU cores: 36
GPU: Tesla V100-SXM2-32GB *8
Dataset: ImageNet
Image: NVIDIA GPU Cloud (NGC) Docker image: nvcr.io/nvidia/tensorflow:19.03-py3
Package versions in container:
python: 3.5.2
tensorflow: 1.13.1
CUDA: 10.1.105
cuDNN: 7.5.0
NCCL: 2.4.3
Command:
python tf_cnn_benchmarks.py \
--model=alexnet \
--batch_size=512 \
--num_gpus=8 \
--variable_update=replicated \
--all_reduce_spec=nccl \
--local_parameter_device=gpu \
--nodistortions \
--num_batches=300 \
--data_dir=/dataset/imagenet_train \
--data_name=imagenet
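One diagnostic worth trying (not part of the original run, so treat it as a sketch): if --data_dir is omitted, tf_cnn_benchmarks falls back to synthetic input data, which takes the input pipeline out of the picture. If the synthetic run scales well past 4 GPUs, the bottleneck is data loading/preprocessing rather than compute or all-reduce:

# Same run, but with synthetic data (no --data_dir / --data_name),
# to separate input-pipeline overhead from the multi-GPU scaling problem.
python tf_cnn_benchmarks.py \
  --model=alexnet \
  --batch_size=512 \
  --num_gpus=8 \
  --variable_update=replicated \
  --all_reduce_spec=nccl \
  --local_parameter_device=gpu \
  --nodistortions \
  --num_batches=300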
GPU utilization:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:1B:00.0 Off | 0 |
| N/A 35C P0 63W / 300W | 31394MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:1C:00.0 Off | 0 |
| N/A 33C P0 64W / 300W | 31394MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:3D:00.0 Off | 0 |
| N/A 35C P0 65W / 300W | 31394MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:3E:00.0 Off | 0 |
| N/A 36C P0 79W / 300W | 31394MiB / 32480MiB | 24% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... On | 00000000:B1:00.0 Off | 0 |
| N/A 33C P0 72W / 300W | 31394MiB / 32480MiB | 26% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000000:B2:00.0 Off | 0 |
| N/A 35C P0 67W / 300W | 31394MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000000:DB:00.0 Off | 0 |
| N/A 35C P0 65W / 300W | 31394MiB / 32480MiB | 56% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 00000000:DC:00.0 Off | 0 |
| N/A 34C P0 65W / 300W | 31394MiB / 32480MiB | 82% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 100396 C python 31347MiB |
| 1 100396 C python 31347MiB |
| 2 100396 C python 31347MiB |
| 3 100396 C python 31347MiB |
| 4 100396 C python 31347MiB |
| 5 100396 C python 31347MiB |
| 6 100396 C python 31347MiB |
| 7 100396 C python 31347MiB |
+-----------------------------------------------------------------------------+
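The bus IDs above (1B/1C, 3D/3E, B1/B2, DB/DC) suggest the GPUs hang off several different PCIe root complexes, so NCCL all-reduce traffic may start crossing sockets once more than 4 GPUs are involved. Checking the interconnect matrix is a cheap way to confirm (the command is standard nvidia-smi; the interpretation of the bus IDs above is my assumption):

# Print the GPU-to-GPU interconnect matrix. NV# entries mean NVLink hops;
# NODE/SYS entries mean traffic crosses PCIe and/or the inter-socket link,
# which is much slower for NCCL all-reduce.
nvidia-smi topo -m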
This is somewhat expected, as we never optimized for AlexNet. Its per-step time is very small, so the overhead of all-reducing the gradients takes up a larger fraction of the total time (although I'm not sure how large the gradients are).
Since tf_cnn_benchmarks is currently unmaintained and not written using modern TF2 features, the performance issue will likely not be addressed.
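For anyone who still wants to experiment, tf_cnn_benchmarks exposes other variable-update strategies that trade all-reduce cost for other overheads; which one wins is machine-dependent, so the following is only a sketch using flags from the v1.13-era script:

# Replicated variables with hierarchical copies instead of NCCL all-reduce.
python tf_cnn_benchmarks.py --model=alexnet --batch_size=512 --num_gpus=8 \
  --variable_update=replicated --hierarchical_copy=True \
  --nodistortions --num_batches=300

# Parameter-server-style updates with variables kept on GPU.
python tf_cnn_benchmarks.py --model=alexnet --batch_size=512 --num_gpus=8 \
  --variable_update=parameter_server --local_parameter_device=gpu \
  --nodistortions --num_batches=300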