Training time
I have been training an ASR task on 4 GPUs in both sync mode and async mode, but training is very slow. Here is the training log:
INFO:tensorflow:time:6.841992
INFO:tensorflow:2019.03.30-21:26:33 step: 24 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2489777 log_pplx/logits:9.2489777 loss:9.2489777 loss/logits:9.2489777 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:6.987753
INFO:tensorflow:2019.03.30-21:26:40 step: 25 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2211275 log_pplx/logits:9.2211275 loss:9.2211275 loss/logits:9.2211275 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:6.675498
INFO:tensorflow:2019.03.30-21:26:47 step: 26 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.1932364 log_pplx/logits:9.1932364 loss:9.1932364 loss/logits:9.1932364 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:7.539548
INFO:tensorflow:2019.03.30-21:26:54 step: 27 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2337065 log_pplx/logits:9.2337065 loss:9.2337065 loss/logits:9.2337065 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:6.667554
INFO:tensorflow:2019.03.30-21:27:01 step: 28 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2273502 log_pplx/logits:9.2273502 loss:9.2273502 loss/logits:9.2273502 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:7.178711
INFO:tensorflow:2019.03.30-21:27:08 step: 29 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2205391 log_pplx/logits:9.2205391 loss:9.2205391 loss/logits:9.2205391 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:6.959177
INFO:tensorflow:2019.03.30-21:27:15 step: 30 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2193136 log_pplx/logits:9.2193136 loss:9.2193136 loss/logits:9.2193136 num_samples_in_batch:128 lr:0.00025000001
I have looked at lingvo/lingvo/tasks/asr/params/librispeech.py; your training seems to take about 1 s per step. Can you give me some advice?
system info: CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz GPU:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54 Driver Version: 396.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:02:00.0 Off | N/A |
| 40% 65C P2 81W / 250W | 11841MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:03:00.0 Off | N/A |
| 42% 68C P2 84W / 250W | 11841MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp Off | 00000000:83:00.0 Off | N/A |
| 45% 72C P2 95W / 250W | 11841MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp Off | 00000000:84:00.0 Off | N/A |
| 45% 73C P2 86W / 250W | 11841MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
This is the GPU utilization; sometimes it drops to zero. I have also measured the time spent in the data pipeline and found it very fast, about 0.2 s per step when I swap in my own data pipeline.
Right, it seems that the GPU utilization is pretty low. Can you try some more tensorflow profiling to see where the time is spent?
Here is the TensorFlow timeline profiling output: timeline_3.json
OK, I'm not an expert in reading traces. I don't really see a smoking gun. Can you follow the instructions to profile, especially trying "advise" and the profiler-ui to see if you find anything useful?
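For reference, a minimal TF 1.x sketch for capturing a per-step Chrome trace and running the profiler's "advise" mode looks roughly like this (sess, train_op, and logdir are placeholders for whatever your training loop actually uses):

import tensorflow as tf
from tensorflow.python.client import timeline

# Collect detailed per-op timing for a single training step.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(train_op, options=run_options, run_metadata=run_metadata)

# Write a Chrome trace; open it at chrome://tracing or with profiler-ui.
tl = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline_step.json', 'w') as f:
  f.write(tl.generate_chrome_trace_format())

# Optionally attach the metadata to TensorBoard so the Graph tab can show
# compute time per node.
# tf.summary.FileWriter(logdir).add_run_metadata(run_metadata, 'profiled_step')

# The "advise" mode mentioned above prints heuristics about expensive ops,
# accelerator utilization, etc.
tf.profiler.advise(tf.get_default_graph(), run_meta=run_metadata)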
I suspected the TensorFlow version at one point and tried many versions of tf-nightly-gpu, but it had no effect. My current environment is tf_nightly_gpu-1.13.0.dev20181210, CUDA 9.0, cuDNN 7.3.0. Is that right?
I then tried TensorBoard to inspect the compute time, but it did not display the time spent.
Next, I followed your advice to profile and looked at the timeline, but could not find any clear evidence: timeline_20.json
It looks like much time is spent before the forward and backward passes, e.g. in the op 'RandomStandardNormal', but I don't know how to fix it.
The env looks fine. Is this in sync mode or async mode? I checked our run; it was in async mode with 8 machines x 2 GPUs and got 0.9 steps/s. Since you are using 4 GPUs (a quarter of those 16), you should expect roughly 0.9 / 4 ≈ 0.2 steps/s, i.e. about 5 s per step, and it seems you are getting slightly less than that.
I have tried both sync mode and async mode:
bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=asr.librispeech.Librispeech960Grapheme --logdir=/data/dingzhenyou/speech_data/librispeech/log/ --logtostderr --enable_asserts=false
bazel-bin/lingvo/trainer --run_locally=gpu --mode=async --model=asr.librispeech.Librispeech960Grapheme --logdir=/data/dingzhenyou/speech_data/librispeech/log/ --logtostderr --enable_asserts=false --job=controller,trainer
Am I running the commands correctly for local training?
My lingvo code is from Oct 2018, which is different from the current version, so I recompiled it, but an error occurred:
[dingzhenyou@gpu46 lingvo]$ bazel build -c opt //lingvo:trainer
INFO: Build options have changed, discarding analysis cache.
INFO: Analysed target //lingvo:trainer (37 packages loaded).
INFO: Found 1 target...
ERROR: /data/dingzhenyou/lingvo/lingvo/core/ops/BUILD:314:1: C++ compilation of rule '//lingvo/core/ops:hyps_proto' failed (Exit 1)
In file included from bazel-out/k8-opt/genfiles/lingvo/core/ops/hyps.pb.cc:4:0:
bazel-out/k8-opt/genfiles/lingvo/core/ops/hyps.pb.h:10:40: fatal error: google/protobuf/port_def.inc: No such file or directory
#include <google/protobuf/port_def.inc>
^
compilation terminated.
Target //lingvo:trainer failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 6.402s, Critical Path: 5.42s
INFO: 9 processes: 9 processwrapper-sandbox.
FAILED: Build did NOT complete successfully
This is my environment:
== cat /etc/issue ===============================================
Linux gpu46 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
== are we in docker =============================================
No
== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== bazel =====================================================
Build label: 0.17.2
Build time: Fri Sep 21 10:31:42 2018 (1537525902)
Build timestamp: 1537525902
Build timestamp as int: 1537525902
== uname -a =====================================================
Linux gpu46 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
numpy (1.14.5)
numpydoc (0.6.0)
protobuf (3.6.1)
tensorflow-estimator (1.13.0)
== check for virtualenv =========================================
False
== tensorflow import ============================================
tf.VERSION = 1.13.0-dev20181210
tf.GIT_VERSION = v1.12.0-rc0-4213-g54b110ae43
tf.COMPILER_VERSION = v1.12.0-rc0-4213-g54b110ae43
Sanity check: array([1], dtype=int32)
== env ==========================================================
LD_LIBRARY_PATH /data/dingzhenyou/train_data_loader/deps/boost/lib/:/data/dingzhenyou/train_data_loader/deps/opencv/lib/:/data/dingzhenyou/cuda/lib64:/usr/local/cuda-9.0/lib:/usr/local/cuda-9.0/lib64:
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
Wed Apr 17 07:15:53 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.77 Driver Version: 390.77 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:02:00.0 Off | N/A |
| 26% 46C P0 63W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:03:00.0 Off | N/A |
| 26% 47C P0 61W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp Off | 00000000:82:00.0 Off | N/A |
| 24% 43C P0 64W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp Off | 00000000:83:00.0 Off | N/A |
| 23% 42C P0 59W / 250W | 0MiB / 12196MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
== cuda libs ===================================================
/usr/local/cuda-8.0/doc/man/man7/libcudart.7
/usr/local/cuda-8.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-8.0/lib64/libcudart_static.a
/usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
/usr/local/cuda-9.0/doc/man/man7/libcudart.7
/usr/local/cuda-9.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-9.0/lib64/libcudart_static.a
/usr/local/cuda-9.0/lib64/libcudart.so.9.0.176
When you update the lingvo version you will almost always need to update the tf-nightly version as well.
I have updated tf-nightly, and training is still very slow. Here is my environment (tf_env.txt) and the training log; each step still takes 7~8 s: nohup.txt
The training was in async mode:
nohup bazel-bin/lingvo/trainer --run_locally=gpu --mode=async --model=asr.librispeech.Librispeech960Grapheme --logdir=/data/dingzhenyou/speech_data/librispeech/log/ --logtostderr --enable_asserts=false --job=controller,trainer &
Actually, I think that might be the right speed. We get ~1.1 s per step on 16 P100s, which would be ~5 s per step on 4 P100s, and a P100 is supposedly maybe 30% faster than a Titan Xp.
Thanks, I will try more advanced machines.
@dingevin did it help to run on P100s/V100s? I am getting around ~6-7 seconds per step (or Steps/second: 0.114680) on V100s (single host with 8 GPUs) in async mode. I was expecting it to be faster than P100s. From @jonathanasdf's numbers, for 8 GPUs it should be around ~3 seconds. Do you know of any other reason for the higher time per step? I am using the latest nightly build and lingvo, with the default recipe unchanged (except for the learning rate).
Hm, that's not good news...
Unfortunately it's very hard to debug performance issues remotely. Please try out some tensorflow profiling options to see if the GPUs are being utilized efficiently and whether the input pipeline is keeping up.
@datavizweb I have been training lingvo on a Tesla V100 lately, but training is not any faster.
@jonathanasdf From my experiments, GPU utilization is very high, but each step is still slow, which is very strange. Here is the log.
I have also found that the CPU load is very high; the load average is always > 5.
The run queue is also high:
$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
48 Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
$ cat /proc/cpuinfo| grep "physical id"| sort| uniq| wc -l
2
$ cat /proc/cpuinfo| grep "cpu cores"| uniq
cpu cores : 12
$ cat /proc/cpuinfo| grep "processor"| wc -l
48
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 161664 5572960 231776 232380256 0 0 2 4 0 0 18 12 70 0 0
9 0 161664 5570836 231776 232390528 0 0 10240 0 38752 396294 25 2 73 0 0
62 0 161664 5572952 231776 232390176 0 0 0 0 35005 435967 13 2 85 0 0
6 0 161664 5551236 231776 232405296 0 0 15104 8 107598 484743 19 11 69 0 0
5 0 161664 5551484 231776 232405296 0 0 0 0 30623 508293 16 2 82 0 0
2 0 161664 5551732 231776 232405296 0 0 0 0 31678 177303 7 1 92 0 0
10 0 161664 5551776 231776 232405296 0 0 0 0 43848 262920 9 2 89 0 0
6 0 161664 5551904 231776 232405296 0 0 0 0 57625 445091 14 2 84 0 0
12 0 161664 5552776 231776 232405296 0 0 0 16 59159 443863 14 2 84 0 0
13 0 161664 5552280 231776 232405296 0 0 0 0 48835 525581 15 2 83 0 0
5 0 161664 5534872 231776 232425952 0 0 20608 0 113457 338928 17 10 73 0 0
4 0 161664 5534932 231776 232425952 0 0 0 0 37296 666458 18 3 79 0 0
1 0 161664 5535120 231776 232425952 0 0 0 0 36611 494778 13 2 84 0 0
8 0 161664 5541208 231776 232486576 0 0 61696 0 41747 370774 24 3 73 0 0
9 0 161664 5540580 231776 232487808 0 0 0 16 36153 728851 19 3 78 0 0
1 0 161664 5535660 231776 232488320 0 0 0 0 27509 145425 8 1 91 0 0
3 0 161664 5537648 231776 232487968 0 0 0 0 35210 176295 7 1 92 0 0
3 0 161664 5538180 231776 232487968 0 0 0 0 52778 319345 10 2 87 0 0
4 0 161664 5538068 231776 232487968 0 0 0 0 55697 442565 13 2 85 0 0
9 0 161664 5538332 231776 232487968 0 0 0 4 56004 445039 13 2 85 0 0
10 0 161664 5538956 231776 232487968 0 0 0 0 49435 502612 15 2 83 0 0
10 0 161664 5538716 231776 232488720 0 0 896 0 52583 245029 14 8 78 0 0
8 0 161664 5531764 231776 232488864 0 0 0 0 35941 582441 17 3 80 0 0
12 0 161664 5532012 231776 232488864 0 0 0 0 25962 328696 12 2 86 0 0
64 0 161664 5532880 231776 232488864 0 0 0 0 32843 206023 8 1 91 0 0
10 0 161664 5511304 231776 232488864 0 0 0 39 61909 448895 14 2 84 0 0
7 0 161664 5511056 231776 232488864 0 0 0 0 60732 441861 14 2 84 0 0
It seems like there is a dip in the GPU graph every 6-8 seconds, which also corresponds to your step time. Could it be that the CPU/disk cannot keep up with the input processing?
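One low-tech way to correlate those dips with step boundaries is to sample utilization once a second while training runs. The nvidia-smi query flags below are standard; the wrapper loop is just a sketch:

import subprocess
import time

# Print a timestamped utilization/memory line for every GPU, once per second.
while True:
  out = subprocess.check_output(
      ['nvidia-smi', '--query-gpu=index,utilization.gpu,memory.used',
       '--format=csv,noheader'])
  print(time.strftime('%H:%M:%S'), out.decode().strip().replace('\n', ' | '))
  time.sleep(1)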
I have tested the disk I/O but didn't see any bottleneck; here is the test output: librispeech.log. I also placed the training data on a shared server to test the network I/O and couldn't see anything there either. Here is the network I/O:
Sorry, I tried checking with a few people on our side and we didn't have any other hypothesis for why it's slower for you :(
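If you want to rule out raw disk throughput completely, one more thing worth trying is timing a plain sequential read of the training shards, independently of the training job's input pipeline. A rough sketch (the shard pattern below is a placeholder; point it at wherever your Librispeech tfrecords actually live):

import time
import tensorflow as tf

PATTERN = '/path/to/librispeech/train/train.tfrecords-*'  # placeholder path

start, n_records, n_bytes = time.time(), 0, 0
for path in tf.gfile.Glob(PATTERN):
  for record in tf.python_io.tf_record_iterator(path):
    n_records += 1
    n_bytes += len(record)
elapsed = time.time() - start
print('%d records read at %.1f MB/s' % (n_records, n_bytes / 1e6 / elapsed))

If this number is much higher than what the trainer consumes per second, the bottleneck is not the disk itself but the CPU-side processing between the records and the GPUs.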
@dingevin we can generate random features (batches in memory) and feed them to the model from the input_generator. If it still takes 6-7 seconds per step, I/O is not the issue. We would then need to profile and figure out exactly which component is taking more time. As my next debugging step, I am planning to do that at the end of next week. If you have time, can you check it?
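A minimal sketch of that idea, assuming Librispeech-style 80-dim filterbank features. All shapes and the helper below are hypothetical (this is not lingvo's actual input_generator API); it would need to be wired into whichever input generator you are using so the model sees these tensors instead of data read from disk:

import tensorflow as tf

# Hypothetical batch shapes for 80-dim filterbank ASR input; adjust to your config.
BATCH, MAX_FRAMES, FEAT_DIM, MAX_LABEL_LEN, VOCAB = 48, 1500, 80, 100, 76

def random_input_batch():
  """Builds a purely in-memory synthetic batch, bypassing disk and featurization."""
  feats = tf.random_normal([BATCH, MAX_FRAMES, FEAT_DIM, 1])
  feat_paddings = tf.zeros([BATCH, MAX_FRAMES])
  labels = tf.random_uniform(
      [BATCH, MAX_LABEL_LEN], maxval=VOCAB, dtype=tf.int32)
  label_paddings = tf.zeros([BATCH, MAX_LABEL_LEN])
  return feats, feat_paddings, labels, label_paddings

If the step time stays at 6-7 s even with the model fed this way, the bottleneck is in the compute/update path rather than in I/O.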
@jonathanasdf
Related to this issue and how the speeds are measured and compared in this thread: are the metrics printed during the run cumulative, per GPU, or both? It seems that steps/sec is fairly constant, but the number of examples per second grows linearly with the number of GPUs. Any clarification on this would be appreciated.
GPUs  Steps/sec  Examples/sec
 1     0.128        6.484
 2     0.111       11.513
 4     0.103       21.002
 8     0.105       42.436
I believe that depends on whether you are running in sync mode (the default) or async mode.
In sync mode all GPUs need to complete their computation before the step is updated, so the step rate is constant. In async mode each GPU updates parameters independently so the step rate should scale with the number of GPUs.
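As a quick sanity check on the table above (just arithmetic on the posted numbers, no new measurements): dividing examples/sec by steps/sec and by the number of GPUs gives a roughly constant per-GPU, per-step batch of about 51 examples, which is consistent with examples/sec being aggregated across all GPUs while steps/sec is a single global counter.

# Numbers copied from the table above.
rows = [(1, 0.128, 6.484), (2, 0.111, 11.513), (4, 0.103, 21.002), (8, 0.105, 42.436)]
for gpus, steps_per_sec, examples_per_sec in rows:
  per_gpu_batch = examples_per_sec / (steps_per_sec * gpus)
  print(gpus, round(per_gpu_batch, 1))  # prints roughly 51 for every row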
Watching this.
Could you give some tips on how to profile with GPipe? I've created a model that's GPipe-compatible (it uses 4 GPUs at the moment). Are there any particular tools?