Training time
I have been training an ASR task on 4 GPUs in both sync mode and async mode, but training is very slow. Here is the training log:
INFO:tensorflow:time:6.841992
INFO:tensorflow:2019.03.30-21:26:33 step: 24 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2489777 log_pplx/logits:9.2489777 loss:9.2489777 loss/logits:9.2489777 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:6.987753
INFO:tensorflow:2019.03.30-21:26:40 step: 25 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2211275 log_pplx/logits:9.2211275 loss:9.2211275 loss/logits:9.2211275 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:6.675498
INFO:tensorflow:2019.03.30-21:26:47 step: 26 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.1932364 log_pplx/logits:9.1932364 loss:9.1932364 loss/logits:9.1932364 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:7.539548
INFO:tensorflow:2019.03.30-21:26:54 step: 27 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2337065 log_pplx/logits:9.2337065 loss:9.2337065 loss/logits:9.2337065 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:6.667554
INFO:tensorflow:2019.03.30-21:27:01 step: 28 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2273502 log_pplx/logits:9.2273502 loss:9.2273502 loss/logits:9.2273502 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:7.178711
INFO:tensorflow:2019.03.30-21:27:08 step: 29 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2205391 log_pplx/logits:9.2205391 loss:9.2205391 loss/logits:9.2205391 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:6.959177
INFO:tensorflow:2019.03.30-21:27:15 step: 30 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2193136 log_pplx/logits:9.2193136 loss:9.2193136 loss/logits:9.2193136 num_samples_in_batch:128 lr:0.00025000001
I have looked at lingvo/lingvo/tasks/asr/params/librispeech.py; your training seems to take about 1 s per step. Can you give me some advice?
system info: CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz GPU:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54 Driver Version: 396.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:02:00.0 Off | N/A |
| 40% 65C P2 81W / 250W | 11841MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:03:00.0 Off | N/A |
| 42% 68C P2 84W / 250W | 11841MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp Off | 00000000:83:00.0 Off | N/A |
| 45% 72C P2 95W / 250W | 11841MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp Off | 00000000:84:00.0 Off | N/A |
| 45% 73C P2 86W / 250W | 11841MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
This is the GPU utilization; sometimes it drops to zero. I have also measured the time spent in the data pipeline and found it very fast, about 0.2 s per step when I swap in my own data pipeline.
Right, it seems that the GPU utilization is pretty low. Can you try some more tensorflow profiling to see where the time is spent?
Here is the TensorFlow timeline profiling output: timeline_3.json
OK, I'm not an expert in reading traces. I don't really see a smoking gun. Can you follow the instructions to profile, especially trying "advise" and the profiler-ui to see if you find anything useful?
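For reference, a minimal TF 1.x sketch for capturing a per-step Chrome trace and running the profiler's "advise" mode looks roughly like this (sess, train_op, and logdir are placeholders for whatever your training loop actually uses):

import tensorflow as tf
from tensorflow.python.client import timeline

# Collect detailed per-op timing for a single training step.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(train_op, options=run_options, run_metadata=run_metadata)

# Write a Chrome trace; open it at chrome://tracing or with profiler-ui.
tl = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline_step.json', 'w') as f:
  f.write(tl.generate_chrome_trace_format())

# Optionally attach the metadata to TensorBoard so the Graph tab can show
# compute time per node.
# tf.summary.FileWriter(logdir).add_run_metadata(run_metadata, 'profiled_step')

# The "advise" mode mentioned above prints heuristics about expensive ops,
# accelerator utilization, etc.
tf.profiler.advise(tf.get_default_graph(), run_meta=run_metadata)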
I suspected the TensorFlow version at one point and tried many versions of tf-nightly-gpu, but it had no effect. My current environment is tf_nightly_gpu-1.13.0.dev20181210, CUDA 9.0, cuDNN 7.3.0. Is that right?
I then tried TensorBoard to inspect the compute time, but it did not display the time spent.
Next, I followed your advice to profile and looked at the timeline, but could not find any clear evidence: timeline_20.json
It looks like much time is spent before the forward and backward passes, e.g. in the op 'RandomStandardNormal', but I don't know how to fix it.
The env looks fine. Is this in sync mode or async mode? I checked our run; it was in async mode with 8 machines x 2 GPUs and got 0.9 steps/s. Since you are using 4 GPUs (a quarter of those 16), you should expect roughly 0.9 / 4 ≈ 0.2 steps/s, i.e. about 5 s per step, and it seems you are getting slightly less than that.
I have tried both sync mode and async mode:
bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=asr.librispeech.Librispeech960Grapheme --logdir=/data/dingzhenyou/speech_data/librispeech/log/ --logtostderr --enable_asserts=false
bazel-bin/lingvo/trainer --run_locally=gpu --mode=async --model=asr.librispeech.Librispeech960Grapheme --logdir=/data/dingzhenyou/speech_data/librispeech/log/ --logtostderr --enable_asserts=false --job=controller,trainer
Am I running the commands correctly for local training?
My lingvo code is from Oct 2018, which is different from the current version, so I recompiled it, but an error occurred:
[dingzhenyou@gpu46 lingvo]$ bazel build -c opt //lingvo:trainer
INFO: Build options have changed, discarding analysis cache.
INFO: Analysed target //lingvo:trainer (37 packages loaded).
INFO: Found 1 target...
ERROR: /data/dingzhenyou/lingvo/lingvo/core/ops/BUILD:314:1: C++ compilation of rule '//lingvo/core/ops:hyps_proto' failed (Exit 1)
In file included from bazel-out/k8-opt/genfiles/lingvo/core/ops/hyps.pb.cc:4:0:
bazel-out/k8-opt/genfiles/lingvo/core/ops/hyps.pb.h:10:40: fatal error: google/protobuf/port_def.inc: No such file or directory
#include <google/protobuf/port_def.inc>
^
compilation terminated.
Target //lingvo:trainer failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 6.402s, Critical Path: 5.42s
INFO: 9 processes: 9 processwrapper-sandbox.
FAILED: Build did NOT complete successfully
This is my environment:
== cat /etc/issue ===============================================
Linux gpu46 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
== are we in docker =============================================
No
== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== bazel =====================================================
Build label: 0.17.2
Build time: Fri Sep 21 10:31:42 2018 (1537525902)
Build timestamp: 1537525902
Build timestamp as int: 1537525902
== uname -a =====================================================
Linux gpu46 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
numpy (1.14.5)
numpydoc (0.6.0)
protobuf (3.6.1)
tensorflow-estimator (1.13.0)
== check for virtualenv =========================================
False
== tensorflow import ============================================
tf.VERSION = 1.13.0-dev20181210
tf.GIT_VERSION = v1.12.0-rc0-4213-g54b110ae43
tf.COMPILER_VERSION = v1.12.0-rc0-4213-g54b110ae43
Sanity check: array([1], dtype=int32)
== env ==========================================================
LD_LIBRARY_PATH /data/dingzhenyou/train_data_loader/deps/boost/lib/:/data/dingzhenyou/train_data_loader/deps/opencv/lib/:/data/dingzhenyou/cuda/lib64:/usr/local/cuda-9.0/lib:/usr/local/cuda-9.0/lib64:
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
Wed Apr 17 07:15:53 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.77 Driver Version: 390.77 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:02:00.0 Off | N/A |
| 26% 46C P0 63W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:03:00.0 Off | N/A |
| 26% 47C P0 61W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp Off | 00000000:82:00.0 Off | N/A |
| 24% 43C P0 64W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp Off | 00000000:83:00.0 Off | N/A |
| 23% 42C P0 59W / 250W | 0MiB / 12196MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
== cuda libs ===================================================
/usr/local/cuda-8.0/doc/man/man7/libcudart.7
/usr/local/cuda-8.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-8.0/lib64/libcudart_static.a
/usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
/usr/local/cuda-9.0/doc/man/man7/libcudart.7
/usr/local/cuda-9.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-9.0/lib64/libcudart_static.a
/usr/local/cuda-9.0/lib64/libcudart.so.9.0.176
When you update the lingvo version you will almost always need to update the tf-nightly version as well.
I have updated tf-nightly, and training is still very slow. Here is my environment (tf_env.txt) and the training log; each step still takes 7~8 s: nohup.txt
The training was in async mode:
nohup bazel-bin/lingvo/trainer --run_locally=gpu --mode=async --model=asr.librispeech.Librispeech960Grapheme --logdir=/data/dingzhenyou/speech_data/librispeech/log/ --logtostderr --enable_asserts=false --job=controller,trainer &
Actually, I think that might be the right speed. We get ~1.1 s per step on 16 P100s, which would be ~5 s per step on 4 P100s, and a P100 is supposedly maybe 30% faster than a Titan Xp.
Thanks, I will try more advanced machines.
@dingevin did it help to run on P100s/V100s? I am getting around ~6-7 seconds per step (or Steps/second: 0.114680) on V100s (single host with 8 GPUs) in async mode. I was expecting it to be faster than P100s. From @jonathanasdf's numbers, for 8 GPUs it should be around ~3 seconds. Do you know of any other reason for the higher time per step? I am using the latest nightly build and lingvo, with the default recipe unchanged (except for the learning rate).
Hm, that's not good news...
Unfortunately it's very hard to debug performance issues remotely. Please try out some tensorflow profiling options to see if the GPUs are being utilized efficiently and whether the input pipeline is keeping up.
@datavizweb I have been training lingvo on a Tesla V100 lately, but training is not any faster.
@jonathanasdf From my experiments, GPU utilization is very high, but each step is still slow, which is very strange. Here is the log.
I have also found that the CPU load is very high; the load average is always > 5.
The run queue is also high:
$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
48 Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
$ cat /proc/cpuinfo| grep "physical id"| sort| uniq| wc -l
2
$ cat /proc/cpuinfo| grep "cpu cores"| uniq
cpu cores : 12
$ cat /proc/cpuinfo| grep "processor"| wc -l
48
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 161664 5572960 231776 232380256 0 0 2 4 0 0 18 12 70 0 0
9 0 161664 5570836 231776 232390528 0 0 10240 0 38752 396294 25 2 73 0 0
62 0 161664 5572952 231776 232390176 0 0 0 0 35005 435967 13 2 85 0 0
6 0 161664 5551236 231776 232405296 0 0 15104 8 107598 484743 19 11 69 0 0
5 0 161664 5551484 231776 232405296 0 0 0 0 30623 508293 16 2 82 0 0
2 0 161664 5551732 231776 232405296 0 0 0 0 31678 177303 7 1 92 0 0
10 0 161664 5551776 231776 232405296 0 0 0 0 43848 262920 9 2 89 0 0
6 0 161664 5551904 231776 232405296 0 0 0 0 57625 445091 14 2 84 0 0
12 0 161664 5552776 231776 232405296 0 0 0 16 59159 443863 14 2 84 0 0
13 0 161664 5552280 231776 232405296 0 0 0 0 48835 525581 15 2 83 0 0
5 0 161664 5534872 231776 232425952 0 0 20608 0 113457 338928 17 10 73 0 0
4 0 161664 5534932 231776 232425952 0 0 0 0 37296 666458 18 3 79 0 0
1 0 161664 5535120 231776 232425952 0 0 0 0 36611 494778 13 2 84 0 0
8 0 161664 5541208 231776 232486576 0 0 61696 0 41747 370774 24 3 73 0 0
9 0 161664 5540580 231776 232487808 0 0 0 16 36153 728851 19 3 78 0 0
1 0 161664 5535660 231776 232488320 0 0 0 0 27509 145425 8 1 91 0 0
3 0 161664 5537648 231776 232487968 0 0 0 0 35210 176295 7 1 92 0 0
3 0 161664 5538180 231776 232487968 0 0 0 0 52778 319345 10 2 87 0 0
4 0 161664 5538068 231776 232487968 0 0 0 0 55697 442565 13 2 85 0 0
9 0 161664 5538332 231776 232487968 0 0 0 4 56004 445039 13 2 85 0 0
10 0 161664 5538956 231776 232487968 0 0 0 0 49435 502612 15 2 83 0 0
10 0 161664 5538716 231776 232488720 0 0 896 0 52583 245029 14 8 78 0 0
8 0 161664 5531764 231776 232488864 0 0 0 0 35941 582441 17 3 80 0 0
12 0 161664 5532012 231776 232488864 0 0 0 0 25962 328696 12 2 86 0 0
64 0 161664 5532880 231776 232488864 0 0 0 0 32843 206023 8 1 91 0 0
10 0 161664 5511304 231776 232488864 0 0 0 39 61909 448895 14 2 84 0 0
7 0 161664 5511056 231776 232488864 0 0 0 0 60732 441861 14 2 84 0 0
It seems like there is a dip in the GPU graph every 6-8 seconds, which also corresponds to your step time. Could it be that the CPU/disk cannot keep up with the input processing?
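One low-tech way to correlate those dips with step boundaries is to sample utilization once a second while training runs. The nvidia-smi query flags below are standard; the wrapper loop is just a sketch:

import subprocess
import time

# Print a timestamped utilization/memory line for every GPU, once per second.
while True:
  out = subprocess.check_output(
      ['nvidia-smi', '--query-gpu=index,utilization.gpu,memory.used',
       '--format=csv,noheader'])
  print(time.strftime('%H:%M:%S'), out.decode().strip().replace('\n', ' | '))
  time.sleep(1)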
I have tested the disk I/O but didn't see any bottleneck; here is the test output: librispeech.log. I also placed the training data on a shared server to test the network I/O and couldn't see anything there either. Here is the network I/O:
Sorry, I tried checking with a few people on our side and we didn't have any other hypothesis for why it's slower for you :(
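If you want to rule out raw disk throughput completely, one more thing worth trying is timing a plain sequential read of the training shards, independently of the training job's input pipeline. A rough sketch (the shard pattern below is a placeholder; point it at wherever your Librispeech tfrecords actually live):

import time
import tensorflow as tf

PATTERN = '/path/to/librispeech/train/train.tfrecords-*'  # placeholder path

start, n_records, n_bytes = time.time(), 0, 0
for path in tf.gfile.Glob(PATTERN):
  for record in tf.python_io.tf_record_iterator(path):
    n_records += 1
    n_bytes += len(record)
elapsed = time.time() - start
print('%d records read at %.1f MB/s' % (n_records, n_bytes / 1e6 / elapsed))

If this number is much higher than what the trainer consumes per second, the bottleneck is not the disk itself but the CPU-side processing between the records and the GPUs.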
@dingevin we can generate random features (batches in memory) and feed them to the model from the input_generator. If it still takes 6-7 seconds per step, I/O is not the issue. We would then need to profile and figure out exactly which component is taking more time. As my next debugging step, I am planning to do that at the end of next week. If you have time, can you check it?
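A minimal sketch of that idea, assuming Librispeech-style 80-dim filterbank features. All shapes and the helper below are hypothetical (this is not lingvo's actual input_generator API); it would need to be wired into whichever input generator you are using so the model sees these tensors instead of data read from disk:

import tensorflow as tf

# Hypothetical batch shapes for 80-dim filterbank ASR input; adjust to your config.
BATCH, MAX_FRAMES, FEAT_DIM, MAX_LABEL_LEN, VOCAB = 48, 1500, 80, 100, 76

def random_input_batch():
  """Builds a purely in-memory synthetic batch, bypassing disk and featurization."""
  feats = tf.random_normal([BATCH, MAX_FRAMES, FEAT_DIM, 1])
  feat_paddings = tf.zeros([BATCH, MAX_FRAMES])
  labels = tf.random_uniform(
      [BATCH, MAX_LABEL_LEN], maxval=VOCAB, dtype=tf.int32)
  label_paddings = tf.zeros([BATCH, MAX_LABEL_LEN])
  return feats, feat_paddings, labels, label_paddings

If the step time stays at 6-7 s even with the model fed this way, the bottleneck is in the compute/update path rather than in I/O.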
@jonathanasdf
Related to this issue and how the speeds are measured and compared in this thread: are the metrics printed during the run cumulative, per GPU, or both? It seems that steps/sec is fairly constant, but the number of examples per second grows linearly with the number of GPUs. Any clarification on this would be appreciated.
GPUs  Steps/sec  Examples/sec
 1     0.128        6.484
 2     0.111       11.513
 4     0.103       21.002
 8     0.105       42.436
I believe that depends on whether you are running in sync mode (the default) or async mode.
In sync mode all GPUs need to complete their computation before the step is updated, so the step rate is constant. In async mode each GPU updates parameters independently so the step rate should scale with the number of GPUs.
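As a quick sanity check on the table above (just arithmetic on the posted numbers, no new measurements): dividing examples/sec by steps/sec and by the number of GPUs gives a roughly constant per-GPU, per-step batch of about 51 examples, which is consistent with examples/sec being aggregated across all GPUs while steps/sec is a single global counter.

# Numbers copied from the table above.
rows = [(1, 0.128, 6.484), (2, 0.111, 11.513), (4, 0.103, 21.002), (8, 0.105, 42.436)]
for gpus, steps_per_sec, examples_per_sec in rows:
  per_gpu_batch = examples_per_sec / (steps_per_sec * gpus)
  print(gpus, round(per_gpu_batch, 1))  # prints roughly 51 for every row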
Watching this.
Could you give some tips on how to profile with GPipe? I've created a model that's GPipe-compatible (it uses 4 GPUs at the moment). Are there any particular tools?