Why Lingvo is so slow when training Librispeech960Wpm on a host with 6 GPUs
My setup is a single host with 6 GPUs (V100). The speed is about 0.12 steps/sec, so the 800,000 steps would take several months.
My running command is as follows: bazel-bin/lingvo/trainer --saver_max_to_keep=3 --worker_gpus=6 --worker_replicas=3 --run_locally=gpu --mode=async --model=asr.librispeech.Librispeech960Wpm --logdir=/tmp/lingvo/asr_2 --logtostderr --enable_asserts=false --job=controller,trainer
Through nvidia-smi I can see that each GPU's "GPU-Util" is between 30% and 60%.
Any suggestions for the configuration?
Thanks
2021-03-19 00:03:38.408969: I lingvo/core/ops/record_yielder.cc:604] Record 1963: key=00001962
2021-03-19 00:03:39.040448: I lingvo/core/ops/record_yielder.cc:614] Emitted 2551 records from /tmp/lingvo/lingvo/speech_data/train/train.tfrecords-00034-of-00100
I0319 00:03:48.392294 139978034091776 summary_utils.py:398] Steps/second: 0.123965, Examples/second: 71.403573
I0319 00:03:48.394137 139978034091776 trainer_impl.py:199] step: 223, steps/sec: 0.12, examples/sec: 71.40 fraction_of_correct_next_step_preds:0.14076507 fraction_of_correct_next_step_preds/logits:0.14076507 grad_norm/all/loss:0.5464555 grad_scale_all/loss:1 has_nan_or_inf/loss:0 l2_loss/loss:0.58279711 learning_rate/loss:0.00025000001 log_pplx:6.1487246 log_pplx/logits:6.1487246 loss:6.1487246 loss/logits:6.1487246 num_samples_in_batch:576 token_normed_prob:0.0021416517 token_normed_prob/logits:0.0021416517 var_norm/all/loss:1079.627
2021-03-19 00:03:48.516815: I lingvo/core/ops/record_yielder.cc:604] Record 2094: key=00002093
I0319 00:03:57.226976 139978034091776 summary_utils.py:398] Steps/second: 0.124019, Examples/second: 71.434948
I0319 00:03:57.228809 139978034091776 trainer_impl.py:199] step: 224, steps/sec: 0.12, examples/sec: 71.43 fraction_of_correct_next_step_preds:0.14581797 fraction_of_correct_next_step_preds/logits:0.14581797 grad_norm/all/loss:0.53552681 grad_scale_all/loss:1 has_nan_or_inf/loss:0 l2_loss/loss:0.58274323 learning_rate/loss:0.00025000001 log_pplx:6.1034293 log_pplx/logits:6.1034293 loss:6.1034293 loss/logits:6.1034293 num_samples_in_batch:576 token_normed_prob:0.0022410566 token_normed_prob/logits:0.0022410566 var_norm/all/loss:1079.577
TensorFlow has some profiling guides: https://www.tensorflow.org/guide/profiler and https://www.tensorflow.org/guide/gpu_performance_analysis
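If you want to capture a profile programmatically, here is a minimal sketch using the TF2 profiler API from those guides. The profile directory, the number of profiled steps, and the run_one_training_step placeholder are all assumptions for illustration; in practice you would wrap the Trace context around Lingvo's actual training step.

```python
import tensorflow as tf

PROFILE_LOGDIR = "/tmp/lingvo/asr_2/profile"  # hypothetical output directory


def run_one_training_step():
  # Placeholder standing in for a real Lingvo training step.
  a = tf.random.normal([512, 512])
  b = tf.random.normal([512, 512])
  return tf.matmul(a, b)


tf.profiler.experimental.start(PROFILE_LOGDIR)
for step in range(10):
  # Marking steps lets TensorBoard's Profile tab show per-step breakdowns.
  with tf.profiler.experimental.Trace("train", step_num=step, _r=1):
    run_one_training_step()
tf.profiler.experimental.stop()
```

Afterwards, point TensorBoard at the same directory and open the Profile tab to see where the step time is going (kernel launch, host input pipeline, GPU compute, etc.).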
One important thing to check is whether the training is disk I/O bound. If that turns out to be the case you may need to consider moving the data onto an SSD. I believe you should be able to get about 1.2 steps/sec with 16 GPUs, so roughly 0.45 steps/sec with 6 GPUs.
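A rough way to check for an input-bound job, independent of the model, is to time how fast the raw TFRecords can be read back. This sketch assumes TF2 eager mode; the shard pattern is taken from the record_yielder log above, and the 50,000-record sample and the parallelism of 8 are arbitrary. If the standalone read rate is not comfortably above the ~71 examples/sec the trainer reports, the disk is a likely bottleneck.

```python
import time
import tensorflow as tf

# Shard pattern as it appears in the record_yielder log above; adjust to your setup.
PATTERN = "/tmp/lingvo/lingvo/speech_data/train/train.tfrecords-*"

files = tf.data.Dataset.list_files(PATTERN)
ds = tf.data.TFRecordDataset(files, num_parallel_reads=8)

start = time.time()
count = 0
for _ in ds.take(50000):  # sample a fixed number of records
  count += 1
elapsed = time.time() - start
print("read %d records in %.1fs (%.1f records/sec)" % (count, elapsed, count / elapsed))
```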
Hi, I have 4 GPUs (V100) and I want to try to run this model, but I don't know what the numbers for saver_max_to_keep & worker_replicas mean. Should I set the same numbers as you did?
I don't know exactly what they mean either. My guess from their names:
saver_max_to_keep: keep this many historical models on disk; if it is not set, many old models are kept, which takes a lot of disk space.
worker_replicas: the number of workers that do the actual work of training the model in parallel.
I hope @jonathanasdf can confirm.
Yes, that is correct. worker_replicas = number of machines you have in the cluster; worker_gpus = number of GPUs per machine. Actually, reading #1, I think worker_replicas should be set to 1, because you only have 1 machine with 6 GPUs in it. But I guess setting it to a larger number still works.
IIRC, saver_max_to_keep refers to the maximum number of checkpoint files stored. @yujunlhz
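For what it's worth, the pruning behavior behind a max_to_keep-style flag is plain TensorFlow. The sketch below illustrates it with tf.train.CheckpointManager rather than Lingvo's own saver, so treat it as an analogy, not Lingvo's exact code path; the /tmp/ckpt_demo directory is arbitrary.

```python
import tensorflow as tf

# Generic TF illustration of max_to_keep: only the N most recent checkpoints
# survive on disk; older ones are deleted automatically.
ckpt = tf.train.Checkpoint(step=tf.Variable(0))
manager = tf.train.CheckpointManager(ckpt, directory="/tmp/ckpt_demo", max_to_keep=3)

for _ in range(5):
  ckpt.step.assign_add(1)
  manager.save()

print(manager.checkpoints)  # only the 3 newest checkpoints remain, like --saver_max_to_keep=3
```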