GPU utilization drops to 0% without any error info
Hi, I've been training models for almost two days. Today, GPU utilization suddenly dropped to 0%, but all GPU memory was still occupied by the experiment. In addition, the experiment log stopped displaying any information, whether training progress or error messages.
The upper-left part of the figure below shows the logs; the lower-left part shows the nvidia-smi output.

Does anyone know what's going on?
In addition, I tried to set up the environment without docker, but I ran into #32, where I pasted my error at the end. Could you please help me? Thanks a lot!!!
Can you kill the job and restart it? It should resume training.
I've never seen it just stop progressing without any error messages, so I have no idea what could be going on.
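For reference, here is a rough restart sketch, assuming the single-machine Librispeech codelab setup; the process pattern, model name, and logdir below are placeholders, so adjust them to your actual run:

```bash
# Kill the stuck trainer process (the pattern is a guess; check `ps` for your exact command line).
pkill -f lingvo/trainer

# Relaunch with the same --logdir; the trainer should pick up the latest
# checkpoint under <logdir>/train and resume training from there.
bazel-bin/lingvo/trainer \
  --run_locally=gpu \
  --mode=sync \
  --model=asr.librispeech.Librispeech960Grapheme \
  --logdir=/tmp/librispeech \
  --logtostderr
```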
Thank you~ I have experienced this situation twice, each time after about two or three days of running. After I stopped and restarted the job, training resumed.
Could you please also have a look at this: I tried to set up the environment without docker, but I ran into #32, where I pasted my error at the end. Could you please help me? Thanks a lot!!!
I have tried many ways to solve it, but it still doesn't work. Also, I did not see TensorFlow being installed in the dockerfile, yet TensorFlow 1.14.1 shows up after installing the environment via docker. Without docker, TensorFlow 1.14.1 has to be built from source, because PyPI doesn't have that version. I'd like to know why docker can install that version directly. Is my problem related to the TensorFlow version?
The docker image installs tf-nightly (here: https://github.com/tensorflow/lingvo/blob/e649e651e80ec1ad092a4d6777486ace5ea2c3f9/docker/dev.dockerfile#L75). That might be the source of the problem if you have a different TensorFlow version.
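A quick way to line up with the docker environment outside docker is to install the same nightly build and check what actually gets imported; a rough sketch (the dockerfile may pin a specific nightly, so check the line linked above):

```bash
# Install the nightly TensorFlow build, as dev.dockerfile does,
# then print the version that lingvo will actually import.
pip3 install --upgrade tf-nightly
python3 -c "import tensorflow as tf; print(tf.__version__)"
```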
That's right! That was indeed the problem. Thanks!
I am seeing the same issue with the libri recipe.
While running the libri grapheme recipe (with default params, no changes to the recipe), I see the loss decreasing over steps. But after some time, losses are no longer computed at all, and the job stays on the same step (the same step gets checkpointed again and again). The first time it happened I killed and restarted the job; after a few more steps, GPU utilization dropped to zero again.
** runs with decent GPU utilization, loss reduces, steps per second seems okay too **
I0505 21:07:48.797380 140580822693632 trainer.py:371] Steps/second: 0.117520, Examples/second: 48.137719
I0505 21:07:53.444988 140580814300928 trainer.py:520] step: 11449 fraction_of_correct_next_step_preds:0.98058969 fraction_of_correct_next_step_preds/logits:0.98058969 grad_norm/all:1.6253868 grad_scale_all:0.61523813 log_pplx:0.062722519 log_pplx/logits:0.062722519 loss:0.062722519 loss/logits:0.062722519 num_samples_in_batch:384 var_norm/all:608.57135
I0505 21:07:58.806945 140580822693632 trainer.py:371] Steps/second: 0.117520, Examples/second: 48.137722
I0505 21:08:02.775715 140580814300928 trainer.py:520] step: 11450 fraction_of_correct_next_step_preds:0.98301238 fraction_of_correct_next_step_preds/logits:0.98301238 grad_norm/all:1.4966037 grad_scale_all:0.66817957 log_pplx:0.054031234 log_pplx/logits:0.054031234 loss:0.054031234 loss/logits:0.054031234 num_samples_in_batch:384 var_norm/all:608.56183
** From here on, losses are not computed and GPU usage drops to zero **
I0505 21:08:08.816323 140580822693632 trainer.py:371] Steps/second: 0.117520, Examples/second: 48.137720
I0505 21:08:18.826544 140580822693632 trainer.py:371] Steps/second: 0.117506, Examples/second: 48.132058
I0505 21:08:28.836873 140580822693632 trainer.py:371] Steps/second: 0.117492, Examples/second: 48.126397
I0505 21:08:38.846771 140580822693632 trainer.py:371] Steps/second: 0.117479, Examples/second: 48.120738
I0505 21:08:48.856947 140580822693632 trainer.py:371] Steps/second: 0.117465, Examples/second: 48.115080
I0505 21:08:58.866631 140580822693632 trainer.py:371] Steps/second: 0.117451, Examples/second: 48.109423
I0505 21:09:08.877096 140580822693632 trainer.py:371] Steps/second: 0.117437, Examples/second: 48.103767
I0505 21:09:18.887014 140580822693632 trainer.py:371] Steps/second: 0.117423, Examples/second: 48.098113
** Same checkpoint saved again **
I0505 22:16:35.073483 140580822693632 trainer.py:270] Save checkpoint done: /tmp/librispeech/train/ckpt-00011450
I0505 22:16:42.773545 140580822693632 trainer.py:371] Steps/second: 0.112100, Examples/second: 45.917725
I0505 22:16:52.783426 140580822693632 trainer.py:371] Steps/second: 0.112088, Examples/second: 45.912573
I0505 22:17:02.793808 140580822693632 trainer.py:371] Steps/second: 0.112075, Examples/second: 45.907422
I0505 22:26:35.644196 140580822693632 trainer.py:270] Save checkpoint done: /tmp/librispeech/train/ckpt-00011450
I0505 22:26:43.352466 140580822693632 trainer.py:371] Steps/second: 0.111351, Examples/second: 45.610651
I0505 22:26:53.362185 140580822693632 trainer.py:371] Steps/second: 0.111338, Examples/second: 45.605568
I0505 22:27:03.372393 140580822693632 trainer.py:371] Steps/second: 0.111326, Examples/second: 45.600485
I0506 04:06:55.412421 140580822693632 trainer.py:270] Save checkpoint done: /tmp/librispeech/train/ckpt-00011450
I0506 04:07:03.148743 140580822693632 trainer.py:371] Steps/second: 0.090723, Examples/second: 37.161114
I0506 04:07:13.158670 140580822693632 trainer.py:371] Steps/second: 0.090714, Examples/second: 37.157740
I0506 04:07:23.168779 140580822693632 trainer.py:371] Steps/second: 0.090706, Examples/second: 37.154366
I0506 04:07:33.178835 140580822693632 trainer.py:371] Steps/second: 0.090698, Examples/second: 37.150993
It's very strange. My experiment stopped displaying any info and also did not save any checkpoints.
In async mode I am seeing the same issue. It stops after, say, 14k steps and GPU utilization drops to zero. Memory usage stays the same. Unlike sync mode (previous post), I don't see any progress here. It runs normally after I kill it and restart.
In the async mode with two trainers for Librispeech960Wpm, I observed exactly the same phenomenon as datavizweb reported. The same step (33299) is checkpointed again and again.
I wonder if it is some kind of threading issue / race condition due to running the controller and the trainer in the same binary. Internally we always run the jobs as separate binaries and have never observed this problem. That is the only real difference I can think of.
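If you want to test that hypothesis, you could try launching the jobs as separate processes instead of one binary. A rough sketch, assuming trainer.py accepts a --job flag with the values shown below; please double-check the accepted job names against trainer.py (e.g. sync mode may expect trainer_client), and treat the model name and logdir as placeholders:

```bash
# Launch the controller and the trainer as two separate processes
# sharing the same --logdir (the --job values here are assumptions).
bazel-bin/lingvo/trainer --job=controller --run_locally=gpu --mode=sync \
  --model=asr.librispeech.Librispeech960Grapheme --logdir=/tmp/librispeech --logtostderr &
bazel-bin/lingvo/trainer --job=trainer_client --run_locally=gpu --mode=sync \
  --model=asr.librispeech.Librispeech960Grapheme --logdir=/tmp/librispeech --logtostderr
```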
@iamxiaoyubei Did you solve the problem? I'm running into the same issue.
I didn't solve it. I just restarted the run to deal with it. 😂
I solved the problem by setting: export TF_CUDNN_USE_AUTOTUNE=0
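For anyone else hitting this, a minimal sketch of how the workaround can be applied; the model name and logdir are placeholders, and the important part is that the variable is set in the same environment that launches the trainer:

```bash
# Disable cuDNN autotuning, then start the trainer from the same shell
# so the variable is inherited by the trainer process.
export TF_CUDNN_USE_AUTOTUNE=0
bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync \
  --model=asr.librispeech.Librispeech960Grapheme --logdir=/tmp/librispeech --logtostderr
```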