
Exits training

argideritzalpea opened this issue 6 years ago • 1 comment

I am attempting to run the Librispeech training example (only the 100-hour data, not the 360 or 500). Lingvo successfully begins training and gets through one or two epochs. Then training stalls with no error or warning printed to the screen, and the process exits, returning control to the command line (inside the Docker container).

Here is the output I get from the training: https://github.com/argideritzalpea/lingvo/blob/master/run.log

Is this suggestive of OOM? When I run nvidia-smi after this occurs, the command hangs and no information about the GPU appears.

I am running the following command on a single Tesla K80 on Google Cloud:

    bazel run -c opt --config=cuda //lingvo:trainer -- --logtostderr \
        --model=asr.librispeech.Librispeech960Grapheme --mode=sync \
        --logdir=/tmp/lingvo/log --saver_max_to_keep=2 \
        --run_locally=gpu 2>&1 |& tee run.log

Any ideas on how to debug this? Is there a setting I could tweak to fix it? In librispeech.py, I have halved p.bucket_batch_limit (to [48, 24, 24, 24, 24, 24, 24, 24]) and modified

def Train(self):
    p = self._CommonInputParams(is_eval=False)
    p.file_datasource.file_pattern = 'train/train.tfrecords-*'
    p.num_samples = 28539
    return p

to agree with the reduced number of samples in the Librispeech 100-hour split, as opposed to the full 960 hours.

argideritzalpea avatar Apr 23 '20 02:04 argideritzalpea

I think that's likely some kind of OOM, but I have no idea why it would just quit without printing any kind of error.

If it runs fine with --run_locally=cpu, then it's probably a GPU OOM.

You can try setting report_tensor_allocations_upon_oom=True in https://github.com/tensorflow/lingvo/blob/6138c9730a46d72015d608e24fb6c647dc4492d1/lingvo/trainer.py#L492

Example: https://github.com/tensorflow/tensorflow/issues/17076
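For reference, a minimal sketch of what that change might look like. This assumes the trainer's session.run call can accept a RunOptions proto (the exact call site is the trainer.py line linked above; the variable names here are illustrative, not Lingvo's actual code):

```python
import tensorflow as tf

# RunOptions that make TF report which tensors were allocated
# at the moment of an OOM, instead of failing without detail.
run_options = tf.compat.v1.RunOptions(
    report_tensor_allocations_upon_oom=True)

# Hypothetical usage inside the training loop:
#   sess.run(fetches, options=run_options)
```

With this set, an OOM should produce a per-tensor allocation dump in the log rather than a silent exit, which makes it much easier to see which op blew the budget.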

It's also possible that it's a CPU OOM due to the input pipeline. In addition to making the bucket_batch_limit even smaller, you can also try setting file_buffer_size=1 (default 10000) and file_parallelism=1 (default 16).
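Concretely, those input params can be set in the same Train() override you already modified. A sketch, assuming file_buffer_size and file_parallelism are settable directly on the input params object (defaults noted in comments):

```python
def Train(self):
    p = self._CommonInputParams(is_eval=False)
    p.file_datasource.file_pattern = 'train/train.tfrecords-*'
    p.num_samples = 28539
    # Shrink the input pipeline to reduce host (CPU) memory pressure:
    p.file_buffer_size = 1   # default 10000: records held in the shuffle buffer
    p.file_parallelism = 1   # default 16: files read concurrently
    return p
```

If training survives with these minimal settings, you can raise them back up incrementally to find a value that fits in host memory without starving the GPU of input.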

jonathanasdf avatar Apr 23 '20 04:04 jonathanasdf