
WARNING NaN or Inf found in input tensor.

Open kaiAksenov opened this issue 4 years ago • 15 comments

Hello, I have a question. I listened to the audio in my training set; it is not empty, and it sounds normal. Training ran fine with the old code, but now the following messages appear:

2021-11-30 17:25:35,809 DEBUG TRAIN Batch 0/4000 loss inf loss_att 78.135910 loss_ctc inf lr 0.00001905 rank 0
2021-11-30 17:25:56,021 WARNING NaN or Inf found in input tensor.
2021-11-30 17:26:13,986 WARNING NaN or Inf found in input tensor.
2021-11-30 17:26:14,325 WARNING NaN or Inf found in input tensor.
2021-11-30 17:26:15,178 WARNING NaN or Inf found in input tensor.
2021-11-30 17:26:15,568 WARNING NaN or Inf found in input tensor.
2021-11-30 17:26:17,369 WARNING NaN or Inf found in input tensor.
2021-11-30 17:26:18,239 WARNING NaN or Inf found in input tensor.
2021-11-30 17:26:33,053 DEBUG TRAIN Batch 0/4100 loss 156.104248 loss_att 146.877640 loss_ctc 177.633026 lr 0.00001953 rank 0
2021-11-30 17:27:41,611 WARNING NaN or Inf found in input tensor.
2021-11-30 17:27:44,383 WARNING NaN or Inf found in input tensor.
2021-11-30 17:28:23,927 WARNING NaN or Inf found in input tensor.
2021-11-30 17:28:30,782 DEBUG TRAIN Batch 0/4200 loss 90.697769 loss_att 89.033356 loss_ctc 94.581398 lr 0.00002000 rank 0
2021-11-30 17:29:08,341 WARNING NaN or Inf found in input tensor.
2021-11-30 17:29:34,919 DEBUG TRAIN Batch 0/4300 loss 193.539017 loss_att 188.552368 loss_ctc 205.174561 lr 0.00002048 rank 0
2021-11-30 17:30:41,968 WARNING NaN or Inf found in input tensor.
2021-11-30 17:30:46,109 WARNING NaN or Inf found in input tensor.
2021-11-30 17:30:51,037 WARNING NaN or Inf found in input tensor.
2021-11-30 17:31:33,866 DEBUG TRAIN Batch 0/4400 loss 155.064835 loss_att 150.905930 loss_ctc 164.768936 lr 0.00002096 rank 0

What may be the reason for this?

kaiAksenov avatar Nov 30 '21 10:11 kaiAksenov

We are not sure. I think you can just ignore the warning and continue the training. The final WER should be comparable to the old code.

robin1001 avatar Nov 30 '21 12:11 robin1001

I also encountered the same problem as you. In my case, I filtered out the audios that sounded normal but whose duration was less than 1 second, and training became normal again.

zelda3721 avatar Dec 07 '21 13:12 zelda3721
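The duration filter described above can be sketched in plain Python (function names are made up for illustration; this only handles uncompressed WAV files, using the standard-library `wave` module):

```python
import wave


def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()


def filter_short(paths, min_seconds=1.0):
    """Keep only files at least `min_seconds` long."""
    return [p for p in paths if wav_duration(p) >= min_seconds]
```

The surviving paths can then be used to rebuild the wav.scp / data.list for training.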

@kaiAksenov @zelda3721
did you add use_amp option for training?

iou2much avatar Dec 15 '21 08:12 iou2much

@zelda3721 did you add use_amp option for training?

kaiAksenov avatar Dec 15 '21 08:12 kaiAksenov

> @kaiAksenov @zelda3721 did you add the use_amp option for training?

@iou2much No, I didn't add this option.

kaiAksenov avatar Dec 15 '21 08:12 kaiAksenov

I am encountering the same error with the librispeech/s0 recipe (but using a custom dataset). I have tried filtering out segments shorter than 1 second from data.list, but the problem persists.

Besides the warning, I also get loss inf / loss_ctc inf:

2021-12-15 19:58:55,626 WARNING NaN or Inf found in input tensor.
2021-12-15 19:58:57,472 WARNING NaN or Inf found in input tensor.
2021-12-15 19:59:36,155 WARNING NaN or Inf found in input tensor.
2021-12-15 19:59:36,190 DEBUG TRAIN Batch 0/8000 loss inf loss_att 54.358368 loss_ctc inf lr 0.00128016 rank 0
2021-12-15 19:59:36,192 DEBUG TRAIN Batch 0/8000 loss inf loss_att 44.816650 loss_ctc inf lr 0.00128016 rank 1

etc

jwvl avatar Dec 15 '21 20:12 jwvl

FYI, in my case, after I removed the use_amp option, the NaN loss disappeared.

iou2much avatar Dec 16 '21 01:12 iou2much

> @kaiAksenov @zelda3721 did you add the use_amp option for training?

@iou2much No, I didn't add this option either.

zelda3721 avatar Dec 16 '21 05:12 zelda3721

> I am encountering the same error with the librispeech/s0 recipe (but using a custom dataset). I have tried filtering out segments shorter than 1 second from data.list, but the problem persists.
>
> Besides the warning, I also get loss inf / loss_ctc inf:
>
> 2021-12-15 19:58:55,626 WARNING NaN or Inf found in input tensor.
> 2021-12-15 19:58:57,472 WARNING NaN or Inf found in input tensor.
> 2021-12-15 19:59:36,155 WARNING NaN or Inf found in input tensor.
> 2021-12-15 19:59:36,190 DEBUG TRAIN Batch 0/8000 loss inf loss_att 54.358368 loss_ctc inf lr 0.00128016 rank 0
> 2021-12-15 19:59:36,192 DEBUG TRAIN Batch 0/8000 loss inf loss_att 44.816650 loss_ctc inf lr 0.00128016 rank 1
>
> etc.

You can try adding the use_amp option.

kaiAksenov avatar Dec 16 '21 09:12 kaiAksenov

> You can try adding the use_amp option.

I encountered the same problem and solved it by adding `split_with_space: true` under `dataset_conf` in conf/train_conformer.yaml. Check your dataset's labels: they should be split word by word on whitespace, but they are not processed that way if `split_with_space` is not configured. Hope it helps.

bigcash avatar Dec 17 '21 08:12 bigcash
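Based on the suggestion above, the relevant fragment of conf/train_conformer.yaml would look roughly like this (surrounding keys elided; the exact contents of `dataset_conf` depend on your recipe):

```yaml
dataset_conf:
  # tokenize transcripts on whitespace; without this,
  # labels are split per character
  split_with_space: true
```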

@bigcash This did the trick for me, too! Thanks for sharing this advice.

jwvl avatar Dec 20 '21 14:12 jwvl

I used your method and it solved that problem, but there is a new one: the loss stays around 80 and does not decrease further. How can I solve this?

rookie0607 avatar Jan 16 '22 17:01 rookie0607

@rookie0607 How many hours of training data do you have?

kaiAksenov avatar Jan 17 '22 03:01 kaiAksenov

@rookie0607 How many hours of training data do you have?

About 1,300 hours.

rookie0607 avatar Jan 17 '22 06:01 rookie0607

> Hello, I have a question. [...] What may be the reason for this?

It may be caused by infinite CTC losses. You can try setting zero_infinity=True when initializing CTCLoss. Hope it helps.

leeshion11 avatar Mar 15 '22 01:03 leeshion11
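The zero_infinity suggestion can be illustrated with a minimal PyTorch sketch (shapes are made up for illustration). When a target sequence is longer than the input can align to, the CTC loss is infinite; `zero_infinity=True` zeroes such losses (and their gradients) instead of propagating inf into training:

```python
import torch

# 10 target labels cannot be aligned to only 4 input frames,
# so the raw CTC loss for this sample is inf.
T, N, C = 4, 1, 5                       # input frames, batch, classes
log_probs = torch.randn(T, N, C).log_softmax(2)
targets = torch.randint(1, C, (N, 10))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

plain = torch.nn.CTCLoss(blank=0)
safe = torch.nn.CTCLoss(blank=0, zero_infinity=True)

assert torch.isinf(plain(log_probs, targets, input_lengths, target_lengths))
loss = safe(log_probs, targets, input_lengths, target_lengths)
# with zero_infinity=True the infinite loss is replaced by zero
```

Note this only masks the symptom; bad samples (e.g. transcripts longer than the acoustic frames allow, as with a missing `split_with_space`) are still worth fixing at the data level.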

I also encountered the same problem. I was adding some noise wavs under the <unk> label; when a wav is shorter than 0.5 seconds the warning appears, and the final model performs worse than a traditional Kaldi model in noisy scenes.

fclearner avatar Feb 16 '23 09:02 fclearner

Solved. If you use char (rather than BPE) as the modeling unit for an English dataset, remember to add split_with_space: true in the configuration.

xingchensong avatar Feb 21 '23 05:02 xingchensong