WARNING NaN or Inf found in input tensor.
Hello, I have a question. I listened to the audio in my training set; it is not empty and it sounds normal. Training ran fine with the old code, but now the following messages appear:
2021-11-30 17:25:35,809 DEBUG TRAIN Batch 0/4000 loss inf loss_att 78.135910 loss_ctc inf lr 0.00001905 rank 0
2021-11-30 17:25:56,021 WARNING NaN or Inf found in input tensor.
2021-11-30 17:26:13,986 WARNING NaN or Inf found in input tensor.
2021-11-30 17:26:14,325 WARNING NaN or Inf found in input tensor.
2021-11-30 17:26:15,178 WARNING NaN or Inf found in input tensor.
2021-11-30 17:26:15,568 WARNING NaN or Inf found in input tensor.
2021-11-30 17:26:17,369 WARNING NaN or Inf found in input tensor.
2021-11-30 17:26:18,239 WARNING NaN or Inf found in input tensor.
2021-11-30 17:26:33,053 DEBUG TRAIN Batch 0/4100 loss 156.104248 loss_att 146.877640 loss_ctc 177.633026 lr 0.00001953 rank 0
2021-11-30 17:27:41,611 WARNING NaN or Inf found in input tensor.
2021-11-30 17:27:44,383 WARNING NaN or Inf found in input tensor.
2021-11-30 17:28:23,927 WARNING NaN or Inf found in input tensor.
2021-11-30 17:28:30,782 DEBUG TRAIN Batch 0/4200 loss 90.697769 loss_att 89.033356 loss_ctc 94.581398 lr 0.00002000 rank 0
2021-11-30 17:29:08,341 WARNING NaN or Inf found in input tensor.
2021-11-30 17:29:34,919 DEBUG TRAIN Batch 0/4300 loss 193.539017 loss_att 188.552368 loss_ctc 205.174561 lr 0.00002048 rank 0
2021-11-30 17:30:41,968 WARNING NaN or Inf found in input tensor.
2021-11-30 17:30:46,109 WARNING NaN or Inf found in input tensor.
2021-11-30 17:30:51,037 WARNING NaN or Inf found in input tensor.
2021-11-30 17:31:33,866 DEBUG TRAIN Batch 0/4400 loss 155.064835 loss_att 150.905930 loss_ctc 164.768936 lr 0.00002096 rank 0
What may be the reason for this?
We are not sure. I think you can just ignore the warning and continue the training. The final WER should be comparable to the old code.
I also encountered the same problem as you. In my situation, I filtered out the audios that sounded normal but whose duration was less than 1 second, and training became normal again.
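For reference, a minimal sketch of such a duration filter, assuming each line of data.list is a JSON object with a "wav" path to a PCM WAV file (field names here are illustrative, not necessarily WeNet's exact schema):

```python
import json
import wave


def wav_duration(path):
    """Duration of a PCM WAV file in seconds, via the stdlib wave module."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def filter_short_utts(in_list, out_list, min_dur=1.0):
    """Copy data.list entries whose audio is at least min_dur seconds long.

    Returns the number of entries kept.
    """
    kept = 0
    with open(in_list) as fin, open(out_list, "w") as fout:
        for line in fin:
            entry = json.loads(line)
            if wav_duration(entry["wav"]) >= min_dur:
                fout.write(line)
                kept += 1
    return kept
```

For compressed formats (flac, opus) the wave module will not work; a library such as torchaudio or soundfile would be needed to read durations instead.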
@kaiAksenov @zelda3721 Did you add the use_amp option for training?
@iou2much No, I didn't add this option.
I am encountering the same error with the librispeech/s0 recipe (but using a custom dataset). I have tried filtering out segments shorter than 1 second from data.list, but the problem persists.
Besides the warning, I also get loss inf / loss_ctc inf:
2021-12-15 19:58:55,626 WARNING NaN or Inf found in input tensor.
2021-12-15 19:58:57,472 WARNING NaN or Inf found in input tensor.
2021-12-15 19:59:36,155 WARNING NaN or Inf found in input tensor.
2021-12-15 19:59:36,190 DEBUG TRAIN Batch 0/8000 loss inf loss_att 54.358368 loss_ctc inf lr 0.00128016 rank 0
2021-12-15 19:59:36,192 DEBUG TRAIN Batch 0/8000 loss inf loss_att 44.816650 loss_ctc inf lr 0.00128016 rank 1
etc.
FYI, in my case, after I disabled the use_amp option, the NaN losses disappeared.
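For context, use_amp turns on PyTorch automatic mixed precision. Below is a minimal sketch of the generic torch.cuda.amp pattern such a flag typically enables (not WeNet's actual training loop; the model and data here are placeholders). fp16 has a much smaller dynamic range than fp32, which is why AMP runs can hit inf/NaN that full-precision runs don't; the GradScaler skips optimizer steps whose gradients contain inf/NaN:

```python
import torch

model = torch.nn.Linear(8, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# AMP autocast needs a CUDA device; fall back to plain fp32 on CPU.
use_amp = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(4, 8)
y = torch.randint(0, 2, (4,))

# Forward pass runs selected ops in fp16 when autocast is enabled.
with torch.cuda.amp.autocast(enabled=use_amp):
    loss = torch.nn.functional.cross_entropy(model(x), y)

# GradScaler scales the loss up before backward to avoid fp16 gradient
# underflow, then unscales and skips the step if grads contain inf/NaN.
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```

With enabled=False the scaler and autocast are pass-throughs, so the same loop works for both fp32 and mixed-precision training.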
You can try adding the use_amp option.
I encountered the same problem and solved it this way: add "split_with_space: true" under "dataset_conf" in conf/train_conformer.yaml. Check your dataset's labels: they should be split word by word on spaces, but they are not processed that way if "split_with_space" is not configured. Hope it helps you.
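For clarity, the relevant fragment of conf/train_conformer.yaml would look like this (all other dataset_conf keys omitted here):

```yaml
dataset_conf:
  split_with_space: true
```

Without this key, a space-delimited transcript may be tokenized as one long string, producing label sequences that cannot align and hence inf CTC losses.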
@bigcash This did the trick for me, too! Thanks for sharing this advice.
I used your method and it solved that problem, but there is a new one: the loss plateaus around 80. How can I solve that?
@rookie0607 How many hours of training data do you have?
About 1.3k hours.
It may be caused by infinite CTC losses. You can try setting zero_infinity=True when initializing CTCLoss. Hope it helps you.
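A minimal illustration of that flag in plain PyTorch (not WeNet code): when the target sequence is longer than the input can align to, CTC loss is inf, and zero_infinity=True replaces that loss and its gradients with zero instead of letting it poison training:

```python
import torch

# zero_infinity=True zeroes losses that would otherwise be inf.
ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)

log_probs = torch.randn(5, 1, 4).log_softmax(-1)    # (T=5, N=1, C=4)
targets = torch.tensor([[1, 2, 3, 1, 2, 3, 1, 2]])  # S=8 labels
input_lengths = torch.tensor([5])
target_lengths = torch.tensor([8])

# Target (8) is longer than the input (5), so no alignment exists:
# without zero_infinity the loss would be inf; with it, the loss is 0.
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```

This masks the symptom rather than fixing the cause, so it is still worth checking why targets are too long for their inputs (e.g. the split_with_space issue above).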
I also encountered the same problem. I'm trying to add some noise wavs under the <unk> label; however, when a wav is shorter than 0.5 seconds, the warning appears, and the final model performs worse than a traditional Kaldi model in noisy scenes.
Solved. If you use char (rather than BPE) as the modeling unit for an English dataset, remember to add split_with_space: true in the configuration.