Multilingual training: loss oscillates around 100
I combined data from four languages for training; the data sizes for the four languages are 15, 30, 100, and 98 hours respectively, and the dictionary has over ten thousand tokens. The loss dropped from 300 to 100 and then kept fluctuating around 100. Except for the language with just over ten hours of data, which I did not train individually, training each language on its own works normally. I thought the small-data language was the cause, but after removing that language the problem remained. What might be the cause?
2021-12-31 22:45:28,793 INFO using accumulate grad, new batch size is 1 times larger than before
2021-12-31 22:46:09,455 DEBUG TRAIN Batch 27/0 loss 6.248672 loss_att 5.813747 loss_ctc 7.263494 lr 0.00040742 rank 0
2021-12-31 22:46:09,466 DEBUG TRAIN Batch 27/0 loss 6.659250 loss_att 6.425966 loss_ctc 7.203578 lr 0.00040836 rank 2
2021-12-31 22:46:09,481 DEBUG TRAIN Batch 27/0 loss 7.473471 loss_att 6.575848 loss_ctc 9.567926 lr 0.00040791 rank 3
2021-12-31 22:46:09,528 DEBUG TRAIN Batch 27/0 loss 5.064744 loss_att 5.017971 loss_ctc 5.173881 lr 0.00040829 rank 1
2021-12-31 22:46:30,639 WARNING NaN or Inf found in input tensor.
2021-12-31 22:46:53,891 WARNING NaN or Inf found in input tensor.
2021-12-31 22:46:56,771 WARNING NaN or Inf found in input tensor.
2021-12-31 22:47:11,668 DEBUG TRAIN Batch 27/100 loss 102.163925 loss_att 94.791977 loss_ctc 119.365143 lr 0.00040371 rank 0
2021-12-31 22:47:11,669 DEBUG TRAIN Batch 27/100 loss 119.963242 loss_att 111.493553 loss_ctc 139.725861 lr 0.00040456 rank 1
2021-12-31 22:47:11,671 DEBUG TRAIN Batch 27/100 loss 105.903122 loss_att 95.042114 loss_ctc 131.245468 lr 0.00040463 rank 2
2021-12-31 22:47:11,716 DEBUG TRAIN Batch 27/100 loss 70.272079 loss_att 65.121017 loss_ctc 82.291222 lr 0.00040419 rank 3
2021-12-31 22:47:40,583 WARNING NaN or Inf found in input tensor.
2021-12-31 22:48:15,227 DEBUG TRAIN Batch 27/200 loss 180.245758 loss_att 169.502548 loss_ctc 205.313263 lr 0.00040057 rank 3
2021-12-31 22:48:15,231 DEBUG TRAIN Batch 27/200 loss 87.906181 loss_att 80.827728 loss_ctc 104.422562 lr 0.00040011 rank 0
2021-12-31 22:48:27,843 DEBUG CV Batch 27/0 loss 20.847515 loss_att 20.225395 loss_ctc 22.299129 history loss 20.013615 rank 0
2021-12-31 22:48:27,845 DEBUG CV Batch 27/0 loss 20.847515 loss_att 20.225395 loss_ctc 22.299129 history loss 20.013615 rank 3
2021-12-31 22:48:27,846 DEBUG CV Batch 27/0 loss 20.847515 loss_att 20.225395 loss_ctc 22.299129 history loss 20.013615 rank 2
2021-12-31 22:48:27,871 DEBUG CV Batch 27/0 loss 20.847515 loss_att 20.225395 loss_ctc 22.299129 history loss 20.013615 rank 1
2021-12-31 22:48:32,337 INFO Epoch 27 CV info cv_loss 99.84644016176958
2021-12-31 22:48:32,338 INFO Epoch 28 TRAIN info lr 0.0004005701054024485
2021-12-31 22:48:32,340 INFO using accumulate grad, new batch size is 1 times larger than before
2021-12-31 22:48:32,353 INFO Epoch 27 CV info cv_loss 99.84644016176958
2021-12-31 22:48:32,353 INFO Checkpoint: save to checkpoint exp/sp_spec_aug/27.pt
2021-12-31 22:48:32,400 INFO Epoch 27 CV info cv_loss 99.84644016176958
2021-12-31 22:48:32,400 INFO Epoch 28 TRAIN info lr 0.0004011785213699553
2021-12-31 22:48:32,402 INFO using accumulate grad, new batch size is 1 times larger than before
2021-12-31 22:48:32,472 INFO Epoch 27 CV info cv_loss 99.84644016176958
2021-12-31 22:48:32,473 INFO Epoch 28 TRAIN info lr 0.0004009634698823127
2021-12-31 22:48:32,476 INFO using accumulate grad, new batch size is 1 times larger than before
2021-12-31 22:48:32,875 INFO Epoch 28 TRAIN info lr 0.0004
2021-12-31 22:48:32,877 INFO using accumulate grad, new batch size is 1 times larger than before
2021-12-31 22:49:13,344 DEBUG TRAIN Batch 28/0 loss 2.557263 loss_att 2.285450 loss_ctc 3.191494 lr 0.00039996 rank 0
2021-12-31 22:49:13,368 DEBUG TRAIN Batch 28/0 loss 6.441474 loss_att 6.016290 loss_ctc 7.433571 lr 0.00040053 rank 3
2021-12-31 22:49:13,393 DEBUG TRAIN Batch 28/0 loss 6.693965 loss_att 6.355334 loss_ctc 7.484103 lr 0.00040114 rank 2
2021-12-31 22:49:13,401 DEBUG TRAIN Batch 28/0 loss 5.170708 loss_att 4.932330 loss_ctc 5.726925 lr 0.00040093 rank 1
2021-12-31 22:50:14,561 DEBUG TRAIN Batch 28/100 loss 84.658974 loss_att 77.700836 loss_ctc 100.894638 lr 0.00039701 rank 3
2021-12-31 22:50:14,563 DEBUG TRAIN Batch 28/100 loss 89.587128 loss_att 83.363190 loss_ctc 104.109634 lr 0.00039739 rank 1
2021-12-31 22:50:14,565 DEBUG TRAIN Batch 28/100 loss 95.653564 loss_att 87.251152 loss_ctc 115.259186 lr 0.00039760 rank 2
2021-12-31 22:50:14,566 DEBUG TRAIN Batch 28/100 loss 97.992355 loss_att 91.536926 loss_ctc 113.055038 lr 0.00039646 rank 0
2021-12-31 22:50:39,999 WARNING NaN or Inf found in input tensor.
2021-12-31 22:50:49,133 WARNING NaN or Inf found in input tensor.
2021-12-31 22:51:14,608 DEBUG TRAIN Batch 28/200 loss 144.108368 loss_att 133.475449 loss_ctc 168.918518 lr 0.00039395 rank 1
2021-12-31 22:51:25,658 DEBUG CV Batch 28/0 loss 18.946194 loss_att 18.939415 loss_ctc 18.962011 history loss 18.188346 rank 3
2021-12-31 22:51:25,660 DEBUG CV Batch 28/0 loss 18.946194 loss_att 18.939415 loss_ctc 18.962011 history loss 18.188346 rank 1
2021-12-31 22:51:25,672 DEBUG CV Batch 28/0 loss 18.946194 loss_att 18.939415 loss_ctc 18.962011 history loss 18.188346 rank 2
2021-12-31 22:51:25,683 DEBUG CV Batch 28/0 loss 18.946194 loss_att 18.939415 loss_ctc 18.962011 history loss 18.188346 rank 0
2021-12-31 22:51:30,206 INFO Epoch 28 CV info cv_loss 101.02439876984015
2021-12-31 22:51:30,207 INFO Epoch 29 TRAIN info lr 0.00039391929857916764
2021-12-31 22:51:30,209 INFO using accumulate grad, new batch size is 1 times larger than before
2021-12-31 22:51:30,218 INFO Epoch 28 CV info cv_loss 101.02439876984015
2021-12-31 22:51:30,219 INFO Epoch 29 TRAIN info lr 0.00039361402676449424
2021-12-31 22:51:30,219 INFO Epoch 28 CV info cv_loss 101.02439876984015
2021-12-31 22:51:30,219 INFO Epoch 29 TRAIN info lr 0.00039419124842072746
2021-12-31 22:51:30,221 INFO using accumulate grad, new batch size is 1 times larger than before
2021-12-31 22:51:30,222 INFO using accumulate grad, new batch size is 1 times larger than before
2021-12-31 22:51:30,256 INFO Epoch 28 CV info cv_loss 101.02439876984015
2021-12-31 22:51:30,257 INFO Checkpoint: save to checkpoint exp/sp_spec_aug/28.pt
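For context on the log above: the "NaN or Inf found in input tensor" warnings mean some batches produce non-finite values, after which the running loss can jump to the 100+ range seen here. A common defensive pattern in PyTorch training loops is to skip the optimizer step for such batches; the following is a minimal, generic sketch of that pattern (not WeNet's actual executor code, and `model(batch)` returning a scalar loss is an assumption for illustration):

```python
import torch


def train_step(model, batch, optimizer, max_norm=5.0):
    # Assumed: the forward pass returns a scalar loss tensor.
    loss = model(batch)
    if not torch.isfinite(loss):
        # Corresponds to the "NaN or Inf" warnings in the log: drop the
        # batch so one bad sample cannot poison the model weights.
        optimizer.zero_grad()
        return None
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    # Only apply the update when the clipped gradient norm is finite.
    if torch.isfinite(grad_norm):
        optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```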
I'm not sure. I think you can try adding the languages one by one to narrow down the problem.
Could it be that the amount of data for a particular language is too small? I also tried adding the languages one by one, but I ran into the same problem.
I got the same error when training with LibriSpeech data. I solved the problem by reducing min_output_input_ratio from 0.05 to 0.01. I think the samples being removed by the filter function were causing the error.
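For reference, a filter of this kind works on the ratio of label length to feature length; the sketch below illustrates the idea. Only min_output_input_ratio and the 0.05 → 0.01 change come from this thread; the function name, the max bound, and the argument names are hypothetical.

```python
def keep_sample(num_frames, num_tokens,
                min_output_input_ratio=0.01,   # lowered from 0.05 per the fix above
                max_output_input_ratio=1.0):   # hypothetical upper bound
    """Return True if the utterance passes the length-ratio filter.

    num_frames: length of the input feature sequence (e.g. fbank frames).
    num_tokens: length of the output token sequence.
    """
    if num_frames <= 0:
        return False
    ratio = num_tokens / num_frames
    return min_output_input_ratio <= ratio <= max_output_input_ratio
```

Lowering the minimum from 0.05 to 0.01 means fewer short-transcript utterances are discarded; the hypothesis above is that the discarded samples (or the batches left behind after filtering) were what triggered the NaN/Inf behavior.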
@chiendb97 How many languages did you train with? How many tokens are in your dictionary?
@kaiAksenov I trained with the LibriSpeech dataset, and the dictionary size is 5001.
Has your problem been solved?
This issue has been automatically closed due to inactivity.