
CV loss oscillates near 140 from start

Open RuslanSel opened this issue 3 years ago • 3 comments

I ran a custom dataset with librispeech/s0/run.sh and conf/train_conformer.yaml, and the CV loss oscillates near 140 from the start without improvement. What could be the reason for this? Thanks in advance.

RuslanSel avatar Feb 11 '22 14:02 RuslanSel

  1. Check your data preparation.
  2. Try a smaller learning_rate (see the sketch below for where it is set in the yaml).
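
For reference, the learning rate is set under optim_conf in the training yaml, next to the warmup scheduler. A minimal sketch of that section, assuming typical recipe defaults (the exact values in your copy of conf/train_conformer.yaml may differ):

    optim: adam
    optim_conf:
        lr: 0.002            # try e.g. 0.001 or lower if the CV loss never moves
    scheduler: warmuplr
    scheduler_conf:
        warmup_steps: 25000  # a longer warmup can also stabilize early training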

robin1001 avatar Feb 12 '22 03:02 robin1001

I also encountered the same problem. Have you solved it?

rookie0607 avatar Feb 13 '22 07:02 rookie0607

I've tried lr 0.003, 0.001 and 0.0005, but without success.

RuslanSel avatar Feb 13 '22 08:02 RuslanSel

How about u2++_conformer?

xingchensong avatar Feb 21 '23 04:02 xingchensong

I met the same problem, too. On my 36k hours of Chinese + English + Mandarin-English code-switch data, I use u2++_conformer with this config:

    output_size: 256    # dimension of attention
    attention_heads: 4
    linear_units: 2048  # the number of units of position-wise feed forward

and lr: 0.001

I found the cv_loss was abnormal, so in the third epoch I decreased the lr to half (0.0005). The cv_loss is still 98 after 5 epochs in total! Here is the log:

2023-01-30 18:47:07,107 DEBUG TRAIN Batch 4/203700 loss 12.646465 loss_att 10.921429 loss_ctc 16.671551 lr 0.00032621 rank 0
2023-01-30 18:47:16,614 DEBUG TRAIN Batch 4/203700 loss 16.885244 loss_att 13.554604 loss_ctc 24.656742 lr 0.00032621 rank 1
2023-01-30 18:47:22,712 DEBUG TRAIN Batch 4/203700 loss 19.729301 loss_att 16.203592 loss_ctc 27.955961 lr 0.00032621 rank 2
2023-01-30 18:49:20,301 DEBUG TRAIN Batch 4/203800 loss 23.600426 loss_att 18.615641 loss_ctc 35.231586 lr 0.00032619 rank 2
2023-01-30 18:49:20,315 DEBUG TRAIN Batch 4/203800 loss 18.340179 loss_att 15.005867 loss_ctc 26.120243 lr 0.00032619 rank 1
2023-01-30 18:49:20,438 DEBUG TRAIN Batch 4/203800 loss 25.618004 loss_att 20.509174 loss_ctc 37.538609 lr 0.00032619 rank 0
2023-01-30 18:49:55,773 DEBUG CV Batch 4/0 loss 28.517015 loss_att 24.142754 loss_ctc 38.723625 history loss 27.957858 rank 2
2023-01-30 18:49:56,046 DEBUG CV Batch 4/0 loss 28.402842 loss_att 24.164579 loss_ctc 38.292122 history loss 27.845923 rank 0
2023-01-30 18:49:56,076 DEBUG CV Batch 4/0 loss 28.509480 loss_att 24.174435 loss_ctc 38.624588 history loss 27.950470 rank 3
2023-01-30 18:49:56,101 DEBUG CV Batch 4/0 loss 28.402861 loss_att 24.164606 loss_ctc 38.292122 history loss 27.845942 rank 1
2023-01-30 18:50:31,741 DEBUG CV Batch 4/100 loss 30.877792 loss_att 24.287479 loss_ctc 46.255188 history loss 80.243644 rank 3
2023-01-30 18:50:31,741 DEBUG CV Batch 4/100 loss 31.406693 loss_att 24.250772 loss_ctc 48.103836 history loss 79.827912 rank 0
2023-01-30 18:50:31,970 DEBUG CV Batch 4/100 loss 30.851536 loss_att 24.290764 loss_ctc 46.160007 history loss 79.778125 rank 2
2023-01-30 18:50:32,342 DEBUG CV Batch 4/100 loss 31.406673 loss_att 24.250732 loss_ctc 48.103867 history loss 79.008956 rank 1
2023-01-30 18:51:05,132 DEBUG CV Batch 4/200 loss 46.388836 loss_att 37.140503 loss_ctc 67.968277 history loss 89.078245 rank 3
2023-01-30 18:51:05,888 DEBUG CV Batch 4/200 loss 46.388817 loss_att 37.140484 loss_ctc 67.968262 history loss 88.649869 rank 0
2023-01-30 18:51:06,272 DEBUG CV Batch 4/200 loss 46.388779 loss_att 37.140423 loss_ctc 67.968269 history loss 88.688824 rank 2
2023-01-30 18:51:07,147 DEBUG CV Batch 4/200 loss 46.388771 loss_att 37.140415 loss_ctc 67.968262 history loss 88.296744 rank 1
2023-01-30 18:51:30,497 DEBUG CV Batch 4/300 loss 42.442574 loss_att 33.455048 loss_ctc 63.413464 history loss 83.552503 rank 3
2023-01-30 18:51:32,700 DEBUG CV Batch 4/300 loss 42.442581 loss_att 33.455055 loss_ctc 63.413464 history loss 83.255786 rank 0
2023-01-30 18:51:32,858 DEBUG CV Batch 4/300 loss 42.442566 loss_att 33.455044 loss_ctc 63.413460 history loss 83.288357 rank 2
2023-01-30 18:51:33,321 DEBUG CV Batch 4/300 loss 42.116375 loss_att 33.400871 loss_ctc 62.452549 history loss 83.019157 rank 1
2023-01-30 18:51:58,000 DEBUG CV Batch 4/400 loss 47.048725 loss_att 37.091206 loss_ctc 70.282936 history loss 81.040529 rank 3
2023-01-30 18:51:58,237 DEBUG CV Batch 4/400 loss 47.314903 loss_att 37.104877 loss_ctc 71.138306 history loss 80.788295 rank 2
2023-01-30 18:51:58,431 DEBUG CV Batch 4/400 loss 47.050179 loss_att 37.100052 loss_ctc 70.267143 history loss 80.756375 rank 0
2023-01-30 18:52:01,045 DEBUG CV Batch 4/400 loss 47.088219 loss_att 37.086498 loss_ctc 70.425568 history loss 80.567495 rank 1
2023-01-30 18:52:32,657 DEBUG CV Batch 4/500 loss 38.062759 loss_att 30.567739 loss_ctc 55.551136 history loss 83.286071 rank 3
2023-01-30 18:52:32,728 DEBUG CV Batch 4/500 loss 37.896431 loss_att 30.466656 loss_ctc 55.232578 history loss 83.022444 rank 0
2023-01-30 18:52:33,422 DEBUG CV Batch 4/500 loss 37.911812 loss_att 30.471275 loss_ctc 55.273056 history loss 83.058120 rank 2
2023-01-30 18:52:34,479 DEBUG CV Batch 4/500 loss 38.062759 loss_att 30.567730 loss_ctc 55.551151 history loss 82.813904 rank 1
2023-01-30 18:53:14,712 DEBUG CV Batch 4/600 loss 47.122044 loss_att 38.425232 loss_ctc 67.414604 history loss 88.885744 rank 3
2023-01-30 18:53:14,742 DEBUG CV Batch 4/600 loss 46.342560 loss_att 38.402050 loss_ctc 64.870415 history loss 88.531659 rank 2
2023-01-30 18:53:14,749 DEBUG CV Batch 4/600 loss 47.122055 loss_att 38.425243 loss_ctc 67.414604 history loss 88.823756 rank 0
2023-01-30 18:53:14,764 DEBUG CV Batch 4/600 loss 46.351166 loss_att 38.398983 loss_ctc 64.906250 history loss 88.422146 rank 1
2023-01-30 18:53:54,965 DEBUG CV Batch 4/700 loss 44.441872 loss_att 36.677708 loss_ctc 62.558262 history loss 92.181358 rank 3
2023-01-30 18:53:55,065 DEBUG CV Batch 4/700 loss 44.877800 loss_att 36.743958 loss_ctc 63.856770 history loss 92.239779 rank 0
2023-01-30 18:53:55,341 DEBUG CV Batch 4/700 loss 44.877766 loss_att 36.743904 loss_ctc 63.856777 history loss 92.046780 rank 1
2023-01-30 18:53:55,344 DEBUG CV Batch 4/700 loss 44.877800 loss_att 36.743954 loss_ctc 63.856770 history loss 92.113521 rank 2
2023-01-30 18:54:33,266 DEBUG CV Batch 4/800 loss 30.649742 loss_att 23.923923 loss_ctc 46.343323 history loss 92.472580 rank 0
2023-01-30 18:54:33,383 DEBUG CV Batch 4/800 loss 30.626713 loss_att 23.903809 loss_ctc 46.313492 history loss 92.436917 rank 3
2023-01-30 18:54:34,417 DEBUG CV Batch 4/800 loss 31.421110 loss_att 23.947063 loss_ctc 48.860554 history loss 92.407726 rank 2
2023-01-30 18:54:34,842 DEBUG CV Batch 4/800 loss 30.599287 loss_att 23.910414 loss_ctc 46.206657 history loss 92.288636 rank 1
2023-01-30 18:55:21,506 DEBUG CV Batch 4/900 loss 33.508255 loss_att 27.452938 loss_ctc 47.637329 history loss 98.493538 rank 3
2023-01-30 18:55:21,754 DEBUG CV Batch 4/900 loss 33.477356 loss_att 27.457464 loss_ctc 47.523773 history loss 98.568809 rank 0
2023-01-30 18:55:22,038 DEBUG CV Batch 4/900 loss 33.460239 loss_att 27.503674 loss_ctc 47.358894 history loss 98.430041 rank 1
2023-01-30 18:55:23,380 DEBUG CV Batch 4/900 loss 33.436291 loss_att 27.461983 loss_ctc 47.376347 history loss 98.415872 rank 2
2023-01-30 18:55:39,040 INFO Epoch 4 CV info cv_loss 98.36615292046831
2023-01-30 18:55:39,041 INFO Epoch 4 TRAIN info final lr 0.0003261806929873899
2023-01-30 18:55:39,041 INFO Checkpoint: save to checkpoint exp/conformer_wavaug1/4.pt
2023-01-30 18:55:39,092 INFO Epoch 4 CV info cv_loss 98.3194536657683
2023-01-30 18:55:39,093 INFO Epoch 4 TRAIN info final lr 0.0003262001287504334
2023-01-30 18:55:39,093 INFO Epoch 5 TRAIN info init lr 0.0003262001287504334
2023-01-30 18:55:39,097 INFO using accumulate grad, new batch size is 16 times larger than before
2023-01-30 18:55:39,544 INFO Epoch 4 CV info cv_loss 98.33389081785192
2023-01-30 18:55:39,545 INFO Epoch 4 TRAIN info final lr 0.0003261862457080438
2023-01-30 18:55:39,545 INFO Epoch 5 TRAIN info init lr 0.0003261862457080438
2023-01-30 18:55:39,549 INFO using accumulate grad, new batch size is 16 times larger than before
2023-01-30 18:55:39,867 INFO Epoch 4 CV info cv_loss 98.20686797192958
2023-01-30 18:55:39,867 INFO Epoch 4 TRAIN info final lr 0.0003261917987122867
2023-01-30 18:55:39,867 INFO Epoch 5 TRAIN info init lr 0.0003261917987122867
2023-01-30 18:55:39,869 INFO using accumulate grad, new batch size is 16 times larger than before
2023-01-30 18:55:41,034 INFO Epoch 5 TRAIN info init lr 0.0003261806929873899
2023-01-30 18:55:41,038 INFO using accumulate grad, new batch size is 16 times larger than before
2023-01-30 18:56:46,874 DEBUG TRAIN Batch 5/0 loss 17.460732 loss_att 14.531824 loss_ctc 24.294846 lr 0.00032619 rank 2
2023-01-30 18:56:46,879 DEBUG TRAIN Batch 5/0 loss 14.466322 loss_att 11.996610 loss_ctc 20.228985 lr 0.00032620 rank 3
2023-01-30 18:56:46,882 DEBUG TRAIN Batch 5/0 loss 18.419432 loss_att 15.551132 loss_ctc 25.112129 lr 0.00032618 rank 1
2023-01-30 18:56:46,882 DEBUG TRAIN Batch 5/0 loss 38.082077 loss_att 26.783554 loss_ctc 64.445297 lr 0.00032618 rank 0

Then I changed the model config to:

    output_size: 512    # dimension of attention
    attention_heads: 8
    linear_units: 2048  # the number of units of position-wise feed forward

and lr: 0.001. Then the model can be trained successfully.
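
For context, those three lines live inside encoder_conf in the training yaml. A minimal sketch of the change, assuming the remaining u2++ conformer encoder settings stay at the recipe defaults (field names follow the stock wenet configs; values here are only illustrative):

    encoder: conformer
    encoder_conf:
        output_size: 512     # dimension of attention (was 256)
        attention_heads: 8   # (was 4)
        linear_units: 2048   # position-wise feed-forward units, unchanged
        num_blocks: 12       # recipe default, unchanged
        # dropout rates, input_layer, cnn module settings, etc. kept at the recipe defaults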

I've tried to debug, but I still can't find the reason.

BTW, the dict I used has 6300 Chinese chars and 5700 English BPE units.

duj12 avatar Apr 06 '23 06:04 duj12

I found out the reason in my case: I had changed the default spec_aug_conf to:

    spec_aug_conf:
        num_t_mask: 2
        num_f_mask: 2
        max_t: 30
        max_f: 20

I had used this config in my previous experiments. It is OK with the bigger model, but the smaller model may not learn the features well because too much of the frequency axis is masked.
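
For comparison, the stock wenet recipes use a milder frequency mask. A reference sketch of the default-style spec_aug settings, assuming the usual recipe values (check your conf/train_*.yaml, they may differ):

    spec_aug: true
    spec_aug_conf:
        num_t_mask: 2
        num_f_mask: 2
        max_t: 50    # time-mask width
        max_f: 10    # frequency-mask width; 20 removes far more spectral detail

Reverting max_f toward the default (or scaling the masking with model size) is one way to confirm this.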

duj12 avatar Apr 06 '23 08:04 duj12

Hi, here is some additional information. When I use the smaller model config:

    output_size: 256    # dimension of attention
    attention_heads: 4
    linear_units: 2048  # the number of units of position-wise feed forward

with cmvn=false in run.sh, it seems that the u2++ model can't converge. When I then set cmvn=true, the smaller model converges normally. But the bigger model config:

output_size: 512    # dimension of attention
attention_heads: 8
linear_units: 2048  # the number of units of position-wise feed forward

can converge even without cmvn (cmvn=false).

duj12 avatar Apr 12 '23 06:04 duj12