
Full-librispeech training

Open danpovey opened this issue 4 years ago • 27 comments

Guys, I ran the current setup on the full librispeech data for 3 epochs and this issue is mostly just an FYI so you can see what I got. I am thinking perhaps we can start doing experiments on full-libri for just 2 or 3 epochs, since it won't take much longer than 10 epochs on the smaller data and the results are (a) quite a bit better, especially on test-other and (b) perhaps more indicative of what we'd get in larger setups.

I think our test-clean errors are probably dominated by language modeling issues, which may explain why the improvement is only 1.5% absolute, vs. 6% absolute on test-other.

Depends what you guys think... we should probably agree on one setup that we can mostly use, for consistency.

2021-04-02 09:06:11,165 INFO [common.py:158] Save checkpoint to exp-conformer-noam-mmi-att-musan-sa-full/epoch-0.pt: epoch=0, learning_rate=0, objf=0.35558393759552065, valid_objf=0.15738312575246463
2021-04-02 19:14:51,307 INFO [common.py:158] Save checkpoint to exp-conformer-noam-mmi-att-musan-sa-full/epoch-1.pt: epoch=1, learning_rate=0.00042936316818072904, objf=0.18261609444931104, valid_objf=0.14119687143085302
2021-04-03 05:22:19,287 INFO [common.py:158] Save checkpoint to exp-conformer-noam-mmi-att-musan-sa-full/epoch-2.pt: epoch=2, learning_rate=0.0003036056078123336, objf=0.16412144019593852, valid_objf=0.13489163773498417

Decoding with python3 mmi_att_transformer_decode.py --epoch=3 --avg=1

2021-04-03 14:07:21,184 INFO [common.py:356] [test-clean] %WER 5.53% [2910 / 52576, 301 ins, 287 del, 2322 sub ]
2021-04-03 14:09:29,506 INFO [common.py:356] [test-other] %WER 11.79% [6170 / 52343, 636 ins, 600 del, 4934 sub ]

danpovey avatar Apr 03 '21 07:04 danpovey

Nice! I think there are two more easy wins: using tglarge for decoding (I think we’re using tgmed currently) and saving checkpoints more frequently than per epoch so we can also benefit from averaging here.
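
For anyone following along, averaging here just means element-wise averaging of the parameter tensors across several saved checkpoints before decoding. A minimal sketch, assuming the weights sit under a "state_dict" key and using illustrative file names (not necessarily snowfall's actual layout):

```python
import torch

def average_checkpoints(paths):
    """Element-wise average of model parameters across checkpoint files (sketch)."""
    avg = None
    for path in paths:
        # Assumption: each checkpoint stores the model weights under "state_dict".
        state = torch.load(path, map_location="cpu")["state_dict"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# e.g. average the last three saved checkpoints before decoding
averaged = average_checkpoints([f"exp/epoch-{i}.pt" for i in (7, 8, 9)])
```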

pzelasko avatar Apr 03 '21 12:04 pzelasko

Mm, yes. Fangjun is already working on LM rescoring with the 4-gram model, and I am currently working on GPU intersection code that handles that case. Perhaps there could be a checkpoints-per-epoch parameter?
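
Such an option could hook into the training loop roughly like this; a sketch only, with hypothetical names (checkpoints_per_epoch, save_checkpoint) rather than existing snowfall code:

```python
def train_one_epoch(model, optimizer, dataloader, epoch, exp_dir,
                    save_checkpoint, checkpoints_per_epoch=4):
    """Sketch: save several intermediate checkpoints within each epoch."""
    save_every = max(1, len(dataloader) // checkpoints_per_epoch)
    for batch_idx, batch in enumerate(dataloader):
        # ... forward / backward / optimizer step as in the existing loop ...
        if (batch_idx + 1) % save_every == 0:
            save_checkpoint(f"{exp_dir}/epoch-{epoch}-step-{batch_idx + 1}.pt",
                            model, optimizer, epoch)
```

Averaging could then draw on these intermediate checkpoints as well as the per-epoch ones.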

danpovey avatar Apr 03 '21 13:04 danpovey

The data augmentation setup probably needs some tuning. I ran the full libri recipe as-is, and got:

Epoch 3:
2021-04-07 11:42:46,500 INFO [common.py:357] [test-clean] %WER 5.53% [2909 / 52576, 301 ins, 262 del, 2346 sub ]
2021-04-07 11:44:30,129 INFO [common.py:357] [test-other] %WER 11.60% [6074 / 52343, 640 ins, 615 del, 4819 sub ]

(also the average of epochs 2 and 3 yields improvement: 5.37% and 11.13%)

Then, I also ran it without any data augmentation (I dropped the speed-perturbed cuts, removed MUSAN and SpecAug, and increased training time to 9 epochs, as each epoch is now 3x smaller, so the network sees the same number of hours).

Libri 960h no aug

Epoch 1:
2021-04-07 18:11:28,655 INFO [common.py:357] [test-clean] %WER 8.68% [4564 / 52576, 493 ins, 514 del, 3557 sub ]
2021-04-07 18:13:38,786 INFO [common.py:357] [test-other] %WER 18.84% [9863 / 52343, 852 ins, 1208 del, 7803 sub ]
Epoch 2:
2021-04-07 20:49:00,742 INFO [common.py:356] [test-clean] %WER 7.02% [3693 / 52576, 443 ins, 371 del, 2879 sub ]
2021-04-07 20:51:00,451 INFO [common.py:356] [test-other] %WER 15.77% [8253 / 52343, 905 ins, 747 del, 6601 sub ]
Epoch 3:
2021-04-07 22:07:01,294 INFO [common.py:356] [test-clean] %WER 6.33% [3326 / 52576, 357 ins, 325 del, 2644 sub ]
2021-04-07 22:08:15,296 INFO [common.py:356] [test-other] %WER 14.49% [7582 / 52343, 758 ins, 796 del, 6028 sub ]
Epoch 4:
2021-04-08 08:01:33,173 INFO [common.py:356] [test-clean] %WER 6.10% [3208 / 52576, 376 ins, 297 del, 2535 sub ]
2021-04-08 08:03:48,852 INFO [common.py:356] [test-other] %WER 13.78% [7213 / 52343, 886 ins, 618 del, 5709 sub ]
Epoch 5:
2021-04-08 08:05:21,075 INFO [common.py:356] [test-clean] %WER 5.92% [3114 / 52576, 356 ins, 265 del, 2493 sub ]
2021-04-08 08:06:29,545 INFO [common.py:356] [test-other] %WER 13.54% [7086 / 52343, 783 ins, 644 del, 5659 sub ]
Epoch 6:
2021-04-08 08:14:20,788 INFO [common.py:356] [test-clean] %WER 5.50% [2891 / 52576, 328 ins, 254 del, 2309 sub ]
2021-04-08 08:15:28,709 INFO [common.py:356] [test-other] %WER 12.85% [6726 / 52343, 784 ins, 600 del, 5342 sub ]
Epoch 7:
2021-04-08 11:37:02,413 INFO [common.py:356] [test-clean] %WER 5.62% [2956 / 52576, 312 ins, 270 del, 2374 sub ]
2021-04-08 11:38:46,792 INFO [common.py:356] [test-other] %WER 12.93% [6770 / 52343, 710 ins, 632 del, 5428 sub ]
Epoch 8:
2021-04-08 15:30:49,968 INFO [common.py:356] [test-clean] %WER 5.61% [2948 / 52576, 330 ins, 283 del, 2335 sub ]
2021-04-08 15:31:59,328 INFO [common.py:356] [test-other] %WER 12.78% [6692 / 52343, 766 ins, 584 del, 5342 sub ]
Epoch 9:
2021-04-09 08:50:07,223 INFO [common.py:356] [test-clean] %WER 5.60% [2946 / 52576, 321 ins, 309 del, 2316 sub ]
2021-04-09 08:51:54,651 INFO [common.py:356] [test-other] %WER 12.59% [6592 / 52343, 705 ins, 616 del, 5271 sub ]


Average (last 4 epochs):
2021-04-09 08:49:39,470 INFO [common.py:356] [test-clean] %WER 5.25% [2762 / 52576, 318 ins, 260 del, 2184 sub ]
2021-04-09 08:50:27,926 INFO [common.py:356] [test-other] %WER 11.72% [6136 / 52343, 708 ins, 535 del, 4893 sub ]

Such a small difference doesn't seem right, does it?

pzelasko avatar Apr 09 '21 12:04 pzelasko

It seems plausible to me. It could be that we'd only see improvements after more epochs of training. BTW, if it isn't already supported, can you add an option to randomize the position of cuts in minibatches? I mean, so the silence isn't justified to the right but is allocated randomly? The motivation is that when we use subsampling on the output of the model, it isn't invariant to shifts modulo the subsampling factor (e.g. modulo 4), so the random shift acts a bit like data augmentation.
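
To make the idea concrete, here is a minimal sketch of collating a batch with a random cut offset; plain NumPy for illustration, not Lhotse's actual collation code, and in practice the supervision frame offsets would have to be shifted by the same amount:

```python
import numpy as np

def collate_with_random_offset(features, rng=None):
    """Pad each (num_frames, num_feats) matrix to the batch's max length,
    placing it at a random offset instead of left-justifying it."""
    rng = rng or np.random.default_rng()
    max_len = max(f.shape[0] for f in features)
    batch = np.zeros((len(features), max_len, features[0].shape[1]), dtype=np.float32)
    for i, f in enumerate(features):
        offset = rng.integers(0, max_len - f.shape[0] + 1)
        batch[i, offset:offset + f.shape[0]] = f
    return batch

# three utterances of different lengths, 80-dim fbank features
batch = collate_with_random_offset([np.zeros((n, 80), np.float32) for n in (95, 100, 87)])
```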

danpovey avatar Apr 10 '21 03:04 danpovey

I think we have already implemented "both"-sides padding, which would center the cuts. It'd look something like this (except the cuts would be concatenated first, with data augmentation applied):

[image: illustration of cuts padded on both sides (centered) within a minibatch]

Does it make sense? I can make it the default behaviour.

pzelasko avatar Apr 12 '21 16:04 pzelasko

OK... as long as the way the cuts are grouped into minibatches is random so the total length of the sequence is different (for a given cut) from epoch to epoch, that should have the same effect as randomization.

danpovey avatar Apr 12 '21 16:04 danpovey

There is also a transform called ExtraPadding that adds a fixed number N of padding frames to the cut (N/2 on each side); I can extend it so that it is randomized.
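
Randomizing it could be as simple as drawing the left/right split of the N frames per cut rather than always using N/2; a tiny sketch of that idea, not the actual ExtraPadding code:

```python
import random

def random_extra_padding_split(num_extra_frames):
    """Split N extra padding frames into a random (left, right) pair."""
    left = random.randint(0, num_extra_frames)  # inclusive on both ends
    return left, num_extra_frames - left

print(random_extra_padding_split(16))  # e.g. (5, 11)
```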

pzelasko avatar Apr 12 '21 16:04 pzelasko

FYI, if we train full Libri with the current augmentation setup for 10 epochs on 4 x 32GB V100 with 4x the learning rate, average the last 3 epochs, and use default rescoring settings (4-gram LM with lattice beam 8), we get:

2021-04-20 10:56:22,431 INFO [common.py:373] [test-clean] %WER 4.18% [2200 / 52576, 378 ins, 113 del, 1709 sub ]
2021-04-20 11:02:39,051 INFO [common.py:373] [test-other] %WER 8.54% [4471 / 52343, 733 ins, 243 del, 3495 sub ]
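
For reference, the decode invocation for this setting would presumably look something like:

python3 mmi_att_transformer_decode.py --epoch=10 --avg=3 --use-lm-rescoring=1

(only --epoch, --avg and --use-lm-rescoring appear elsewhere in this thread; the exact epoch index depends on the 0- vs 1-based counting discussed further down, and the lattice beam is left at its default here).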

pzelasko avatar Apr 20 '21 15:04 pzelasko

FYI, if we train full Libri with the current augmentation setup for 10 epochs on 4 x 32GB V100 with 4x the learning rate, average the last 3 epochs, and use default rescoring settings (4-gram LM with lattice beam 8), we get:


2021-04-20 10:56:22,431 INFO [common.py:373] [test-clean] %WER 4.18% [2200 / 52576, 378 ins, 113 del, 1709 sub ]

2021-04-20 11:02:39,051 INFO [common.py:373] [test-other] %WER 8.54% [4471 / 52343, 733 ins, 243 del, 3495 sub ]

Wow!

csukuangfj avatar Apr 20 '21 15:04 csukuangfj

Actually, I just realized it will be interesting for you to see more info; here's a WER breakdown by epoch, with numbers for different averaging settings and with/without rescoring. You can notice that the WER at epoch 3 is worse than when training on 1 GPU, which is likely explained by the 4x fewer optimizer steps due to using 4 GPUs (partially, but not fully, counteracted by the linearly increased LR).
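
A back-of-the-envelope illustration of the step count; all numbers below are hypothetical, just to show the 4x effect:

```python
# Hypothetical numbers, only to illustrate why 4 GPUs give ~4x fewer optimizer steps per epoch.
hours_per_epoch = 960 * 3          # full Libri with 3x speed perturbation
avg_cut_seconds = 12               # assumed average cut length
cuts_per_epoch = hours_per_epoch * 3600 / avg_cut_seconds
cuts_per_gpu_batch = 20            # assumed per-GPU batch size (kept fixed)
for num_gpus in (1, 4):
    steps = cuts_per_epoch / (cuts_per_gpu_batch * num_gpus)
    print(f"{num_gpus} GPU(s): ~{steps:,.0f} optimizer steps per epoch")
# Scaling the LR by 4x only partially compensates for taking 4x fewer steps.
```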

Epoch 1:
2021-04-18 12:01:52,735 INFO [common.py:364] [test-clean] %WER 8.08% [4249 / 52576, 557 ins, 312 del, 3380 sub ]
2021-04-18 12:02:48,774 INFO [common.py:364] [test-other] %WER 17.64% [9231 / 52343, 1148 ins, 846 del, 7237 sub ]
Epoch 2:
2021-04-18 15:17:08,612 INFO [common.py:364] [test-clean] %WER 6.33% [3329 / 52576, 395 ins, 300 del, 2634 sub ]
2021-04-18 15:18:06,073 INFO [common.py:364] [test-other] %WER 13.48% [7056 / 52343, 734 ins, 736 del, 5586 sub ]
Epoch 3:
2021-04-18 15:19:37,956 INFO [common.py:364] [test-clean] %WER 5.82% [3060 / 52576, 357 ins, 289 del, 2414 sub ]
2021-04-18 15:20:31,606 INFO [common.py:364] [test-other] %WER 12.35% [6462 / 52343, 757 ins, 586 del, 5119 sub ]
Epoch 4:
2021-04-19 10:12:39,576 INFO [common.py:364] [test-clean] %WER 5.46% [2872 / 52576, 327 ins, 247 del, 2298 sub ]
2021-04-19 10:14:33,516 INFO [common.py:364] [test-other] %WER 11.81% [6181 / 52343, 712 ins, 588 del, 4881 sub ]
Epoch 5:
2021-04-19 10:16:39,978 INFO [common.py:364] [test-clean] %WER 5.48% [2879 / 52576, 347 ins, 243 del, 2289 sub ]
2021-04-19 10:17:48,955 INFO [common.py:364] [test-other] %WER 11.35% [5943 / 52343, 727 ins, 523 del, 4693 sub ]
Epoch 6:
2021-04-19 10:19:19,325 INFO [common.py:364] [test-clean] %WER 5.14% [2703 / 52576, 290 ins, 253 del, 2160 sub ]
2021-04-19 10:20:28,177 INFO [common.py:364] [test-other] %WER 10.82% [5661 / 52343, 622 ins, 549 del, 4490 sub ]
Epoch 7:
2021-04-19 14:58:01,037 INFO [common.py:364] [test-clean] %WER 5.15% [2706 / 52576, 300 ins, 244 del, 2162 sub ]
2021-04-19 14:59:10,205 INFO [common.py:364] [test-other] %WER 10.85% [5678 / 52343, 662 ins, 528 del, 4488 sub ]
Epoch 8:
2021-04-19 16:21:46,197 INFO [common.py:364] [test-clean] %WER 4.99% [2626 / 52576, 318 ins, 211 del, 2097 sub ]
2021-04-19 16:22:52,261 INFO [common.py:364] [test-other] %WER 10.50% [5497 / 52343, 614 ins, 481 del, 4402 sub ]
Epoch 9:
2021-04-20 09:15:25,490 INFO [common.py:364] [test-clean] %WER 4.96% [2606 / 52576, 289 ins, 213 del, 2104 sub ]
2021-04-20 09:16:20,003 INFO [common.py:364] [test-other] %WER 10.49% [5492 / 52343, 646 ins, 485 del, 4361 sub ]
Epoch 10:
2021-04-20 09:17:33,124 INFO [common.py:364] [test-clean] %WER 5.14% [2702 / 52576, 287 ins, 257 del, 2158 sub ]
2021-04-20 09:18:25,427 INFO [common.py:364] [test-other] %WER 10.60% [5548 / 52343, 652 ins, 453 del, 4443 sub ]


Average (epochs 4, 5, 6):
2021-04-19 10:12:16,678 INFO [common.py:364] [test-clean] %WER 5.02% [2641 / 52576, 308 ins, 235 del, 2098 sub ]
2021-04-19 10:13:28,620 INFO [common.py:364] [test-other] %WER 10.16% [5319 / 52343, 623 ins, 463 del, 4233 sub ]

Average (epochs 8, 9, 10):
2021-04-20 09:17:24,622 INFO [common.py:364] [test-clean] %WER 4.71% [2477 / 52576, 291 ins, 201 del, 1985 sub ]
2021-04-20 09:19:01,079 INFO [common.py:364] [test-other] %WER 9.65% [5052 / 52343, 604 ins, 422 del, 4026 sub ]

Average (epochs 8, 9, 10) with rescoring:
2021-04-20 10:56:22,431 INFO [common.py:373] [test-clean] %WER 4.18% [2200 / 52576, 378 ins, 113 del, 1709 sub ]
2021-04-20 11:02:39,051 INFO [common.py:373] [test-other] %WER 8.54% [4471 / 52343, 733 ins, 243 del, 3495 sub ]

pzelasko avatar Apr 20 '21 15:04 pzelasko

Also, I think 1 epoch takes only 2-2.5 hours in this setting (I don't know exactly because I still haven't figured out how to fix the logging other than with print, which doesn't show timestamps 😅)

pzelasko avatar Apr 20 '21 15:04 pzelasko

Could you post the results with LM rescoring disabled? It's using the whole lattice for rescoring by default.

Passing --use-lm-rescoring=0 on the command line disables LM rescoring. I just want to know what role LM rescoring plays here.

csukuangfj avatar Apr 20 '21 15:04 csukuangfj

I think I did it concurrently with your question -- only the last result in my previous message has rescoring turned on.

pzelasko avatar Apr 20 '21 15:04 pzelasko

I think I did it concurrently with your question -- only the last result in my previous message has rescoring turned on.

Thanks! GitHub didn't show the results when I commented.

csukuangfj avatar Apr 20 '21 22:04 csukuangfj

Also, I think 1 epoch takes only 2-2.5 hours in this setting (I don't know exactly because I still haven't figured out how to fix the logging other than with print, which doesn't show timestamps 😅)

Are there tensorboard logs in your case? Those contain timestamps.

csukuangfj avatar Apr 20 '21 22:04 csukuangfj

Also, I think 1 epoch takes only 2-2.5 hours in this setting (I don't know exactly because I still haven't figured out how to fix the logging other than with print, which doesn't show timestamps 😅)

Are there tensorboard logs in your case? Those contain timestamps.

Good point. It's ~1h45min per epoch. I think it can still get a bit better if we decay the LR faster; it was still quite high (~1.1e-3) at the end of training.

pzelasko avatar Apr 21 '21 00:04 pzelasko

FYI, if we train full Libri with the current augmentation setup for 10 epochs on 4 x 32GB V100 with 4x the learning rate, average the last 3 epochs, and use default rescoring settings (4-gram LM with lattice beam 8), we get:

2021-04-20 10:56:22,431 INFO [common.py:373] [test-clean] %WER 4.18% [2200 / 52576, 378 ins, 113 del, 1709 sub ]
2021-04-20 11:02:39,051 INFO [common.py:373] [test-other] %WER 8.54% [4471 / 52343, 733 ins, 243 del, 3495 sub ]

@pzelasko Is it possible to share this trained model? I want to do n-best rescoring with a transformer LM on it. Previous experiments with transformer-LM n-best rescoring already got a lower WER than the 4-gram LM, using the AM trained by Fangjun. Results are in the first comment of this conversation.

glynpu avatar Apr 21 '21 02:04 glynpu

@glynpu sure! You should be able to download it here: https://livejohnshopkins-my.sharepoint.com/:f:/g/personal/pzelask2_jh_edu/EjpFSUZ1WXlItIWlf-YemmIBTbNkbA3fovl_kZv0tQFupw?e=JZHh6x

LMK if that doesn't work.

pzelasko avatar Apr 21 '21 14:04 pzelasko

@pzelasko Thanks for sharing! Why is best_model.pt (128 MB) so much smaller than the others (384 MB)?

[image: screenshot of the shared folder listing best_model.pt at 128 MB and the epoch checkpoints at 384 MB]

glynpu avatar Apr 22 '21 02:04 glynpu

I think "best_model" doesn't store the optimizer, scheduler, etc. state dicts needed for resuming training. Also, it is not necessarily the best model, since it's picked based on dev loss and not on WER (and it is not averaged).
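
A rough sketch of why the sizes differ, assuming a typical PyTorch checkpoint layout (the exact keys in snowfall's checkpoints may differ):

```python
import torch

ckpt = torch.load("epoch-8.pt", map_location="cpu")
print(list(ckpt.keys()))
# A full training checkpoint typically stores, besides the model weights (~128 MB here),
# the optimizer state -- Adam-style optimizers keep two extra tensors per parameter,
# i.e. roughly 2x the model size -- plus scheduler state, epoch, learning rate, etc.
# That is roughly why epoch-*.pt (~384 MB) is about 3x the weights-only best_model.pt.
```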

pzelasko avatar Apr 22 '21 03:04 pzelasko

Thanks. So I should average the models from epochs 8, 9, and 10 to reproduce your best result. But epoch-{9,10}.pt don't seem to exist in the shared folder.

glynpu avatar Apr 22 '21 04:04 glynpu

@glynpu

@pzelasko counts from 1, not from 0. So you should use epoch-{7,8,9}.pt

csukuangfj avatar Apr 22 '21 04:04 csukuangfj

@pzelasko Could you please check the shared folder? Some models are missing; there are only epoch-{0,1,4,5,8}.pt. https://livejohnshopkins-my.sharepoint.com/:f:/g/personal/pzelask2_jh_edu/EjpFSUZ1WXlItIWlf-YemmIBTbNkbA3fovl_kZv0tQFupw?e=JZHh6x

glynpu avatar Apr 22 '21 04:04 glynpu

That's weird. Something went wrong when uploading. I'm pushing the missing files, you can expect them to be there in the next hour.

pzelasko avatar Apr 22 '21 14:04 pzelasko

@glynpu

@pzelasko counts from 1, not from 0. So you should use epoch-{7,8,9}.pt

We'll probably need to make the indexing consistent; different parts of the code base count from 0, others from 1...

pzelasko avatar Apr 22 '21 14:04 pzelasko

That's weird. Something went wrong when uploading. I'm pushing the missing files, you can expect them to be there in the next hour.

Thanks! I got the models; the results of transformer-LM n-best rescoring are:

| rescore LM | epoch | num_paths | token ppl | word ppl | test-clean | test-other |
| --- | --- | --- | --- | --- | --- | --- |
| baseline, no rescore (Piotr's AM with full librispeech) | * | * | * | * | 4.71 | 9.66 |
| 4-gram LM n-best rescore (Piotr's AM with full librispeech) | * | 100 | * | * | 4.38 | 9.18 |
| 4-gram LM lattice rescore | * | * | * | * | 4.18 | 8.54 |
| transformer LM, layers: 16 (model_size: 72M), max_norm=5 | 9 | 100 | 45.02 | 115.24 | 3.61 | 8.29 |

glynpu avatar Apr 23 '21 07:04 glynpu

Fantastic -- thanks! Can you try much less weight decay in the transformer setup? I noticed it's 0.001, which IMO is too high. And use the Noam optimizer if you weren't already.
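
For reference, the Noam schedule (from "Attention Is All You Need") warms the learning rate up linearly and then decays it with the inverse square root of the step; a minimal sketch, with d_model and warm_step as example values rather than the exact snowfall defaults:

```python
def noam_lr(step, d_model=256, warm_step=25000, factor=1.0):
    """Learning rate at a given optimizer step under the Noam schedule."""
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warm_step ** -1.5)

print(noam_lr(1_000), noam_lr(25_000), noam_lr(100_000))  # warm-up, peak, decay
```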

danpovey avatar Apr 23 '21 15:04 danpovey