MMI Training with Librispeech recipe doesn't converge with or without alignment
Hi, thanks as always for your research.
I've been trying to train conformer_mmi with the librispeech recipe at the current commit (e334e570d838cbe15188201f6bd47c009b9292be).
The only thing I changed in conformer_mmi/train.py is the dataset-preparation code, to match the current lhotse package; I used the same dataset-loading code as in conformer_ctc/train.py.
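For concreteness, the loading code I used looks roughly like this. It is only a sketch: the method names follow the recipe's LibriSpeechAsrDataModule in asr_datamodule.py, and `args`/`params` come from the script's argument parsing, so details may differ from my actual change.

```python
# Sketch of the dataset-loading code copied from conformer_ctc/train.py
# (illustrative; `args` / `params` are set up earlier in train.py).
from asr_datamodule import LibriSpeechAsrDataModule

librispeech = LibriSpeechAsrDataModule(args)

# Build the training cuts (optionally the full 960h set) and dataloader.
train_cuts = librispeech.train_clean_100_cuts()
if params.full_libri:
    train_cuts += librispeech.train_clean_360_cuts()
    train_cuts += librispeech.train_other_500_cuts()
train_dl = librispeech.train_dataloaders(train_cuts)

# Validation cuts and dataloader.
valid_cuts = librispeech.dev_clean_cuts()
valid_cuts += librispeech.dev_other_cuts()
valid_dl = librispeech.valid_dataloaders(valid_cuts)
```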
Training seemed to start fine, but at a certain point the training loss began oscillating up and down and never converged, so I tried using alignments from the conformer_ctc librispeech model. That did not work well either. I wondered whether the default lr_factor (5.0) was too high, so I decreased lr_factor to 1.0, but the loss still did not converge.
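For context on the lr_factor experiment: these recipes use a Noam-style schedule in which lr_factor scales the whole learning-rate curve. A minimal sketch, assuming the Noam optimizer from conformer_ctc/transformer.py and illustrative values for the model size and warm-up step:

```python
# Noam-style learning rate, as in conformer_ctc/transformer.py:
#   lr(step) = factor * model_size**-0.5 * min(step**-0.5, step * warmup**-1.5)
def noam_lr(step: int, factor: float, model_size: int = 512,
            warmup: int = 80000) -> float:
    # model_size / warmup defaults here are illustrative, not the recipe's exact values.
    return factor * model_size ** (-0.5) * min(
        step ** (-0.5), step * warmup ** (-1.5)
    )

# Lowering --lr-factor from 5.0 to 1.0 scales the peak LR down by 5x:
print(noam_lr(80000, factor=5.0))  # ~7.8e-4
print(noam_lr(80000, factor=1.0))  # ~1.6e-4
```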
(Note that conformer_ctc training was successful at the current commit, so I used conformer_ctc/ali.py to create ali_dir. However, the alignment output format from conformer_ctc at the current commit is not compatible with MMI training, so I used a previous commit (4890e27b4547ba365ac9b18af1dc1fdd87d3a1b9) to build the alignments in .pt format.)
Below are the command I used and the tensorboard logs. Setup: 4 NVIDIA A100 GPUs, CUDA 11.4, driver 470.103.01, Ubuntu 20.04.
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 ./conformer_mmi/train.py \
  --world-size 4 \
  --full-libri 1 \
  --max-duration 200 \
  --num-epochs 30
```
Tensorboard, without alignment: https://user-images.githubusercontent.com/10019529/201927489-2b6f8e32-1db1-4046-b662-bf7d36d9659d.png
Tensorboard, with alignment: https://user-images.githubusercontent.com/10019529/201928365-2a8b0f99-f7f5-4282-8fcf-05ca89104847.png
We should have someone run this recipe locally to see if we can reproduce the issue.
I will try.
@dan-legit Just FYI, I can't get it to converge either; I am debugging the model.
@dan-legit Can you try this fix: https://github.com/k2-fsa/icefall/pull/700? Actually, I only added max_arcs=2147483600 to intersect_dense in icefall/mmi.py. See if it works for you. We will continue tuning this model.
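Roughly, the change amounts to passing max_arcs through to k2.intersect_dense when building the numerator and denominator lattices for the LF-MMI loss. The sketch below is only illustrative; the variable names, function structure, and output_beam value are assumptions, not the exact diff in #700.

```python
import k2


def mmi_lattices(num_graphs: k2.Fsa, den_graphs: k2.Fsa,
                 dense_fsa_vec: k2.DenseFsaVec, beam: float = 10.0):
    """Illustrative only: intersect the numerator/denominator graphs with
    the network output, passing max_arcs as in the fix."""
    num_lats = k2.intersect_dense(
        num_graphs, dense_fsa_vec, output_beam=beam,
        max_arcs=2147483600,  # the value added in PR #700
    )
    den_lats = k2.intersect_dense(
        den_graphs, dense_fsa_vec, output_beam=beam,
        max_arcs=2147483600,
    )
    return num_lats, den_lats
```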
@pkufool Okay, I'll try MMI training again with your fix max_arcs=2147483600
Training now works stably for both the 100-hour and 960-hour setups. The decoding results below are from the model trained on 100 hours. (I did not train conformer-mmi with the attention decoder.)
It seems that max_arcs=2147483600 helped the intersection find a better lattice for computing the MMI loss, so the model could converge.
| data | ctc-decoding | 1best | nbest-rescoring |
|---|---|---|---|
| eval-clean | 116.73% | 6.66% | 6.22% |
| eval-other | 149.45% | 18.05% | 17.93% |
