
MMI Training with Librispeech recipe doesn't converge with or without alignment


Hi, thanks as always for your research.

I've been trying to train conformer_mmi with the LibriSpeech recipe at the current commit (e334e570d838cbe15188201f6bd47c009b9292be).

The only thing I changed in conformer_mmi/train.py is the dataset-preparation code, to match the current lhotse package. I used the same dataset-loading code as in conformer_ctc/train.py (see the sketch below).
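For reference, the conformer_ctc data-loading pattern looks roughly like the sketch below. LibriSpeechAsrDataModule and its cut/dataloader methods come from the recipe's asr_datamodule.py; the exact method names may differ between icefall commits, so treat this as an approximation rather than the exact diff I applied.

```python
# Rough sketch of the lhotse-based data loading in conformer_ctc/train.py.
# `args` and `params` come from the training script's argument parsing and
# params dict; method names are approximate and may differ between commits.
from asr_datamodule import LibriSpeechAsrDataModule

librispeech = LibriSpeechAsrDataModule(args)

train_cuts = librispeech.train_clean_100_cuts()
if params.full_libri:
    # Add the remaining 860 hours for the full 960h training set.
    train_cuts += librispeech.train_clean_360_cuts()
    train_cuts += librispeech.train_other_500_cuts()
train_dl = librispeech.train_dataloaders(train_cuts)

valid_cuts = librispeech.dev_clean_cuts() + librispeech.dev_other_cuts()
valid_dl = librispeech.valid_dataloaders(valid_cuts)
```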

Training seemed to go well at first, but the training loss started oscillating up and down and did not converge after a certain point, so I tried using alignments from the conformer_ctc LibriSpeech model. That did not work well either. I suspected the default lr_factor (5.0) might be too high, so I decreased it to 1.0, but the loss still did not converge.
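For context, lr_factor scales the Noam-style learning-rate schedule these conformer recipes use, so lowering it shrinks the whole learning-rate curve. Assuming the standard Noam formula (the model_size and warm_step defaults below are assumptions, not values read from the recipe), the effective learning rate is roughly:

```python
# Rough sketch of how lr_factor scales a Noam learning-rate schedule.
def noam_lr(step: int, lr_factor: float,
            model_size: int = 512, warm_step: int = 80000) -> float:
    # The rate grows linearly during warm-up, then decays as 1/sqrt(step);
    # lr_factor multiplies the entire curve.
    step = max(step, 1)  # avoid division by zero at step 0
    return (
        lr_factor
        * model_size ** (-0.5)
        * min(step ** (-0.5), step * warm_step ** (-1.5))
    )
```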

(Note that conformer_ctc training was successful at the current commit, so I used conformer_ctc/ali.py to create ali_dir. However, the alignment output format from conformer_ctc at the current commit is not compatible with MMI training, so I used a previous commit (4890e27b4547ba365ac9b18af1dc1fdd87d3a1b9) to build the alignments in .pt format.)

Below are the command I used and the tensorboard logs. Setup: 4 NVIDIA A100 GPUs, CUDA 11.4, driver 470.103.01, Ubuntu 20.04.

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 ./conformer_mmi/train.py \
  --world-size 4 \
  --full-libri 1 \
  --max-duration 200 \
  --num-epochs 30
```

Without alignment: [tensorboard screenshot]

With alignment: [tensorboard screenshot]

dan-legit avatar Nov 15 '22 13:11 dan-legit

We should have someone run this recipe locally to see if we can reproduce the issue.


danpovey avatar Nov 15 '22 15:11 danpovey

We should have someone run this recipe locally to see if we can reproduce the issue.

I will try.

pkufool avatar Nov 16 '22 02:11 pkufool

@dan-legit Just FYI, I can't get it to converge either; I am debugging the model.

pkufool avatar Nov 21 '22 02:11 pkufool

@dan-legit Can you try this fix: https://github.com/k2-fsa/icefall/pull/700? Actually, I only added max_arcs=2147483600 to intersect_dense in icefall/mmi.py. See if it works for you. We will continue tuning this model.
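For anyone reading along, here is a minimal sketch of what such a change amounts to. It is illustrative only, not the exact contents of icefall/mmi.py or of that PR; the beam value, function name, and variable names are assumptions.

```python
import k2
import torch


def mmi_loss_sketch(dense_fsa_vec: k2.DenseFsaVec,
                    num_graphs: k2.Fsa,
                    den_graphs: k2.Fsa,
                    output_beam: float = 10.0) -> torch.Tensor:
    """Illustrative MMI loss: -(numerator score - denominator score).

    num_graphs/den_graphs are FsaVecs with one graph per utterance;
    dense_fsa_vec wraps the network's log-probabilities.
    """
    num_lats = k2.intersect_dense(
        num_graphs,
        dense_fsa_vec,
        output_beam=output_beam,
        max_arcs=2147483600,  # the cap raised by the fix mentioned above
    )
    den_lats = k2.intersect_dense(
        den_graphs,
        dense_fsa_vec,
        output_beam=output_beam,
        max_arcs=2147483600,
    )
    num_scores = num_lats.get_tot_scores(log_semiring=True,
                                         use_double_scores=True)
    den_scores = den_lats.get_tot_scores(log_semiring=True,
                                         use_double_scores=True)
    return -(num_scores - den_scores).sum()
```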

pkufool avatar Nov 23 '22 07:11 pkufool

@pkufool Okay, I'll try MMI training again with your fix max_arcs=2147483600

dan-legit avatar Nov 24 '22 04:11 dan-legit

@dan-legit Can you try this fix: #700? Actually, I only added max_arcs=2147483600 to intersect_dense in icefall/mmi.py. See if it works for you. We will continue tuning this model.

Training now works stably for both the 100-hour and the 960-hour setups. The decoding results below are from the 100-hour trained model. (I did not train conformer_mmi with the attention decoder.)

It seems that max_arcs=2147483600 helped produce better lattices for computing the MMI loss, so the model could converge.

| data       | ctc-decoding | 1best  | nbest-rescoring |
|------------|--------------|--------|-----------------|
| eval-clean | 116.73%      | 6.66%  | 6.22%           |
| eval-other | 149.45%      | 18.05% | 17.93%          |

dan-legit avatar Nov 26 '22 06:11 dan-legit