
LSTM model

Open danpovey opened this issue 4 years ago • 10 comments

Right now decode.py is giving a very bad WER, over 90%, even after tuning bias_penalty. It would be nice if someone could try using an LSTM model (needed to give spiky outputs so CTC can work).
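For context on why a non-spiky acoustic model can lead to empty or garbage hypotheses: CTC decoding collapses repeated symbols and drops blanks, so if the blank symbol dominates every frame, a greedy decode emits nothing. A toy illustration in plain Python (not the actual snowfall decoder, which uses FSA intersection):

```python
def ctc_greedy_decode(frames, blank=0):
    """Greedy CTC: argmax per frame, collapse repeats, drop blanks."""
    best = [max(range(len(f)), key=lambda s: f[s]) for f in frames]
    out, prev = [], None
    for s in best:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

# Blank-dominated (non-spiky) posteriors: the hypothesis is empty.
flat = [[0.6, 0.2, 0.2]] * 4
assert ctc_greedy_decode(flat) == []

# Spiky posteriors: non-blank symbols win on a few frames.
spiky = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1],
         [0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
assert ctc_greedy_decode(spiky) == [1, 2]
```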

Dunno if this is the issue or it's some other bug.

danpovey avatar Dec 16 '20 03:12 danpovey

.. right now I'm busy on a decoder (intersect_pruned) rewrite to use less memory.

danpovey avatar Dec 16 '20 03:12 danpovey

Community help welcomed.

danpovey avatar Dec 16 '20 03:12 danpovey

I'll look into it. I'm also looking at other aspects of the recipe - e.g. we're currently using position-dependent phones, so we're getting 4x the number of output symbols, and I think that could be hurting performance.
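For reference, position-dependent phones carry Kaldi-style word-position suffixes (_B, _I, _E, _S), so mapping them back to position-independent phones roughly divides the output vocabulary by four. A minimal sketch of such a mapping (assuming Kaldi-style suffixes; this is not the actual snowfall code):

```python
import re

def strip_position(phone: str) -> str:
    # Remove a Kaldi-style word-position suffix (_B, _I, _E, _S),
    # e.g. "AH0_B" -> "AH0"; other symbols are left untouched.
    return re.sub(r"_(B|I|E|S)$", "", phone)

dep = ["AH0_B", "K_I", "T_E", "SIL", "AE1_S"]
indep = [strip_position(p) for p in dep]
assert indep == ["AH0", "K", "T", "SIL", "AE1"]
```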

pzelasko avatar Dec 17 '20 02:12 pzelasko

Great!


danpovey avatar Dec 17 '20 02:12 danpovey

I changed the phones to position-independent, and here is an example of what the posteriors look like for an unmodified model (the first is the "as-is" output, the second is exp(posteriors), to check whether there are spikes or not):

[figures: two posterior-matrix plots - the raw log-posteriors and exp(posteriors)]

To me the acoustic model looks okay-ish (although I should probably further shrink the output layers by removing the disambiguation symbols, which are the last few rows in the posterior matrix).

pzelasko avatar Dec 17 '20 04:12 pzelasko

(this is in the middle of training, i.e. checkpoint from epoch 5)

pzelasko avatar Dec 17 '20 04:12 pzelasko

Update: never mind, it was not the latest k2, and the WER for that model is still 99%. Almost all the hypotheses are empty texts. Will keep looking.

pzelasko avatar Dec 17 '20 04:12 pzelasko

It seems there is a small bug in train.py and decode.py. In train.py:

    supervision_segments = supervision_segments[indices]

    texts = supervisions['text']
    assert feature.ndim == 3
    # print(supervision_segments[:, 1] + supervision_segments[:, 2])

texts is not reordered by indices, while supervision_segments has been reordered. decode.py has the same bug. After reordering texts by indices and retraining the model, the WER drops to 87.01% after 5 epochs.
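A minimal sketch of the fix, using toy stand-in data rather than the actual train.py tensors: when the supervision segments are permuted (k2 wants them sorted by decreasing duration), the same permutation must be applied to the parallel texts list:

```python
# Toy stand-in for the train.py logic: segments are sorted by
# decreasing duration, and 'texts' must follow the same order.
durations = [3, 7, 5]                   # per-utterance frame counts
texts = ["short", "longest", "middle"]  # parallel to durations

# argsort by decreasing duration (the permutation applied to segments)
indices = sorted(range(len(durations)), key=lambda i: -durations[i])

sorted_durations = [durations[i] for i in indices]
# The fix: reorder texts with the SAME indices, keeping them aligned.
sorted_texts = [texts[i] for i in indices]

assert sorted_durations == [7, 5, 3]
assert sorted_texts == ["longest", "middle", "short"]
```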

Curisan avatar Dec 18 '20 06:12 Curisan

FYI I ran the full 960h librispeech training (with speed perturbation) with the CTC graph, the WER is:

2021-01-03 08:53:22,337 INFO [decode.py:217] %WER 10.05% [5285 / 52576, 725 ins, 436 del, 4124 sub ]

I think the difference between this and the train-clean-100 result (12-13%) is really small; it could be due to a small (or "weak") model architecture.
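As a sanity check on the decode.py summary line above: WER is (insertions + deletions + substitutions) divided by the number of reference words, and the reported counts do add up:

```python
ins, dels, subs = 725, 436, 4124  # error counts from the log line
ref_words = 52576                 # total reference words

errors = ins + dels + subs
assert errors == 5285             # matches [5285 / 52576, ...]

wer = 100.0 * errors / ref_words
assert round(wer, 2) == 10.05     # matches the reported %WER 10.05%
```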

pzelasko avatar Jan 04 '21 18:01 pzelasko

Thanks! That difference is what I would expect. We'd need a larger model to get more benefit, I think.


danpovey avatar Jan 05 '21 05:01 danpovey