
LSTM model

Open danpovey opened this issue 4 years ago • 10 comments

Right now decode.py is giving a very bad WER, over 90%, even after tuning bias_penalty. It would be nice if someone could try using an LSTM model (needed to give spiky outputs so CTC can work).
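For context on why a non-spiky acoustic model can lead to empty or garbage hypotheses: CTC decoding collapses repeated symbols and drops blanks, so if the blank symbol dominates every frame, a greedy decode emits nothing. A toy illustration in plain Python (not the actual snowfall decoder, which uses FSA intersection):

```python
def ctc_greedy_decode(frames, blank=0):
    """Greedy CTC: argmax per frame, collapse repeats, drop blanks."""
    best = [max(range(len(f)), key=lambda s: f[s]) for f in frames]
    out, prev = [], None
    for s in best:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

# Blank-dominated (non-spiky) posteriors: the hypothesis is empty.
flat = [[0.6, 0.2, 0.2]] * 4
assert ctc_greedy_decode(flat) == []

# Spiky posteriors: non-blank symbols win on a few frames.
spiky = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1],
         [0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
assert ctc_greedy_decode(spiky) == [1, 2]
```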

Dunno if this is the issue or it's some other bug.

danpovey avatar Dec 16 '20 03:12 danpovey

.. right now I'm busy on a decoder (intersect_pruned) rewrite to use less memory.

danpovey avatar Dec 16 '20 03:12 danpovey

Community help welcomed.

danpovey avatar Dec 16 '20 03:12 danpovey

I'll look into it. I'm also looking at other aspects of the recipe - e.g. we're currently using position-dependent phones, so we're getting 4x the number of output symbols, and I think that could be hurting performance.
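For reference, position-dependent phones carry Kaldi-style word-position suffixes (_B, _I, _E, _S), so mapping them back to position-independent phones roughly divides the output vocabulary by four. A minimal sketch of such a mapping (assuming Kaldi-style suffixes; this is not the actual snowfall code):

```python
import re

def strip_position(phone: str) -> str:
    # Remove a Kaldi-style word-position suffix (_B, _I, _E, _S),
    # e.g. "AH0_B" -> "AH0"; other symbols are left untouched.
    return re.sub(r"_(B|I|E|S)$", "", phone)

dep = ["AH0_B", "K_I", "T_E", "SIL", "AE1_S"]
indep = [strip_position(p) for p in dep]
assert indep == ["AH0", "K", "T", "SIL", "AE1"]
```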

pzelasko avatar Dec 17 '20 02:12 pzelasko

Great!


danpovey avatar Dec 17 '20 02:12 danpovey

I changed the phones to position-independent, and here is an example of what the posteriors look like for an unmodified model (the first is the "as-is" output, the second is exp(posteriors), to check whether there are spikes or not):

[figures: two posterior-matrix plots - the raw log-posteriors and exp(posteriors)]

To me the acoustic model looks okay-ish (although I should probably further shrink the output layers by removing the disambiguation symbols, which are the last few rows in the posterior matrix).

pzelasko avatar Dec 17 '20 04:12 pzelasko

(this is in the middle of training, i.e. checkpoint from epoch 5)

pzelasko avatar Dec 17 '20 04:12 pzelasko

Update: never mind, it was not the latest k2, and the WER for that model is still 99%. Almost all the hypotheses are empty texts. Will keep looking.

pzelasko avatar Dec 17 '20 04:12 pzelasko

It seems there is a small bug in train.py and decode.py. In train.py:

    supervision_segments = supervision_segments[indices]

    texts = supervisions['text']
    assert feature.ndim == 3
    # print(supervision_segments[:, 1] + supervision_segments[:, 2])

texts is not reordered by indices, while supervision_segments has been reordered. decode.py has the same bug. After reordering texts by indices and retraining the model, the WER drops to 87.01% after 5 epochs.
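A minimal sketch of the fix, using toy stand-in data rather than the actual train.py tensors: when the supervision segments are permuted (k2 wants them sorted by decreasing duration), the same permutation must be applied to the parallel texts list:

```python
# Toy stand-in for the train.py logic: segments are sorted by
# decreasing duration, and 'texts' must follow the same order.
durations = [3, 7, 5]                   # per-utterance frame counts
texts = ["short", "longest", "middle"]  # parallel to durations

# argsort by decreasing duration (the permutation applied to segments)
indices = sorted(range(len(durations)), key=lambda i: -durations[i])

sorted_durations = [durations[i] for i in indices]
# The fix: reorder texts with the SAME indices, keeping them aligned.
sorted_texts = [texts[i] for i in indices]

assert sorted_durations == [7, 5, 3]
assert sorted_texts == ["longest", "middle", "short"]
```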

Curisan avatar Dec 18 '20 06:12 Curisan

FYI I ran the full 960h librispeech training (with speed perturbation) with the CTC graph, the WER is:

2021-01-03 08:53:22,337 INFO [decode.py:217] %WER 10.05% [5285 / 52576, 725 ins, 436 del, 4124 sub ]

I think the difference between this and the train-clean-100 result (12-13%) is really small; it could be due to a small (or "weak") model architecture.
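As a sanity check on the decode.py summary line above: WER is (insertions + deletions + substitutions) divided by the number of reference words, and the reported counts do add up:

```python
ins, dels, subs = 725, 436, 4124  # error counts from the log line
ref_words = 52576                 # total reference words

errors = ins + dels + subs
assert errors == 5285             # matches [5285 / 52576, ...]

wer = 100.0 * errors / ref_words
assert round(wer, 2) == 10.05     # matches the reported %WER 10.05%
```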

pzelasko avatar Jan 04 '21 18:01 pzelasko

Thanks! That difference is what I would expect. We'd need a larger model to get more benefit, I think.


danpovey avatar Jan 05 '21 05:01 danpovey