snowfall
LSTM model
Right now decode.py is giving a terrible WER, over 90%, even after tuning bias_penalty. It would be nice if someone could try an LSTM model (an LSTM is needed to give spiky outputs so that CTC can work); a rough sketch of what I mean is below.
I don't know whether this is the issue or some other bug.
Right now I'm busy rewriting the decoder (intersect_pruned) to use less memory.
Community help is welcome.
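An illustrative sketch of the kind of model I mean, in PyTorch (this is not snowfall's actual code; the class name and sizes are made up): a stack of bidirectional LSTM layers followed by a linear projection to per-frame log-posteriors, which is what CTC training expects.

import torch.nn as nn
import torch.nn.functional as F

class LstmCtcModel(nn.Module):
    def __init__(self, num_features, num_classes, hidden_size=512, num_layers=4):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_features, hidden_size=hidden_size,
                            num_layers=num_layers, bidirectional=True,
                            batch_first=True)
        self.output = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, time, num_features) -> (batch, time, num_classes)
        y, _ = self.lstm(x)
        return F.log_softmax(self.output(y), dim=-1)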
I'll look into it. I'm also looking at other aspects of the recipe - e.g. we're currently using position-dependent phones, so we're getting 4x the number of output symbols, and I think that could be hurting us (see the sketch below).
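For context, Kaldi-style position-dependent phones carry _B/_I/_E/_S suffixes (begin/internal/end/singleton), so each base phone turns into four output symbols. A toy sketch of mapping them back to position-independent phones (illustrative only, not the recipe's actual code):

def to_position_independent(phone: str) -> str:
    # Strip the word-position suffix, if any, e.g. "AH_B" -> "AH".
    for suffix in ("_B", "_I", "_E", "_S"):
        if phone.endswith(suffix):
            return phone[: -len(suffix)]
    return phone

assert to_position_independent("AH_B") == "AH"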
Great!
I changed the phones to position-independent, and here is what an example of the posteriors looks like in an unmodified model (the first plot is the "as-is" output, the second is exp(posteriors), to check whether there are spikes or not):
[plots: raw network output and exp(posteriors)]
To me the acoustic model looks okay-ish (although I should probably further shrink the output layer by removing the disambiguation symbols, which are the last few rows in the posterior matrix).
(This is from the middle of training, i.e. the checkpoint from epoch 5.)
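For reference, a minimal sketch of how such plots can be made (assuming nnet_output is a (num_frames, num_symbols) torch tensor of log-posteriors for one utterance; the variable name is an assumption, not snowfall's API):

import matplotlib.pyplot as plt

# nnet_output: (num_frames, num_symbols) log-posteriors for one utterance.
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.imshow(nnet_output.cpu().numpy().T, aspect="auto", origin="lower")
ax1.set_title("as-is output (log-posteriors)")
ax2.imshow(nnet_output.exp().cpu().numpy().T, aspect="auto", origin="lower")
ax2.set_title("exp(posteriors)")
ax2.set_xlabel("frame")
plt.show()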
Update: never mind, it was not the latest k2, and the WER for that model is still 99%. Almost all the hypotheses are empty. I'll keep looking.
It seems there is a small bug in train.py and decode.py. In train.py:
supervision_segments = supervision_segments[indices]  # segments get reordered by indices...
texts = supervisions['text']  # ...but the texts keep their original order
assert feature.ndim == 3
# print(supervision_segments[:, 1] + supervision_segments[:, 2])
The texts are not reordered by indices, while supervision_segments has been reordered, so the segments and their transcripts get mismatched. decode.py has the same bug. After reordering texts by indices and retraining the model, the WER dropped to 87.01% after 5 epochs. A sketch of the fix is below.
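A minimal sketch of the fix (assuming, as in the snippet above, that texts is a list of transcript strings and indices is the permutation used to sort the segments):

supervision_segments = supervision_segments[indices]
texts = supervisions['text']
# Apply the same permutation to the transcripts so each segment
# stays aligned with its text.
texts = [texts[int(i)] for i in indices]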
FYI I ran the full 960h librispeech training (with speed perturbation) with the CTC graph, the WER is:
2021-01-03 08:53:22,337 INFO [decode.py:217] %WER 10.05% [5285 / 52576, 725 ins, 436 del, 4124 sub ]
I think the difference between this and the train-clean-100 result (12~13%) is really small; it could be due to a small (or "weak") model architecture.
Thanks! That difference is what I would expect. We'd need a larger model to get more benefit, I think.