pocketsphinx
New force-alignment API and two-pass alignment to get phone/state durations
Now you can (relatively) easily do a second pass of alignment to get phone durations after decoding or word alignment.
Note that this ignores the previously existing word boundaries for the moment, which probably isn't ideal. We should be able to constrain the state alignment to respect them without much trouble. In theory that should mostly just speed up alignment (the second pass is a bit slow) and reduce memory consumption (it is currently quite large).
Also, yeah, word alignment now uses FSG search, like SoundSwallower, so it's really fast and handles silence and alternate pronunciations for you.
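For anyone who wants to follow along, here is a minimal sketch of what the two-pass flow might look like through the Python API. The file name, the phrase, the default model settings, and the exact Alignment attribute names are assumptions on my part rather than anything confirmed in this thread; cython/test/alignment_test.py is the real reference.

```python
import wave

from pocketsphinx import Decoder

# Assumption: the bundled en-us model and a 16 kHz mono WAV file;
# "goforward.wav" and "go forward ten meters" are placeholders.
decoder = Decoder(samprate=16000)

with wave.open("goforward.wav", "rb") as wav:
    audio = wav.readframes(wav.getnframes())

# First pass: force-align the known word sequence (FSG search, so optional
# silences and alternate pronunciations are handled for you).
decoder.set_align_text("go forward ten meters")
decoder.start_utt()
decoder.process_raw(audio, full_utt=True)
decoder.end_utt()
for seg in decoder.seg():
    print(seg.word, seg.start_frame, seg.end_frame)

# Second pass: switch to the alignment search (assumed to be built from the
# first-pass result when called with no argument) and run the same audio
# again to get phone and state durations.
decoder.set_alignment()
decoder.start_utt()
decoder.process_raw(audio, full_utt=True)
decoder.end_utt()
for word in decoder.get_alignment():
    print(word.name, word.start, word.duration)
```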
Excited to check this out! I'm at Interspeech and out of phase by half a day and all, but I'll take a look shortly.
No problem! The CLI for state alignment isn't quite there yet, but coming soon (tonight, I hope).
Fantastic! I also hope to try this out ASAP. I wonder whether constraining to the first pass's word boundaries will help. It seems like it can't hurt, but it would be interesting to measure how much.
It will definitely make the alignment faster. It may also make it more accurate, though I am not certain of this; I have to look at how I implemented this back in 2006: https://www.cs.cmu.edu/~dhuggins/Publications/phlab.pdf
EDIT: that paper was about forward-backward rather than alignment, so not the same thing at all. In that case I implemented something like semi-Viterbi training, setting "impossible" phone sequences to zero probability, which resulted in models that were better for alignment (but somewhat worse for recognition).
Hoping for state-level alignments, and frame-level scores as well, but LGTM and WFM.
State-level alignments are already there in the Python API; look at cython/test/alignment_test.py for an example. It is now easy to add them to the command-line front-end as well, so I'll do that (not on by default, though).
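For the impatient, a rough sketch of what reading those state-level alignments might look like, continuing from the two-pass example above. The nested iteration (words, then phones, then states) and the entry attributes are assumptions on my part, not confirmed API details; again, cython/test/alignment_test.py is the authoritative example.

```python
# Assumes the second alignment pass above has already been run.
alignment = decoder.get_alignment()
for word in alignment:
    print("word ", word.name, word.start, word.duration)
    for phone in word:
        print("phone", phone.name, phone.start, phone.duration)
        for state in phone:
            print("state", state.name, state.start, state.duration)
```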