vosk-api
Added script to compute phoneme labels and timestamps
This adds the ability to generate phone labels and timestamps in the Vosk recognizer output.
- Updated model.cc to read the phone symbol table (i.e. phones.txt)
- The phone symbol table should be placed at graph/phones.txt in your model directory, following the standard Kaldi convention
- Updated the Kaldi recognizer to compute phoneme labels and timestamps and add them to the JSON output
- Adds phone labels and start/end timestamps to the word-level results only if you provide the phone symbol table. If you do not provide it, the recognizer generates only the existing word-level features.
- Prints silence words along with the corresponding phone information. "Gaps", i.e. silences with a duration of 0 seconds that have no corresponding phone information, are filtered out.
- MBR decoding is disabled only for phone information extraction so that the word and phone outputs align; if you do not need phone output, you can still get word-level results from MBR decoding.
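The zero-duration-gap filtering described above can be sketched in Python on the recognizer's word-level entries. This is a minimal illustration: the sample entries below are assumptions made up for the example, but the field names match the JSON output shown in this PR.

```python
# Hypothetical word-level entries mimicking the recognizer's output;
# "<eps>" marks silence, as in the example output in this PR.
raw = [
    {"word": "<eps>", "start": 0.45, "end": 0.45},  # zero-duration gap
    {"word": "THE",   "start": 0.45, "end": 0.60},
    {"word": "<eps>", "start": 1.20, "end": 1.26},  # real silence, kept
]

def drop_zero_gaps(entries):
    """Drop silence entries with 0-second duration, as the PR describes."""
    return [e for e in entries
            if not (e["word"] == "<eps>" and e["end"] - e["start"] == 0.0)]

kept = drop_zero_gaps(raw)
# The zero-duration "<eps>" entry is removed; real silences survive.
```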
Output looks like
{
"result" : [{
"conf" : 0.997802,
"end" : 0.450000,
"phone_end" : [0.450000],
"phone_label" : ["SIL"],
"phone_start" : [0.000000],
"start" : 0.000000,
"word" : "<eps>"
}, {
"conf" : 0.997153,
"end" : 0.600000,
"phone_end" : [0.540000, 0.600000],
"phone_label" : ["DH_B", "AH1_E"],
"phone_start" : [0.450000, 0.540000],
"start" : 0.450000,
"word" : "THE"
}, {
"conf" : 0.553237,
"end" : 1.200000,
"phone_end" : [0.720000, 0.810000, 0.870000, 0.930000, 0.990000, 1.080000, 1.110000, 1.200000],
"phone_label" : ["S_B", "T_I", "UW1_I", "D_I", "AH0_I", "N_I", "T_I", "S_E"],
"phone_start" : [0.600000, 0.720000, 0.810000, 0.870000, 0.930000, 0.990000, 1.080000, 1.110000],
"start" : 0.600000,
"word" : "STUDENT'S"
}, {
"conf" : 0.922575,
"end" : 1.260000,
"phone_end" : [1.260130],
"phone_label" : ["SIL"],
"phone_start" : [1.200130],
"start" : 1.200130,
"word" : "<eps>"
}, {
"conf" : 1.000000,
"end" : 1.800000,
"phone_end" : [1.440000, 1.500000, 1.590000, 1.680000, 1.800000],
"phone_label" : ["S_B", "T_I", "AH1_I", "D_I", "IY0_E"],
"phone_start" : [1.260000, 1.440000, 1.500000, 1.590000, 1.680000],
"start" : 1.260000,
"word" : "STUDY"
}, {
"conf" : 1.000000,
"end" : 1.860000,
"phone_end" : [1.860000],
"phone_label" : ["AH0_S"],
"phone_start" : [1.800000],
"start" : 1.800000,
"word" : "A"
}, {
"conf" : 1.000000,
"end" : 2.190000,
"phone_end" : [1.980000, 2.100000, 2.190000],
"phone_label" : ["L_B", "AA1_I", "T_E"],
"phone_start" : [1.860000, 1.980000, 2.100000],
"start" : 1.860000,
"word" : "LOT"
}, {
"conf" : 1.000000,
"end" : 2.880000,
"phone_end" : [2.880000],
"phone_label" : ["SIL"],
"phone_start" : [2.190000],
"start" : 2.190000,
"word" : "<eps>"
}],
"text" : " THE STUDENT'S STUDY A LOT"
}
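For consumers of this output, the per-word phone arrays can be flattened into a single phone-level timeline. A minimal Python sketch, using a trimmed copy of the example output above:

```python
import json

# A trimmed copy of the example output above (same field names).
result_json = """
{
  "result": [
    {"conf": 0.997802, "start": 0.0, "end": 0.45,
     "word": "<eps>", "phone_label": ["SIL"],
     "phone_start": [0.0], "phone_end": [0.45]},
    {"conf": 0.997153, "start": 0.45, "end": 0.6,
     "word": "THE", "phone_label": ["DH_B", "AH1_E"],
     "phone_start": [0.45, 0.54], "phone_end": [0.54, 0.6]}
  ],
  "text": " THE"
}
"""

def phone_timeline(result):
    """Flatten the per-word phone arrays into (label, start, end) tuples."""
    timeline = []
    for w in result["result"]:
        timeline.extend(zip(w["phone_label"], w["phone_start"], w["phone_end"]))
    return timeline

timeline = phone_timeline(json.loads(result_json))
# -> [('SIL', 0.0, 0.45), ('DH_B', 0.45, 0.54), ('AH1_E', 0.54, 0.6)]
```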
Thank you, I'll try to merge it in the coming week.
Thanks @nshmyrev! As an update, in our latest commit we have implemented a separate function to compute word and phone results if you want to generate phone-level results aligned with word results (with confidences). The result option can now be configured using SetResultOptions() with words or phones as input. The C script under c/test_phone_results.c provides an example of how to set this option. Hope this looks much better!
Dear @nshmyrev , this functionality by @rutujaubale is fantastic!
I believe that many people would hugely appreciate it if you merged it soon, as you previously planned. Thank you very much for your efforts.
I found this pull request when I was looking into the same feature. I hope it can be merged soon, while it still has no conflicts with the base branch.
That would be greatly appreciated!
Hi @rutujaubale, I'm trying to rebuild Vosk with your modification, but I get an error when rebuilding it.
It seems that the function you are using, CompactLatticeToWordProns, is currently defined neither in the Kaldi fork Vosk uses (https://github.com/alphacep/kaldi/blob/master/src/lat/lattice-functions.cc) nor in official Kaldi (https://github.com/kaldi-asr/kaldi/blob/master/src/lat/lattice-functions.cc). But the Kaldi docs still have info about this function (https://kaldi-asr.org/doc/namespacekaldi.html#a8a2110207264ab1d31c2b04150541834).
Could you please let me know which version of Kaldi you are using? Or how to build Vosk with your modification?
sh-4.1# KALDI_ROOT=/opt/kaldi make
g++ -g -O3 -std=c++17 -Wno-deprecated-declarations -fPIC -DFST_NO_DYNAMIC_LINKING -I. -I/opt/kaldi/src -I/opt/kaldi/tools/openfst/include -I/opt/kaldi/tools/OpenBLAS/install/include -c -o kaldi_recognizer.o kaldi_recognizer.cc
kaldi_recognizer.cc: In function ‘void ComputePhoneInfo(const kaldi::TransitionModel&, const CompactLattice&, const fst::SymbolTable&, const fst::SymbolTable&, std::vector<std::vector<std::basic_string<char> > >*, std::vector<std::vector<int> >*)’:
kaldi_recognizer.cc:425:12: error: ‘CompactLatticeToWordProns’ is not a member of ‘kaldi’
kaldi::CompactLatticeToWordProns(tmodel, best_path, &words_ph_ids, &times_lat, &lengths, &prons, phone_lengths);
^~~~~~~~~~~~~~~~~~~~~~~~~
kaldi_recognizer.cc:425:12: note: suggested alternative: ‘CompactLatticeToWordAlignment’
kaldi::CompactLatticeToWordProns(tmodel, best_path, &words_ph_ids, &times_lat, &lengths, &prons, phone_lengths);
^~~~~~~~~~~~~~~~~~~~~~~~~
CompactLatticeToWordAlignment
make: *** [kaldi_recognizer.o] Error 1
@zhenxili96 I had to include lat/lattice-functions-transition-model.h in kaldi_recognizer.h to get it working, like so:
#include "lat/lattice-functions-transition-model.h"
Thanks @mmende, that really helped.
Hello, I also found this pull request while looking for the same feature. I figured I would leave a comment (given it has been a while, and no official comment appears to have been made) since it would be extremely useful to have this sort of functionality!
Hey @nshmyrev, I'd love to see this merged too! Is there anything I could help with to get it through? I've fixed the merge conflicts and tested on my branch here: https://github.com/Nathravorn/vosk-api
I also fixed a few issues with this PR's code (most notably, the result_opts_ setting was not being taken into account).
Is there any way to get both phonemes and words at the same time for Spanish? I checked the two available Spanish models; neither of them has a phones.txt.
Thanks and appreciate your help!
I would also love to see this merged. I've written automatic lip synch animation software based on Vosk using word timings. The algorithm makes guesses about the timings of phonemes. It works really great, which is a testament to Vosk. But the lip synching would be much better if Vosk could return the phoneme timings.
Hey, I am looking for this exact thing! Is it possible that this is open source? I am trying to add lip syncing to TTS by listening to the audio stream and parsing out the phonemes. There is a project, https://github.com/DanielSWolf/rhubarb-lip-sync, that parses the audio into visemes. I would love to be able to do that live. If Vosk merged this feature I would be able to get it working with an engine I'm already using.
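For the lip-sync use case, the phone labels and timestamps from this PR could feed a phoneme-to-viseme step. A minimal Python sketch: the viseme class names and groupings below are illustrative assumptions (not part of this PR or of Rhubarb), but the label format (Kaldi position suffixes _B/_I/_E/_S and ARPAbet stress digits) matches the example output in this PR.

```python
# Illustrative only: a simplified phoneme-to-viseme mapping. The class
# names and groupings are assumptions for demonstration purposes.
VISEMES = {
    "P": "closed", "B": "closed", "M": "closed",
    "F": "teeth-on-lip", "V": "teeth-on-lip",
    "AA": "open", "AH": "open", "AE": "open",
    "UW": "rounded", "OW": "rounded",
    "SIL": "rest",
}

def base_phone(label):
    """Strip the Kaldi position suffix (_B/_I/_E/_S) and stress digits."""
    core = label.split("_")[0]
    return core.rstrip("0123456789")

def to_visemes(phones):
    """Map (label, start, end) phone tuples to (viseme, start, end)."""
    return [(VISEMES.get(base_phone(l), "other"), s, e) for l, s, e in phones]

frames = to_visemes([("SIL", 0.0, 0.45), ("DH_B", 0.45, 0.54), ("AH1_E", 0.54, 0.6)])
# -> [('rest', 0.0, 0.45), ('other', 0.45, 0.54), ('open', 0.54, 0.6)]
```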
@Nathravorn, maybe just go ahead and open your own PR if you have a mergeable version of this code? I would love to see this feature released and to use it!
@kevin, Rhubarb looks cool. I will check it out more later.