vosk-api
Added script to compute phoneme labels and timestamps
This adds the ability to generate phone labels and timestamps in the Vosk recognizer output.
- Updated model.cc to read the phone symbol table (i.e. phones.txt)
- The phone symbol table should be placed at graph/phones.txt in your model directory, following the standard Kaldi convention
- Updated the Kaldi recognizer to compute phoneme labels and timestamps and add them to the JSON output
- Adds phone labels and start/end timestamps to the word-level results only if you provide the phone symbol table. If you do not provide it, the recognizer generates only the existing word-level features.
- Prints silence words along with the corresponding phone information. "Gaps", i.e. silences with a duration of 0 seconds that have no corresponding phone information, are filtered out.
- MBR decoding is disabled only for phone information extraction so that the word and phone outputs align; if you do not need phone output, you can still get word-level results from MBR decoding.
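The zero-duration-gap filtering described above can be sketched in Python on the recognizer's word-level entries. This is a minimal illustration: the sample entries below are assumptions made up for the example, but the field names match the JSON output shown in this PR.

```python
# Hypothetical word-level entries mimicking the recognizer's output;
# "<eps>" marks silence, as in the example output in this PR.
raw = [
    {"word": "<eps>", "start": 0.45, "end": 0.45},  # zero-duration gap
    {"word": "THE",   "start": 0.45, "end": 0.60},
    {"word": "<eps>", "start": 1.20, "end": 1.26},  # real silence, kept
]

def drop_zero_gaps(entries):
    """Drop silence entries with 0-second duration, as the PR describes."""
    return [e for e in entries
            if not (e["word"] == "<eps>" and e["end"] - e["start"] == 0.0)]

kept = drop_zero_gaps(raw)
# The zero-duration "<eps>" entry is removed; real silences survive.
```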
Output looks like
{
"result" : [{
"conf" : 0.997802,
"end" : 0.450000,
"phone_end" : [0.450000],
"phone_label" : ["SIL"],
"phone_start" : [0.000000],
"start" : 0.000000,
"word" : "<eps>"
}, {
"conf" : 0.997153,
"end" : 0.600000,
"phone_end" : [0.540000, 0.600000],
"phone_label" : ["DH_B", "AH1_E"],
"phone_start" : [0.450000, 0.540000],
"start" : 0.450000,
"word" : "THE"
}, {
"conf" : 0.553237,
"end" : 1.200000,
"phone_end" : [0.720000, 0.810000, 0.870000, 0.930000, 0.990000, 1.080000, 1.110000, 1.200000],
"phone_label" : ["S_B", "T_I", "UW1_I", "D_I", "AH0_I", "N_I", "T_I", "S_E"],
"phone_start" : [0.600000, 0.720000, 0.810000, 0.870000, 0.930000, 0.990000, 1.080000, 1.110000],
"start" : 0.600000,
"word" : "STUDENT'S"
}, {
"conf" : 0.922575,
"end" : 1.260000,
"phone_end" : [1.260130],
"phone_label" : ["SIL"],
"phone_start" : [1.200130],
"start" : 1.200130,
"word" : "<eps>"
}, {
"conf" : 1.000000,
"end" : 1.800000,
"phone_end" : [1.440000, 1.500000, 1.590000, 1.680000, 1.800000],
"phone_label" : ["S_B", "T_I", "AH1_I", "D_I", "IY0_E"],
"phone_start" : [1.260000, 1.440000, 1.500000, 1.590000, 1.680000],
"start" : 1.260000,
"word" : "STUDY"
}, {
"conf" : 1.000000,
"end" : 1.860000,
"phone_end" : [1.860000],
"phone_label" : ["AH0_S"],
"phone_start" : [1.800000],
"start" : 1.800000,
"word" : "A"
}, {
"conf" : 1.000000,
"end" : 2.190000,
"phone_end" : [1.980000, 2.100000, 2.190000],
"phone_label" : ["L_B", "AA1_I", "T_E"],
"phone_start" : [1.860000, 1.980000, 2.100000],
"start" : 1.860000,
"word" : "LOT"
}, {
"conf" : 1.000000,
"end" : 2.880000,
"phone_end" : [2.880000],
"phone_label" : ["SIL"],
"phone_start" : [2.190000],
"start" : 2.190000,
"word" : "<eps>"
}],
"text" : " THE STUDENT'S STUDY A LOT"
}
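For consumers of this output, the per-word phone arrays can be flattened into a single phone-level timeline. A minimal Python sketch, using a trimmed copy of the example output above:

```python
import json

# A trimmed copy of the example output above (same field names).
result_json = """
{
  "result": [
    {"conf": 0.997802, "start": 0.0, "end": 0.45,
     "word": "<eps>", "phone_label": ["SIL"],
     "phone_start": [0.0], "phone_end": [0.45]},
    {"conf": 0.997153, "start": 0.45, "end": 0.6,
     "word": "THE", "phone_label": ["DH_B", "AH1_E"],
     "phone_start": [0.45, 0.54], "phone_end": [0.54, 0.6]}
  ],
  "text": " THE"
}
"""

def phone_timeline(result):
    """Flatten the per-word phone arrays into (label, start, end) tuples."""
    timeline = []
    for w in result["result"]:
        timeline.extend(zip(w["phone_label"], w["phone_start"], w["phone_end"]))
    return timeline

timeline = phone_timeline(json.loads(result_json))
# -> [('SIL', 0.0, 0.45), ('DH_B', 0.45, 0.54), ('AH1_E', 0.54, 0.6)]
```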
Thank you, I'll try to merge it in the coming week.
Thanks @nshmyrev! As an update, in our latest commit we have implemented a separate function to compute word and phone results if you want to generate phone-level results aligned with word results (with confidences). The result option can now be configured using SetResultOptions() with words or phones as input. The C script under c/test_phone_results.c provides an example of how to set this option. Hope this looks much better!
Dear @nshmyrev , this functionality by @rutujaubale is fantastic!
I believe that many people would hugely appreciate it if you merged it soon, as you previously planned. Thank you very much for your efforts.
I found this pull request when I was looking into the same feature. I hope it can be merged soon, while it still has no conflicts with the base branch.
That would be greatly appreciated!
Hi @rutujaubale, I'm trying to rebuild Vosk with your modification, but I get an error when rebuilding it.
It seems that the function you are using, CompactLatticeToWordProns, is currently defined neither in the Kaldi fork Vosk uses (https://github.com/alphacep/kaldi/blob/master/src/lat/lattice-functions.cc) nor in official Kaldi (https://github.com/kaldi-asr/kaldi/blob/master/src/lat/lattice-functions.cc). But the Kaldi docs still have info about this function (https://kaldi-asr.org/doc/namespacekaldi.html#a8a2110207264ab1d31c2b04150541834).
Could you please let me know which version of Kaldi you are using? Or how to build Vosk with your modification?
sh-4.1# KALDI_ROOT=/opt/kaldi make
g++ -g -O3 -std=c++17 -Wno-deprecated-declarations -fPIC -DFST_NO_DYNAMIC_LINKING -I. -I/opt/kaldi/src -I/opt/kaldi/tools/openfst/include -I/opt/kaldi/tools/OpenBLAS/install/include -c -o kaldi_recognizer.o kaldi_recognizer.cc
kaldi_recognizer.cc: In function ‘void ComputePhoneInfo(const kaldi::TransitionModel&, const CompactLattice&, const fst::SymbolTable&, const fst::SymbolTable&, std::vector<std::vector<std::basic_string<char> > >*, std::vector<std::vector<int> >*)’:
kaldi_recognizer.cc:425:12: error: ‘CompactLatticeToWordProns’ is not a member of ‘kaldi’
kaldi::CompactLatticeToWordProns(tmodel, best_path, &words_ph_ids, &times_lat, &lengths, &prons, phone_lengths);
^~~~~~~~~~~~~~~~~~~~~~~~~
kaldi_recognizer.cc:425:12: note: suggested alternative: ‘CompactLatticeToWordAlignment’
kaldi::CompactLatticeToWordProns(tmodel, best_path, &words_ph_ids, &times_lat, &lengths, &prons, phone_lengths);
^~~~~~~~~~~~~~~~~~~~~~~~~
CompactLatticeToWordAlignment
make: *** [kaldi_recognizer.o] Error 1
@zhenxili96 I had to include lat/lattice-functions-transition-model.h in kaldi_recognizer.h to get it working, like so:
#include "lat/lattice-functions-transition-model.h"
Thanks @mmende, that really helped.
Hello, I also found this pull request while looking for the same feature. I figured I would leave a comment (given it has been a while, and no official comment appears to have been made) since it would be extremely useful to have this sort of functionality!
Hey @nshmyrev, I'd love to see this merged too! Is there anything I could help with to get it through? I've fixed the merge conflicts and tested on my branch here: https://github.com/Nathravorn/vosk-api
I also fixed a few issues with this PR's code (most notably, the result_opts_ setting was not being taken into account).
Is there any way to get both phonemes and words at the same time for Spanish? I checked the two available Spanish models; neither of them has a phones.txt.
Thanks and appreciate your help!
I would also love to see this merged. I've written automatic lip synch animation software based on Vosk using word timings. The algorithm makes guesses about the timings of phonemes. It works really great, which is a testament to Vosk. But the lip synching would be much better if Vosk could return the phoneme timings.
Hey, I am looking for this exact thing! Is it possible that this is open source? I am trying to add lip syncing to TTS by listening to the audio stream and parsing out the phonemes. There is a project, https://github.com/DanielSWolf/rhubarb-lip-sync, that parses the audio into visemes. I would love to be able to do that live. If Vosk merged this feature I would be able to get it working with an engine I'm already using.
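For the lip-sync use case, the phone labels and timestamps from this PR could feed a phoneme-to-viseme step. A minimal Python sketch: the viseme class names and groupings below are illustrative assumptions (not part of this PR or of Rhubarb), but the label format (Kaldi position suffixes _B/_I/_E/_S and ARPAbet stress digits) matches the example output in this PR.

```python
# Illustrative only: a simplified phoneme-to-viseme mapping. The class
# names and groupings are assumptions for demonstration purposes.
VISEMES = {
    "P": "closed", "B": "closed", "M": "closed",
    "F": "teeth-on-lip", "V": "teeth-on-lip",
    "AA": "open", "AH": "open", "AE": "open",
    "UW": "rounded", "OW": "rounded",
    "SIL": "rest",
}

def base_phone(label):
    """Strip the Kaldi position suffix (_B/_I/_E/_S) and stress digits."""
    core = label.split("_")[0]
    return core.rstrip("0123456789")

def to_visemes(phones):
    """Map (label, start, end) phone tuples to (viseme, start, end)."""
    return [(VISEMES.get(base_phone(l), "other"), s, e) for l, s, e in phones]

frames = to_visemes([("SIL", 0.0, 0.45), ("DH_B", 0.45, 0.54), ("AH1_E", 0.54, 0.6)])
# -> [('rest', 0.0, 0.45), ('other', 0.45, 0.54), ('open', 0.54, 0.6)]
```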
@Nathravorn, maybe just go ahead and open your own PR if you have a mergeable version of this code? I would love to see this feature released and to use it!
@kevin, Rhubarb looks cool. I will check it out more later.