snowfall
Building lexicons in Python
The current setup inherits building lexicon FSTs from Kaldi. I think it makes sense to have the ability to build it directly in Python, which should make building new recipes easier, as well as (eventually) allow for some things like dynamic expansion of the lexicon without leaving Python.
The data structure would basically resemble that of Kaldi, e.g.:
from typing import List, Tuple

import k2  # needed by compile_lexicon_fst()


class Dict:
    # a list of words and their phone transcripts, possibly with scores to resemble lexiconp.txt
    lexicon: List[Tuple[str, List[str]]]
    # OOV word symbol
    oov: str
    # optional silence phone symbol
    optional_silence: str
    # a list of silence phone symbols (maybe we should call them special symbols? spoken noise is not really silence)
    silence_phones: List[str]
    # a list of nonsilence phone symbols
    nonsilence_phones: List[str]

    @property
    def words(self) -> List[str]:
        """A sorted list of unique words in Dict. Includes <eps>, #0, <s> and </s>"""

    @property
    def phones(self) -> List[str]:
        """A sorted list of unique phones in Dict."""
and methods:
    def save(self, path):
        """Save into a file or a directory (maybe same as Kaldi's data dir)"""

    @classmethod
    def load(cls, path) -> 'Dict':
        """Read all the information from a path"""

    def compile_lexicon_fst(self) -> k2.Fsa:
        """Adds disambiguation symbols and compiles L.fst"""

    def extend(self, lexicon: List[Tuple[str, List[str]]]) -> None:
        """Adds new words and their corresponding phone transcripts into Dict. Checks for compatibility with the phone set."""
Kaldi's prepare_lang.sh has accumulated a lot of options, so I'd like to get some feedback on which of them are useful to keep and which are not:
- num sil/nonsil states and share_silence_phones are currently unused and probably not needed anymore?
- position dependent phones seems superficial in our current setups, not sure if it'll be useful?
- could unk-fst be still useful?
- silprob/sil_prob - is it worth supporting it?
We can of course start from something minimal and extend it... It does seem like a substantial amount of work but I think it's worth it and I can give it a shot, or at least lay some groundwork. What do you guys think? Also, I want to make sure I wouldn't be duplicating anybody's effort.
num sil/nonsil states relates to the topo, so probably doesn't belong in the dict.
If we do use word-position dependent phones, we'd probably want to simplify them into 2 classes instead of 5. I did some experiments with them but saw no clear gain, but this could be revisited. However, it can be done simply as a transformation on the dict, doubling the phone-set size, so may not need to be represented in the Dict itself.
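As an illustration of doing this as a plain transformation on the dict, here is a rough sketch. The two classes are an assumption on my side (the comment above does not say which two): "_B" marks the first phone of a word and "_I" marks every other phone. Leaving silence phones unmarked is also an assumption, and the function name is hypothetical.

```python
from typing import List, Tuple

Lexicon = List[Tuple[str, List[str]]]


def add_word_position(lexicon: Lexicon, silence_phones: List[str]) -> Lexicon:
    """Return a new lexicon whose non-silence phones carry a position suffix.

    Assumed 2-class scheme: "_B" on the first phone of a word, "_I" elsewhere.
    """
    silence = set(silence_phones)
    result = []
    for word, pron in lexicon:
        marked = [
            p if p in silence else p + ("_B" if i == 0 else "_I")
            for i, p in enumerate(pron)
        ]
        result.append((word, marked))
    return result
```

Since the input lexicon is not modified, the Dict itself can stay position-independent and the marked version can be produced only when it is needed.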
unk-fst may still be useful, I guess, but I think we can leave it separate from Dict, for now at least.
silprob: my feeling is we may not need it since it can just be absorbed into the probability of silence in the acoustic model (if we're training with LF-MMI and other sequence criteria, removing it shouldn't remove any modeling power).
There is even a question whether silence_phones / nonsilence_phones belong in the Dict. It's not clear what uses we have for them right now. We do need the opt_sil, though, so we can turn the Dict into an Fst (note: None should be allowable).
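To show why only the optional silence is needed for this step, here is a rough sketch that turns a lexicon into a list of transducer arcs, with None meaning "no optional silence". It loosely follows the usual Kaldi-style topology (start, loop and silence states), leaves out weights and disambiguation symbols, and uses hypothetical names throughout; it does not call k2.

```python
from typing import List, Optional, Tuple

Arc = Tuple[int, int, str, str]  # (source state, destination state, phone, word)
EPS = "<eps>"


def lexicon_to_arcs(
    lexicon: List[Tuple[str, List[str]]], optional_silence: Optional[str]
) -> Tuple[List[Arc], int]:
    """Return (arcs, final_state) for an unweighted lexicon transducer."""
    arcs: List[Arc] = []
    if optional_silence is None:
        # A single state that is both the start and the loop state.
        loop_state, sil_state, next_state = 0, None, 1
    else:
        start_state, loop_state, sil_state, next_state = 0, 1, 2, 3
        arcs.append((start_state, loop_state, EPS, EPS))  # no initial silence
        arcs.append((start_state, sil_state, EPS, EPS))   # initial silence
        arcs.append((sil_state, loop_state, optional_silence, EPS))
    for word, pron in lexicon:
        assert pron, f"empty pronunciation for {word!r}"
        src = loop_state
        for i, phone in enumerate(pron):
            olabel = word if i == 0 else EPS  # emit the word on the first phone
            if i + 1 < len(pron):
                arcs.append((src, next_state, phone, olabel))
                src = next_state
                next_state += 1
            else:
                # Last phone: return to the loop state, or go via silence.
                arcs.append((src, loop_state, phone, olabel))
                if sil_state is not None:
                    arcs.append((src, sil_state, phone, olabel))
    return arcs, loop_state
```

Nothing in this construction looks at silence_phones or nonsilence_phones, which is consistent with the question of whether they belong in the Dict at all.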
Also: turning the Dict into an Fsa may not be the most efficient method of graph-building (at least for supervisions). One possibility is to turn the Dict into an FsaVec and introduce a new indexing operation whereby an FsaVec can be indexed by an Fsa or FsaVec. The idea is this: an expression a[b] gives you something with the top-level structure of b, but where each arc in b with a label x is replaced by the Fsa a[x], with the start-state and final-state of a[x] being identified with the source-state and destination-state of the arc in b, and any additional states in a[x] being inserted somewhere in the result (e.g. just after the source-state of the arc). I would propose to have epsilon be treated as a normal symbol, with element 0 of a being what we replace epsilon arcs with (it would likely be just a single arc from start-state to final-state); the last element of a being used when the symbol in b is -1; and -1 arcs in a being replaced with 0 if their destination-state in a[b] is not a final-state. This way, we could put the optional silence at the start of all the individual FSAs, and the final FSA in a would also have the optional silence which may be present at end-of-sentence.
@csukuangfj I'll talk to Kangwei about doing this, it would be a good first project that can let him understand the basics of k2 C++ programming.
@csukuangfj I'll talk to Kangwei about doing this,
Cool!
If we do use word-position dependent phones, we'd probably want to simplify them into 2 classes instead of 5. I did some experiments with them but saw no clear gain, but this could be revisited. However, it can be done simply as a transformation on the dict, doubling the phone-set size, so may not need to be represented in the Dict itself.
Even if there's no WER gain with position dependent phones, it's useful for fast lattice alignment.
@csukuangfj I'll talk to Kangwei about doing this, it would be a good first project that can let him understand the basics of k2 C++ programming.
Cool! In that case, I won't start working on it to avoid duplicated effort.
Clarification: For Kangwei, I was just talking about implementing that function involving indexing a FsaVec with Fsas or FsaVecs, in k2. This shouldn't be necessary for this lexicon-building Python code since we'll anyway need a way to turn the whole thing into an Fsa. If you feel like implementing that indexing thing, I won't say no, since Kangwei won't join us for a couple of weeks.
On Wed, May 12, 2021 at 9:52 PM Piotr Żelasko @.***> wrote:
@csukuangfj https://github.com/csukuangfj I'll talk to Kangwei about doing this, it would be a good first project that can let him understand the basics of k2 C++ programming.
Cool! In that case, I won't start working on it to avoid duplicated effort.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/snowfall/issues/191#issuecomment-839790285, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZFLOYRSQMM6XXGHDQYDM3TNKBZFANCNFSM44WAIS5Q .
Could be a good opportunity to get more familiar with k2's C++ code. I'll start with the Python part and let's see then.
Ping me next week if you won't be interested in working on it. I think I would be, but this week I'm kinda overwhelmed by work. Y.
On Wed, May 12, 2021 at 11:38 Piotr Żelasko @.***> wrote:
Could be a good opportunity to get more familiar with k2's C++ code. I'll start with the Python part and let's see then.
I can handle the C++. Up to you. I would like to pick up some task on k2, though. Y.
On Wed, May 12, 2021 at 11:42 Jan Yenda Trmal @.***> wrote:
Ping me next week if you won't be interested in working on it. I think I would be, but this week I'm kinda overwhelmed by work. Y.
Perhaps both of you could collaborate on it, e.g. one of you writes the structure and the other fills it in?
I suggest the following C++ interface (a little different from what I said before):
/*
  Replace, in `index`, labels symbol_range_begin <= label < symbol_range_begin + src.Dim0()
  with the Fsa indexed `label - symbol_range_begin` in `src`, identifying the
  source and destination states of the arc in `index` with the initial and
  final states in `src[label - symbol_range_begin]`.
  Arcs with labels outside this range are just copied.  Caution: the result
  may not be a valid Fsa because labels on final-arcs in `src` (which will
  be -1) may end up on non-final arcs in the result; you can use
  FixFinalLabels() to fix this.

    @param [in] src  FsaVec containing individual Fsas that we'll be
                 inserting into the result.  No Fsa in `src` may have arcs
                 entering its initial state; this function will crash if
                 this requirement is violated.
    @param [in] index  Fsa or FsaVec (2 or 3 axes) that dictates the overall
                 structure of the result (the result will have the same
                 number of axes as `index`).
    @param [in] symbol_range_begin  Beginning of the range (interval) of
                 symbols that are to be replaced with Fsas.  Symbols numbered
                 symbol_range_begin <= i < symbol_range_begin + src.Dim0()
                 will be replaced with the Fsa in
                 `src[i - symbol_range_begin]`.
    @param [out,optional] arc_map_src  If not nullptr, will be set to a new
                 array that maps from arc-indexes in the result to the
                 corresponding arc in `src`, or -1 if there was no such arc
                 (for out-of-range symbols in `index`).
    @param [out,optional] arc_map_index  If not nullptr, will be set to a
                 new array that maps from arc-indexes in the result to the
                 arc in `index` that it originates from (only if it includes
                 the weight from that arc in `index`; and -1 otherwise).
                 For arcs that result from inserting an Fsa in `src` (say,
                 src[i]), they include the weight from the arc in `index`
                 if the arc was from the initial state in src[i].
*/
FsaOrVec ReplaceFsa(FsaVec src, FsaOrVec index, int32_t symbol_range_begin,
                    Array1<int32_t> *arc_map_src = nullptr,
                    Array1<int32_t> *arc_map_index = nullptr);
Since we have an extra option symbol_range_begin, I suppose it might make sense to just make this a separate function/op at the Python level, like replace_fsa(), rather than trying to make it part of a generic indexing function.
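For anyone who wants to check their understanding of the intended semantics, here is a pure-Python sketch of the operation for a single index Fsa. It is not the k2 implementation or API: Arc, Fsa and replace_fsa_sketch are hypothetical stand-ins, each Fsa is assumed to have at least two states with state 0 as start and the last state as final, and the extra states are numbered just before the final state (instead of right after the arc's source state) to keep the code short.

```python
from typing import List, NamedTuple, Optional, Tuple


class Arc(NamedTuple):
    src: int
    dst: int
    label: int
    score: float


class Fsa(NamedTuple):
    num_states: int  # state 0 is the start state, num_states - 1 the final state
    arcs: List[Arc]


def replace_fsa_sketch(
    src: List[Fsa], index: Fsa, symbol_range_begin: int
) -> Tuple[Fsa, List[int], List[int]]:
    """Return (result, arc_map_src, arc_map_index) for a single `index` Fsa."""

    def lookup(label: int) -> Optional[int]:
        k = label - symbol_range_begin
        return k if 0 <= k < len(src) else None

    # Offset of each Fsa's first arc in a flat numbering of all arcs in `src`.
    offsets = [0]
    for fsa in src:
        offsets.append(offsets[-1] + len(fsa.arcs))

    # Count the extra (non-start, non-final) states so the final state stays last.
    replaced = [lookup(a.label) for a in index.arcs]
    num_extra = sum(src[k].num_states - 2 for k in replaced if k is not None)
    old_final = index.num_states - 1
    new_final = old_final + num_extra

    def remap(state: int) -> int:
        return new_final if state == old_final else state

    next_extra = old_final
    rows = []  # (arc, arc_map_src entry, arc_map_index entry)
    for i, (arc, k) in enumerate(zip(index.arcs, replaced)):
        if k is None:  # label out of range: copy the arc unchanged
            copied = Arc(remap(arc.src), remap(arc.dst), arc.label, arc.score)
            rows.append((copied, -1, i))
            continue
        fsa = src[k]
        state_map = {0: remap(arc.src), fsa.num_states - 1: remap(arc.dst)}
        for s in range(1, fsa.num_states - 1):
            state_map[s] = next_extra
            next_extra += 1
        for j, inner in enumerate(fsa.arcs):
            assert inner.dst != 0, "arc entering the initial state of a src Fsa"
            from_start = inner.src == 0
            # Only arcs leaving the start state absorb the weight from `index`.
            score = inner.score + (arc.score if from_start else 0.0)
            new_arc = Arc(state_map[inner.src], state_map[inner.dst], inner.label, score)
            rows.append((new_arc, offsets[k] + j, i if from_start else -1))

    rows.sort(key=lambda row: row[0].src)  # stable sort: group arcs by source state
    result = Fsa(new_final + 1, [row[0] for row in rows])
    return result, [row[1] for row in rows], [row[2] for row in rows]
```

If a src Fsa ends with a -1 arc, that label is copied as-is and can end up on an arc that no longer enters the final state, which is the case the caution in the comment describes; the sketch does not repair it.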
Note on edits I just made: I removed RepairFinalSymbols() from the draft because there is now a FixFinalLabels() function that does the same thing; I added the requirement that FSAs in src may not have arcs entering their initial state; and I simplified a comment about arc_map_index.
On Wed, May 12, 2021 at 11:43 PM jtrmal @.***> wrote:
I can handle the C++. Up to you. I would like to pick up some task on k2, though. Y.
And I haven't given a thought to the right way to handle auxiliary labels here. The easiest way is probably to have them inherited from index in the same way as the weights (via arc_map_index) and say they are disallowed in src for now.
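A tiny sketch of what "inherited via arc_map_index" could look like, using plain Python lists instead of k2 arrays. The function name and the choice of 0 (epsilon) for arcs that do not map back to index are assumptions.

```python
from typing import List


def inherit_aux_labels(
    index_aux_labels: List[int], arc_map_index: List[int], fill: int = 0
) -> List[int]:
    """Copy aux labels from `index` to the result; arcs mapped to -1 get `fill`."""
    return [index_aux_labels[i] if i >= 0 else fill for i in arc_map_index]
```

With this rule each aux label from index appears on exactly one arc of every path through the inserted Fsa, because only arcs leaving its initial state map back to index.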
On Thu, May 13, 2021 at 11:58 AM Daniel Povey @.***> wrote:
Perhaps both of you could collaborate on it, e.g someone write the structure and someone fill it in?
I suggest the following C++ interface (a little different from what I said before):
/*
  Replace, in `index`, labels symbol_range_begin <= label < symbol_range_begin + src.Dim0()
  with the Fsa indexed `label - symbol_range_begin` in `src`, identifying the
  source and destination states of the arc in `index` with the initial and
  final states in `src[label - symbol_range_begin]`.
  Arcs with labels outside this range are just copied.  Caution: the result
  may not be a valid Fsa because labels on final-arcs in `src` (which will
  likely be -1) may end up on non-final arcs in the result; you can use
  RepairFinalSymbols() to fix this.

    @param [in] src  FsaVec containing individual Fsas that we'll be
                 inserting into the result.
    @param [in] index  Fsa or FsaVec (2 or 3 axes) that dictates the overall
                 structure of the result (the result will have the same
                 number of axes as `index`).
    @param [in] symbol_range_begin  Beginning of the range (interval) of
                 symbols that are to be replaced with Fsas.  Symbols numbered
                 symbol_range_begin <= i < symbol_range_begin + src.Dim0()
                 will be replaced with the Fsa in
                 `src[i - symbol_range_begin]`.
    @param [out,optional] arc_map_src  If not nullptr, will be set to a new
                 array that maps from arc-indexes in the result to the
                 corresponding arc in `src`, or -1 if there was no such arc
                 (for out-of-range symbols in `index`).
    @param [out,optional] arc_map_index  If not nullptr, will be set to a
                 new array that maps from arc-indexes in the result to the
                 arc in `index` that it originates from (only if it includes
                 the weight from that arc in `index`; and -1 otherwise).
                 For arcs that result from inserting an Fsa in `src` (say,
                 src[i]), they include the weight from the arc in `index`
                 if one of the following two conditions is true:
                   - The arc was from the initial state in src[i], and
                     src[i] has no arcs entering its initial state
                   - The arc was to the final state in src[i], and src[i]
                     has at least one arc entering its initial state
*/
FsaOrVec ReplaceFsa(FsaVec src, FsaOrVec index, int32_t symbol_range_begin,
                    Array1<int32_t> *arc_map_src = nullptr,
                    Array1<int32_t> *arc_map_index = nullptr);

/*
  Ensures that labels on final-arcs in `a` are -1, and replaces labels on
  non-final arcs in `a` with `nonfinal_label`.
*/
void RepairFinalSymbols(FsaOrVec *a, int32_t nonfinal_label = 0);

You can decide whether to expose RepairFinalSymbols to Python via _k2, or make it part of the Python-level interface of ReplaceFsa.
Since we have an extra option symbol_range_begin, I suppose it might make sense to just make this a separate function/op at the Python level, like replace_fsa(), rather than trying to make it part of a generic indexing function.
@jtrmal I won't find the time to work on it this week -- if you want to, feel free to start (just let me know if you do).
OK, I'm trying to get started with the C++ -- you can catch up from Python direction in a week or two, I don't think I will be faster than that. y.
On Mon, May 17, 2021 at 5:12 PM Piotr Żelasko @.***> wrote:
@jtrmal https://github.com/jtrmal I won't find the time to work on it this week -- if you want to, feel free to start (just let me know if you do).
@jtrmal did you make any progress?
On Tue, May 18, 2021 at 5:15 AM jtrmal @.***> wrote:
OK, I'm trying to get started with the C++ -- you can catch up from Python direction in a week or two, I don't think I will be faster than that. y.
I have something in C++, will try to make PR next week -- I will be traveling over the weekend. y.
On Fri, May 28, 2021 at 12:15 AM Daniel Povey @.***> wrote:
@jtrmal did you make any progress?
Great!
On Fri, May 28, 2021 at 7:49 PM jtrmal @.***> wrote:
I have something in C++, will try to make PR next week -- I will be traveling over the weekend. y.