Building lexicons in Python

Open pzelasko opened this issue 4 years ago • 16 comments

The current setup inherits building lexicon FSTs from Kaldi. I think it makes sense to have the ability to build it directly in Python, which should make building new recipes easier, as well as (eventually) allow for some things like dynamic expansion of the lexicon without leaving Python.

The data structure would basically resemble that of Kaldi, e.g.:

from typing import List, Tuple

class Dict:
  # a list of (word, phone transcript) pairs, possibly with scores to resemble lexiconp.txt
  lexicon: List[Tuple[str, List[str]]]

  # OOV word symbol
  oov: str

  # optional silence phone symbol
  optional_silence: str

  # a list of silence phone symbols (maybe we should call them special symbols? spoken noise is not really silence)
  silence_phones: List[str]

  # a list of nonsilence phone symbols
  nonsilence_phones: List[str]

  @property
  def words(self) -> List[str]:
    """A sorted list of unique words in Dict. Includes <eps>, #0, <s> and </s>"""
  
  @property
  def phones(self) -> List[str]:
    """A sorted list of unique phones in Dict."""

and methods:

def save(self, path):
  """Save into a file or a directory (maybe same as Kaldi's data dir)"""

@classmethod
def load(cls, path) -> 'Dict':
  """Read all the information from a path"""

def compile_lexicon_fst(self) -> k2.Fsa:
  """Adds disambiguation symbols and compiles L.fst"""

def extend(self, lexicon: List[Tuple[str, List[str]]]) -> k2.Fsa:
  """Adds new words and their corresponding phone transcripts into Dict. Checks for compatibility with the phone set."""

Kaldi's prepare_lang.sh has accumulated a lot of options, so I'd like to get some feedback on which of them are useful to keep and which are not:

  • num sil/nonsil states and share_silence_phones are currently unused and probably not needed anymore?
  • position-dependent phones seem superfluous in our current setups, not sure if they'll be useful?
  • could unk-fst be still useful?
  • silprob/sil_prob - is it worth supporting it?

We can of course start from something minimal and extend it... It does seem like a substantial amount of work but I think it's worth it and I can give it a shot, or at least lay some groundwork. What do you guys think? Also, I want to make sure I wouldn't be duplicating anybody's effort.

pzelasko avatar May 11 '21 16:05 pzelasko

num sil/nonsil states relates to the topo, so probably doesn't belong in the dict.

If we do use word-position dependent phones, we'd probably want to simplify them into 2 classes instead of 5. I did some experiments with them but saw no clear gain, but this could be revisited. However, it can be done simply as a transformation on the dict, doubling the phone-set size, so may not need to be represented in the Dict itself.
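To make the "transformation on the dict" idea concrete, here is a small sketch. The comment above does not say which two classes are meant, so this assumes "word-initial" versus "not word-initial", with hypothetical _B and _I suffixes; the function name and the unmarked parameter (for phones such as silence that should stay as they are) are also made up for illustration.

```python
from typing import AbstractSet, List, Tuple

Lexicon = List[Tuple[str, List[str]]]


def to_position_dependent(
    lexicon: Lexicon, unmarked: AbstractSet[str] = frozenset()
) -> Lexicon:
    """Return a new lexicon where every phone is tagged as word-initial
    (_B) or not (_I). Phones listed in `unmarked` are copied unchanged."""
    out: Lexicon = []
    for word, phones in lexicon:
        marked = []
        for i, phone in enumerate(phones):
            if phone in unmarked:
                marked.append(phone)
            else:
                marked.append(phone + ("_B" if i == 0 else "_I"))
        out.append((word, marked))
    return out
```

Each markable phone then has exactly two variants, which is where the doubling of the phone-set size comes from, and the Dict itself needs no extra field for it.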

unk-fst may still be useful, I guess, but I think we can leave it separate from Dict, for now at least.

silprob: my feeling is we may not need it since it can just be absorbed into the probability of silence in the acoustic model (if we're training with LF-MMI and other sequence criteria, removing it shouldn't remove any modeling power).

There is even a question whether the silence_phones / nonsilence_phones belongs in the Dict. It's not clear what uses we have for that right now. We do need the opt_sil, though, so we can turn the Dict into an Fst (note: None should be allowable).
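A sketch of how opt_sil could feed into the Dict-to-Fst conversion, with None allowed. This is plain Python that only produces a list of arcs; it is a simplification (a single loop state, silence as a self-loop on it, no silence probabilities, no final arc) and not the structure Kaldi or snowfall actually builds. The names lexicon_to_arcs, Arc and EPS are hypothetical.

```python
from typing import List, Optional, Tuple

Arc = Tuple[int, int, str, str]  # (src_state, dst_state, phone, word)
Lexicon = List[Tuple[str, List[str]]]
EPS = "<eps>"


def lexicon_to_arcs(
    lexicon: Lexicon, opt_sil: Optional[str]
) -> Tuple[List[Arc], int]:
    """Return (arcs, num_states) for a toy lexicon transducer.

    State 0 is the loop state: every pronunciation starts and ends there.
    The word label sits on the first arc of each pronunciation.
    """
    loop_state = 0
    next_state = 1
    arcs: List[Arc] = []
    if opt_sil is not None:
        # Optional silence between words, modelled as a self-loop.
        arcs.append((loop_state, loop_state, opt_sil, EPS))
    for word, phones in lexicon:
        if not phones:
            raise ValueError(f"empty pronunciation for {word!r}")
        src = loop_state
        for i, phone in enumerate(phones):
            if i == len(phones) - 1:
                dst = loop_state  # last phone returns to the loop state
            else:
                dst = next_state
                next_state += 1
            arcs.append((src, dst, phone, word if i == 0 else EPS))
            src = dst
    return arcs, next_state
```

Passing opt_sil=None simply omits the silence arc, so the same code path covers lexicons with and without optional silence.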

Also: turning the Dict into an Fsa may not be the most efficient method of graph-building (at least for supervisions). One possibility is to turn the Dict into an FsaVec and introduce a new indexing operation whereby an FsaVec can be indexed by an Fsa or FsaVec. The idea is that an expression a[b] gives you something with the top-level structure of b, but where each arc in b with a label x is replaced by the Fsa a[x], with the start-state and final-state of a[x] being identified with the source-state and destination-state of the arc in b, and any additional states in a[x] being inserted somewhere in the result (e.g. just after the source-state of the arc). I would propose that epsilon be treated as a normal symbol, with element 0 of a being what we replace epsilon arcs with (likely just a single arc from start-state to final-state); that the last element of a be used when the symbol in b is -1; and that -1 labels on arcs coming from a be replaced with 0 if their destination-state in a[b] is not a final-state. This way, we could put the optional silence at the start of all the individual FSAs, and the final FSA in a would also have the optional silence that may be present at end-of-sentence.

danpovey avatar May 12 '21 04:05 danpovey

@csukuangfj I'll talk to Kangwei about doing this, it would be a good first project that can let him understand the basics of k2 C++ programming.

danpovey avatar May 12 '21 04:05 danpovey

@csukuangfj I'll talk to Kangwei about doing this,

Cool!

csukuangfj avatar May 12 '21 06:05 csukuangfj

If we do use word-position dependent phones, we'd probably want to simplify them into 2 classes instead of 5. I did some experiments with them but saw no clear gain, but this could be revisited. However, it can be done simply as a transformation on the dict, doubling the phone-set size, so may not need to be represented in the Dict itself.

Even if there's no WER gain with position dependent phones, it's useful for fast lattice alignment.

francisr avatar May 12 '21 13:05 francisr

@csukuangfj I'll talk to Kangwei about doing this, it would be a good first project that can let him understand the basics of k2 C++ programming.

Cool! In that case, I won't start working on it to avoid duplicated effort.

pzelasko avatar May 12 '21 13:05 pzelasko

Clarification: For Kangwei, I was just talking about implementing that function involving indexing a FsaVec with Fsas or FsaVecs, in k2. This shouldn't be necessary for this lexicon-building Python code since we'll anyway need a way to turn the whole thing into an Fsa. If you feel like implementing that indexing thing, I won't say no, since Kangwei won't join us for a couple of weeks.

danpovey avatar May 12 '21 15:05 danpovey

Could be a good opportunity to get more familiar with k2's C++ code. I'll start with the Python part and let's see then.

pzelasko avatar May 12 '21 15:05 pzelasko

Ping me next week if you won't be interested in working on it. I think I would be, but this week I'm kinda overwhelmed by work. Y.

jtrmal avatar May 12 '21 15:05 jtrmal

I can handle the C++. Up to you. I would like to pick up some task on k2, though. Y.

jtrmal avatar May 12 '21 15:05 jtrmal

Perhaps both of you could collaborate on it, e.g. one of you writes the structure and the other fills it in?

I suggest the following C++ interface (a little different from what I said before):

/*
   Replace, in `index`, labels symbol_range_begin <= label <
   symbol_range_begin + src.Dim0() with the Fsa indexed
   `label - symbol_range_begin` in `src`, identifying the source and
   destination states of the arc in `index` with the initial and final
   states in `src[label - symbol_range_begin]`.
   Arcs with labels outside this range are just copied.  Caution: the result
   may not be a valid Fsa because labels on final-arcs in `src` (which will
   be -1) may end up on non-final arcs in the result; you can use
   FixFinalLabels() to fix this.

   @param [in] src  FsaVec containing individual Fsas that we'll be inserting
                into the result.  No Fsa in `src` may have arcs entering its
                initial state; this function will crash if this requirement
                is violated.
   @param [in] index  Fsa or FsaVec (2 or 3 axes) that dictates the overall
                structure of the result (the result will have the same
                number of axes as `index`).
   @param [in] symbol_range_begin  Beginning of the range (interval) of
                symbols that are to be replaced with Fsas.  Symbols numbered
                symbol_range_begin <= i < symbol_range_begin + src.Dim0()
                will be replaced with the Fsa in
                `src[i - symbol_range_begin]`.
   @param [out,optional] arc_map_src  If not nullptr, will be set to a new
                array that maps from arc-indexes in the result to the
                corresponding arc in `src`, or -1 if there was no such arc
                (for out-of-range symbols in `index`).
   @param [out,optional] arc_map_index  If not nullptr, will be set to a new
                array that maps from arc-indexes in the result to the arc in
                `index` that it originates from, but only if it includes the
                weight from that arc in `index`, and -1 otherwise.  Arcs that
                result from inserting an Fsa in `src` (say, src[i]) include
                the weight from the arc in `index` if they leave the initial
                state of src[i].
*/
FsaOrVec ReplaceFsa(FsaVec src, FsaOrVec index, int32_t symbol_range_begin,
                    Array1<int32_t> *arc_map_src = nullptr,
                    Array1<int32_t> *arc_map_index = nullptr);

Since we have an extra option symbol_range_begin, I suppose it might make sense to just make this a separate function/op at the Python level, like replace_fsa(), rather than trying to make it part of a generic indexing function.

Note on edits I just made: I removed RepairFinalSymbols() from the draft because there is now a FixFinalLabels() function that does the same thing; and I added the requirement that FSAs in src may not have arcs entering their initial state; and I simplified a comment about arc_map_index.
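To check the semantics of the proposed interface on paper, here is a toy pure-Python model of it. It is only a sketch under several assumptions: ToyFsa, replace_fsa and fix_final_labels are hypothetical stand-ins (not the k2 functions), state 0 is always the start state, the final state is stored explicitly instead of being the last state, new states are appended at the end, and scores are combined by addition.

```python
from dataclasses import dataclass
from typing import List, Tuple

Arc = Tuple[int, int, int, float]  # (src_state, dst_state, label, score)


@dataclass
class ToyFsa:
    num_states: int
    final: int  # index of the final state; state 0 is the start state
    arcs: List[Arc]


def replace_fsa(src: List[ToyFsa], index: ToyFsa, symbol_range_begin: int):
    """Replace in-range labels of `index` by the matching Fsa from `src`."""
    # Flat arc offsets, so arc_map_src can address any arc in `src`.
    offsets = [0]
    for fsa in src:
        offsets.append(offsets[-1] + len(fsa.arcs))

    arcs: List[Arc] = []
    arc_map_src: List[int] = []
    arc_map_index: List[int] = []
    next_state = index.num_states
    for i, (s, d, label, score) in enumerate(index.arcs):
        k = label - symbol_range_begin
        if not 0 <= k < len(src):
            # Out-of-range label: the arc is copied as is.
            arcs.append((s, d, label, score))
            arc_map_src.append(-1)
            arc_map_index.append(i)
            continue
        sub = src[k]
        # Start/final of src[k] are identified with the arc's end points;
        # every other state of src[k] becomes a fresh state.
        state_map = {0: s, sub.final: d}
        for st in range(sub.num_states):
            if st not in state_map:
                state_map[st] = next_state
                next_state += 1
        for j, (a, b, sub_label, sub_score) in enumerate(sub.arcs):
            if b == 0:
                raise ValueError("src Fsas may not have arcs entering state 0")
            from_start = a == 0
            new_score = sub_score + (score if from_start else 0.0)
            arcs.append((state_map[a], state_map[b], sub_label, new_score))
            arc_map_src.append(offsets[k] + j)
            arc_map_index.append(i if from_start else -1)
    return ToyFsa(next_state, index.final, arcs), arc_map_src, arc_map_index


def fix_final_labels(fsa: ToyFsa, nonfinal_label: int = 0) -> None:
    """Force label -1 on arcs into the final state, and only there."""
    fixed = []
    for a, b, label, score in fsa.arcs:
        if b == fsa.final:
            label = -1
        elif label == -1:
            label = nonfinal_label
        fixed.append((a, b, label, score))
    fsa.arcs = fixed
```

In this model, the -1 labels that come from the final arcs of the inserted Fsas land on non-final arcs of the result, which is the case the caution in the interface comment describes; the toy fix_final_labels then turns them into 0.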

danpovey avatar May 13 '21 03:05 danpovey

And I haven't given a thought to the right way to handle auxiliary labels here. The easiest way is probably to have them inherited from index in the same way as the weights (via arc_map_index), and to say they are disallowed in src for now.

danpovey avatar May 13 '21 04:05 danpovey

@jtrmal I won't find the time to work on it this week -- if you want to, feel free to start (just let me know if you do).

pzelasko avatar May 17 '21 21:05 pzelasko

OK, I'm trying to get started with the C++ -- you can catch up from Python direction in a week or two, I don't think I will be faster than that. y.

jtrmal avatar May 17 '21 21:05 jtrmal

@jtrmal did you make any progress?

danpovey avatar May 28 '21 04:05 danpovey

I have something in C++ and will try to make a PR next week -- I will be traveling over the weekend. y.

jtrmal avatar May 28 '21 11:05 jtrmal

Great!

danpovey avatar May 28 '21 12:05 danpovey