snowfall
How to combine the neural net log-softmax outputs and an FSA
Hello, I am reading train.py and decode.py. It is difficult for me to understand how the neural net log-softmax outputs are combined with an FSA. Could you provide some papers or a description to help me understand? Thanks. Here is the code I don't understand:
dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision_segments)
You can find some descriptions about it by visiting the following two links:
- https://github.com/k2-fsa/k2/blob/2dbb3e09b152fcf98354c946baa271e5b57c8321/k2/csrc/fsa.h#L114
/*
Vector of FSAs that actually will come from neural net log-softmax outputs (or
similar).
Conceptually this is a 3-dimensional tensor of log-probs with the second
dimension ragged, i.e. the shape would be [ num_fsas, None, num_symbols+1 ],
e.g. if this were a TF ragged tensor. The indexing would be
[fsa_idx,t,symbol+1], where the "+1" after the symbol is so that we have
somewhere to put the output for symbol == -1 (remember, -1 is kFinalSymbol,
used on the last frame).
Also, if a particular FSA has T frames of neural net output, we actually
have T+1 potential indexes, 0 through T, so there is space for the terminating
final-symbol on frame T. (On the last frame, the final symbol has
logprob=0, the others have logprob=-inf).
*/
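To make the layout described in that comment concrete, here is a small hand-built sketch (plain Python, with made-up log-prob values) for one FSA with T = 2 frames and 3 real symbols: the score matrix has T+1 rows and num_symbols+1 columns, where column 0 holds the score for symbol -1 (kFinalSymbol), which is 0 on the extra last frame and -inf everywhere else.

```python
import math

T = 2            # frames of neural net output for this FSA
num_symbols = 3  # real output symbols 0..2

NEG_INF = -math.inf

# Column 0 stores the score for symbol -1 (kFinalSymbol);
# column s+1 stores the score for real symbol s.
scores = []
for t in range(T):
    # Ordinary frames: the final symbol is impossible; the real
    # symbols get (made-up) log-probs from the log-softmax output.
    scores.append([NEG_INF, -0.1, -2.3, -4.5])

# Extra frame T: only the final symbol is allowed, with logprob = 0.
scores.append([0.0] + [NEG_INF] * num_symbols)

assert len(scores) == T + 1               # T+1 potential indexes, 0..T
assert len(scores[0]) == num_symbols + 1  # "+1" column for symbol == -1
```

This is only an illustration of the indexing scheme; in k2 the actual storage is a contiguous tensor managed by `DenseFsaVec`.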
- https://github.com/k2-fsa/k2/blob/2dbb3e09b152fcf98354c946baa271e5b57c8321/k2/python/k2/dense_fsa_vec.py#L15
class DenseFsaVec(object):

    def __init__(self, log_probs: torch.Tensor,
                 supervision_segments: torch.Tensor) -> None:
        '''Construct a DenseFsaVec from neural net log-softmax outputs.

        Args:
          log_probs:
            A 3-D tensor of dtype ``torch.float32`` with shape ``(N, T, C)``,
            where ``N`` is the number of sequences, ``T`` the maximum input
            length, and ``C`` the number of output classes.
          supervision_segments:
            A 2-D **CPU** tensor of dtype ``torch.int32`` with 3 columns.
            Each row contains information for a supervision segment. Column 0
            is the ``sequence_index`` indicating which sequence this segment
            comes from; column 1 specifies the ``start_frame`` of this segment
            within the sequence; column 2 contains the ``duration`` of this
            segment.

        Note:
          - ``0 < start_frame + duration <= T``
          - ``0 <= start_frame < T``
          - ``duration > 0``
        '''
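As a sketch of what a valid ``supervision_segments`` argument looks like (values are made up for illustration; in practice lhotse derives them from the supervisions attached to each cut, and the real argument is a ``torch.int32`` tensor rather than a list):

```python
# A toy batch: N = 2 sequences, each padded to T = 10 frames.
T = 10

# One row per supervision segment:
#   [sequence_index, start_frame, duration]
supervision_segments = [
    [0, 0, 10],  # sequence 0: one supervision covering all 10 frames
    [1, 0, 6],   # sequence 1: first supervision, frames 0..5
    [1, 6, 4],   # sequence 1: second supervision, frames 6..9
]

# Check the constraints stated in the docstring.
for seq_idx, start_frame, duration in supervision_segments:
    assert 0 <= start_frame < T
    assert duration > 0
    assert 0 < start_frame + duration <= T
```

Note that a single sequence may contribute several rows, one per supervision, which is exactly why the segment info is needed to slice ``nnet_output``.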
Mm, thanks, I have seen these two materials, but they are not enough for me. Could you provide other material?
I am writing tutorials for k2. Please just wait for a few days.
@Curisan let me just add some notes in case you are eager to learn about this before fangjun's documentation is ready.
When we train or decode, we usually feed data into the nnet model batch by batch. We prepare the batch data with ``K2SpeechRecognitionIterableDataset`` in lhotse:
https://github.com/lhotse-speech/lhotse/blob/08c31c3bd2711d4b6c614d64a1d3c26abb892a37/lhotse/dataset/speech_recognition.py#L86-L94
You can see that a batch is a set of ``Cut``s, and each ``Cut`` may have multiple supervisions. So the question we have now is: after feeding features of shape ``(N, T, C_feature)`` into the nnet and getting ``nnet_output`` of shape ``(N, T, C_nnet_output)``, we need to know which part of ``nnet_output`` corresponds to each supervision, right? This is exactly what ``k2.DenseFsaVec(nnet_output, supervision_segments)`` does. ``supervision_segments`` gives the ``seq_idx`` (which indexes ``N`` in ``nnet_output``), ``start_frame``, and ``num_frames`` (which index ``T`` in ``nnet_output``), so with that information we can easily get the part of ``nnet_output`` for each supervision in ``DenseFsaVec``. (Of course, if we do subsampling in the model, e.g. in a TDNN, we need to apply the same subsampling to ``start_frame`` and ``num_frames`` as well.)
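The subsampling remark above can be sketched in plain Python as follows (a simplified illustration with a made-up subsampling factor; the exact mapping depends on the model's strides and padding):

```python
# Suppose the model subsamples the time axis by a factor of 4
# (e.g. a strided TDNN), so input frame t maps roughly to
# frame t // 4 of nnet_output.
subsampling_factor = 4

def subsample_segment(start_frame, num_frames, factor):
    """Map a segment given in input frames to nnet_output frames.

    A simplified sketch; real models may need to account for
    padding and edge effects at segment boundaries.
    """
    new_start = start_frame // factor
    new_num = num_frames // factor
    return new_start, max(new_num, 1)  # keep at least one frame

# A supervision covering input frames 100..259 (duration 160)
# becomes nnet_output frames 25..64 (duration 40).
start, num = subsample_segment(100, 160, subsampling_factor)
```

The point is simply that ``supervision_segments`` must be expressed in the frame rate of ``nnet_output``, not of the input features.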
Then in ``DenseFsaVec``, for each supervision (with the corresponding part of ``nnet_output``) we create a ``DenseFsa``. (Hopefully you have understood the format of ``DenseFsa`` from the documentation in ``fsa.h``, but you can also view it as a normal FSA; they are equivalent from the perspective of the FSA concept.) The next step is to call ``intersect_(pruned)`` to intersect the ``DenseFsa`` with the ``decoding_graph`` to get the lattice, and then get the ``tot_scores`` or ``best_path`` for training or decoding.
You may want to check the test code in k2/python/tests or the test code in lhotse to get to know the data formats well. Feel free to ping us if there's any question.
Thank you very much.
@Curisan There is some documentation about the dense FSA vector available at https://k2.readthedocs.io/en/latest/core_concepts/index.html
Please let us know whether it is clear or needs more clarification.
Great!
Could you provide some papers or description about that to help me understand
Here is a paper I just found that is relevant to it:
- Generating exact lattices in the WFST framework, https://www.danielpovey.com/files/2012_icassp_lattices.pdf
Figure 1 from the paper shows what DenseFsaVec looks like. It is called "the search graph of the utterance" in the paper.
In that paper, the DenseFsaVec would be the "acceptor U describing the acoustic scores of an utterance". In k2, so far we are dealing only with state-level lattices, not determinized lattices. The "search graph of the utterance" (S = U o HCLG) is the result of calling IntersectDensePruned().
I see. Thanks.