snowfall Creating minibatch of supervision more efficiently

@csukuangfj this is something you could help with. Here, the supervision is created via a loop in Python. Previously we added a function, LinearFsas(), that creates an FsaVec from a Ragged<int32_t> representing a list of sequences. To use it, we need to be able to create a Ragged<int32_t>, i.e. RaggedInt, from some kind of Python data structure, e.g. a list of lists of int, or possibly a pair of tensors. This would involve adding an appropriate constructor in PybindRaggedTpl.
And then of course wrap LinearFsas(), and probably LinearFsa() while we're at it.
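
To illustrate the "list of lists of int" versus "pair of tensors" options, here is a minimal pure-Python sketch of flattening nested lists into two flat arrays. It assumes the common ragged layout of row splits plus values; `lists_to_ragged` is a hypothetical helper for illustration, not a k2 function, and plain lists stand in for tensors.

```python
from typing import List, Tuple


def lists_to_ragged(seqs: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Flatten a list of int sequences into (row_splits, values).

    Hypothetical helper: sequence i occupies
    values[row_splits[i]:row_splits[i + 1]].
    """
    row_splits = [0]
    values: List[int] = []
    for seq in seqs:
        values.extend(seq)
        # Record where the next sequence starts.
        row_splits.append(len(values))
    return row_splits, values


row_splits, values = lists_to_ragged([[1, 2, 3], [], [4, 5]])
```

Either form carries the same information, so a constructor could accept the nested list and build the two flat arrays internally.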

This is initially a TODO within k2, not snowfall.

https://github.com/k2-fsa/snowfall/blob/70b8481ef947ee78ed3da5c4689f4d608479131a/egs/librispeech/asr/simple_v1/train.py#L33

Nov 12 '20 09:11 danpovey

I would suggest the following approach:

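import k2


def create_decoding_graph(texts, graph, symbols):
    # Map every transcript to a list of word IDs, replacing
    # out-of-vocabulary words with '<UNK>'.
    word_ids_list = []
    for text in texts:
        filter_text = [
            i if i in symbols._sym2id else '<UNK>' for i in text.split(' ')
        ]
        word_ids = [symbols.get(i) for i in filter_text]
        word_ids_list.append(word_ids)

    # Build the linear FSAs for the whole minibatch in one call.
    fsa = k2.linear_fsa(word_ids_list)
    # Intersect with `graph`, then invert the result in place.
    decoding_graph = k2.intersect(fsa, graph).invert_()
    decoding_graph = k2.add_epsilon_self_loops(decoding_graph)
    return decoding_graph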

k2.linear_fsa supports creating a single FSA as well as a vector of FSAs.

Nov 12 '20 09:11 csukuangfj

A linear FSA is always arc sorted, so I think it is not necessary to sort it before calling k2.intersect.
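
A rough pure-Python sketch of why this holds: in a linear FSA every state has at most one outgoing arc, so there is nothing to reorder. The arc layout below (one arc per label plus a closing arc labeled -1 into the final state) is an assumption about the convention, and `linear_arcs` and `is_arc_sorted` are hypothetical helpers, not k2 calls.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, int, int]  # (src_state, dest_state, label)


def linear_arcs(labels: List[int], final_label: int = -1) -> List[Arc]:
    """Arcs of a linear FSA: state i goes to state i + 1 on labels[i]."""
    arcs = [(i, i + 1, label) for i, label in enumerate(labels)]
    # Assumed convention: one last arc into the final state.
    arcs.append((len(labels), len(labels) + 1, final_label))
    return arcs


def is_arc_sorted(arcs: List[Arc]) -> bool:
    """True if labels never decrease among the arcs leaving each state."""
    last_label: Dict[int, int] = {}
    for src, _, label in arcs:
        if src in last_label and label < last_label[src]:
            return False
        last_label[src] = label
    return True


arcs = linear_arcs([5, 3, 9])
sorted_ok = is_arc_sorted(arcs)
```

The labels 5, 3, 9 are not in increasing order, yet the check passes, because the ordering only matters among arcs that share a source state.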

Nov 12 '20 09:11 csukuangfj

Oh, thanks! And I think we need to cache those graphs, maybe with cut_id + sequence_id as the key?
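
Something like the following sketch is what I have in mind. `GraphCache` and the `build` callback are hypothetical names for illustration; in practice `build` would be whatever creates the decoding graph from a transcript, and a simple string split stands in for it here.

```python
from typing import Any, Callable, Dict, Hashable, Tuple


class GraphCache:
    """Memoize per-utterance graphs keyed by (cut_id, sequence_id)."""

    def __init__(self, build: Callable[[str], Any]):
        self._build = build  # hypothetical graph-construction callback
        self._graphs: Dict[Tuple[Hashable, Hashable], Any] = {}
        self.misses = 0

    def get(self, cut_id: Hashable, sequence_id: Hashable, text: str) -> Any:
        key = (cut_id, sequence_id)
        if key not in self._graphs:
            # Build only on the first request for this key.
            self.misses += 1
            self._graphs[key] = self._build(text)
        return self._graphs[key]


# Stand-in builder: a real one would return a decoding graph.
cache = GraphCache(build=lambda text: text.split(' '))
first = cache.get('cut-1', 0, 'hello world')
second = cache.get('cut-1', 0, 'hello world')
```

The second lookup returns the stored object without calling the builder again.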

Nov 12 '20 09:11 qindazhu

A linear FSA is always arc sorted, so I think it is not necessary to sort it before calling k2.intersect.

That's related to a bug we had before; the sort is mainly there for debugging.

Nov 12 '20 09:11 qindazhu

Sure, caching the graphs makes sense.

Nov 12 '20 11:11 danpovey

I'd actually suggest returning these graphs directly from the Lhotse DataLoader to have a clear separation between data preparation and the rest of the training loop, assuming they can be created on the host and transferred to the device later.

Nov 12 '20 13:11 pzelasko

That might create a dependency on k2. Let's do it later, if at all...

Nov 12 '20 14:11 danpovey