
Creating minibatch of supervision more efficiently

Open danpovey opened this issue 5 years ago • 7 comments

@csukuangfj this is something you could help with. Here, the supervision is created via a loop in Python. Previously we added a function, LinearFsas(), that creates an FsaVec from a Ragged<int32_t> representing a list of sequences. To use it, we need to be able to create a Ragged<int32_t>, i.e. RaggedInt, from some kind of Python data structure, e.g. a list of lists of int, or possibly a pair of tensors. This would involve adding an appropriate constructor in PybindRaggedTpl.
And then of course wrap LinearFsas(), and probably LinearFsa() while we're at it.
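
For illustration, a list of lists of int can be flattened into the pair of arrays that a ragged structure is usually built from: a flat `values` array plus `row_splits` offsets. The helper name `to_row_splits` below is hypothetical and not part of k2; it is only a pure-Python sketch of the conversion such a constructor would have to perform.

```python
from typing import List, Tuple


def to_row_splits(seqs: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Flatten a list of int sequences into (values, row_splits).

    row_splits has len(seqs) + 1 entries; sequence i occupies
    values[row_splits[i]:row_splits[i + 1]].
    """
    values: List[int] = []
    row_splits: List[int] = [0]
    for seq in seqs:
        values.extend(seq)
        # Record where the next sequence starts in the flat array.
        row_splits.append(len(values))
    return values, row_splits


values, row_splits = to_row_splits([[1, 2], [3], []])
print(values)      # [1, 2, 3]
print(row_splits)  # [0, 2, 3, 3]
```

The two lists could then be handed over as a pair of tensors, which is the second input format mentioned above.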

This is initially a TODO within k2, not snowfall.

https://github.com/k2-fsa/snowfall/blob/70b8481ef947ee78ed3da5c4689f4d608479131a/egs/librispeech/asr/simple_v1/train.py#L33

danpovey avatar Nov 12 '20 09:11 danpovey

I would suggest the following approach:

import k2


def create_decoding_graph(texts, graph, symbols):
    # Map every transcript to a list of word IDs; out-of-vocabulary words
    # fall back to '<UNK>'.
    word_ids_list = []
    for text in texts:
        filter_text = [
            i if i in symbols._sym2id else '<UNK>' for i in text.split(' ')
        ]
        word_ids = [symbols.get(i) for i in filter_text]
        word_ids_list.append(word_ids)

    # A list of lists yields one linear FSA per transcript.
    fsa = k2.linear_fsa(word_ids_list)
    decoding_graph = k2.intersect(fsa, graph).invert_()
    decoding_graph = k2.add_epsilon_self_loops(decoding_graph)
    return decoding_graph

k2.linear_fsa supports creating a single FSA as well as a vector of FSAs.
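
The word-to-ID step can be tried without k2. The sketch below uses a plain dict as a stand-in for the symbol table (the real table's interface is not shown in this thread), and the helper name `texts_to_word_ids` is made up for illustration.

```python
from typing import Dict, List


def texts_to_word_ids(texts: List[str], sym2id: Dict[str, int],
                      unk: str = '<UNK>') -> List[List[int]]:
    """Map each transcript to word IDs, replacing unknown words by `unk`."""
    unk_id = sym2id[unk]
    return [[sym2id.get(word, unk_id) for word in text.split(' ')]
            for text in texts]


# Toy symbol table; the IDs are arbitrary.
sym2id = {'<eps>': 0, '<UNK>': 1, 'hello': 2, 'world': 3}
print(texts_to_word_ids(['hello world', 'hello there'], sym2id))
# [[2, 3], [2, 1]]
```

The nested list has one inner list per transcript, which is the shape passed to k2.linear_fsa in the function above.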

csukuangfj avatar Nov 12 '20 09:11 csukuangfj

A linear FSA is always arc sorted, so I think it is not necessary to sort it before calling k2.intersect.
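
To see why, note that each state of a linear FSA has at most one outgoing arc, so there is nothing to reorder within a state. The sketch below builds such an arc list in plain Python. The (src, dst, label) tuples and the closing arc labelled -1 follow k2's convention as far as I understand it and should be treated as an assumption; `linear_arcs` and `is_arc_sorted` are hypothetical helpers, not k2 functions.

```python
from itertools import groupby
from typing import List, Tuple

Arc = Tuple[int, int, int]  # (src_state, dst_state, label)


def linear_arcs(labels: List[int]) -> List[Arc]:
    """One arc per label, chained state to state, plus a final arc (-1)."""
    arcs = [(i, i + 1, label) for i, label in enumerate(labels)]
    n = len(labels)
    arcs.append((n, n + 1, -1))  # assumed final-arc convention
    return arcs


def is_arc_sorted(arcs: List[Arc]) -> bool:
    """True if, for every source state, outgoing arcs are ordered by label.

    Assumes arcs leaving the same state are stored next to each other.
    """
    for _, group in groupby(arcs, key=lambda arc: arc[0]):
        labels = [arc[2] for arc in group]
        if labels != sorted(labels):
            return False
    return True


arcs = linear_arcs([5, 3, 9])
print(arcs)                 # [(0, 1, 5), (1, 2, 3), (2, 3, 9), (3, 4, -1)]
print(is_arc_sorted(arcs))  # True
```

The labels 5, 3, 9 are not in increasing order, yet the check passes because sortedness is only required among arcs that leave the same state.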

csukuangfj avatar Nov 12 '20 09:11 csukuangfj

Oh, thanks! And I think we need to cache those graphs, maybe with cut_id + sequence_id as the key?

qindazhu avatar Nov 12 '20 09:11 qindazhu

> A linear FSA is always arc sorted, so I think it is not necessary to sort it before calling k2.intersect.

That's related to a bug we had before; the sort is mainly there for debugging.

qindazhu avatar Nov 12 '20 09:11 qindazhu

Sure, caching the graphs makes sense.
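
A minimal sketch of such a cache, assuming each supervision can be identified by a (cut_id, sequence_id) pair as suggested above. The `GraphCache` class and the `build` callable are hypothetical; in practice `build` would wrap something like `create_decoding_graph`.

```python
from typing import Callable, Dict, Generic, Hashable, Tuple, TypeVar

G = TypeVar('G')
Key = Tuple[Hashable, Hashable]  # (cut_id, sequence_id)


class GraphCache(Generic[G]):
    """Build each graph at most once and reuse it on later epochs."""

    def __init__(self, build: Callable[[str], G]) -> None:
        self._build = build
        self._graphs: Dict[Key, G] = {}

    def get(self, key: Key, text: str) -> G:
        # Only call the (potentially expensive) builder on a cache miss.
        if key not in self._graphs:
            self._graphs[key] = self._build(text)
        return self._graphs[key]

    def __len__(self) -> int:
        return len(self._graphs)


calls = []


def fake_build(text: str):
    """Stand-in for graph construction; records how often it runs."""
    calls.append(text)
    return text.split(' ')


cache = GraphCache(fake_build)
cache.get(('cut-1', 0), 'hello world')
cache.get(('cut-1', 0), 'hello world')
print(len(calls))  # 1
```

This cache is unbounded, so for a large training set it may need an eviction policy or on-disk storage.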

danpovey avatar Nov 12 '20 11:11 danpovey

I'd actually suggest returning these graphs directly from the Lhotse DataLoader, to keep a clear separation between data preparation and the rest of the training loop. This assumes they can be created on the host and transferred to the device later.

pzelasko avatar Nov 12 '20 13:11 pzelasko

That might create a dependency on k2. Let's do it later, if at all...

danpovey avatar Nov 12 '20 14:11 danpovey