CTC training speed question
Hi, for my experiment, the built-in CTC loss (cuDNN CTC) is about 2.5 times faster than k2-ctc. I was wondering whether this is normal, and I would like to make sure my program is correct.
I found that decoding_graph = k2.compose(self.ctc_topo, label_graph, treat_epsilons_specially=False) is the bottleneck, even with build_ctc_topo2 (https://github.com/k2-fsa/snowfall/pull/209). How about constructing the CTC training graph directly from the text, rather than composing the topology FST with the label FST? In my experiment this gives a speed very similar to cuDNN CTC.
Like the picture below:

[image attachment not preserved]
Thanks for doing the comparison, and sure, that's a good idea. Yes, we should introduce a special-purpose function that constructs a batch of CTC graphs from a ragged tensor consisting of the linear symbol sequences for each one. Perhaps @pkufool could work on that?
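A sketch of how such a function might be called; recent k2 versions provide k2.ctc_graph along these lines, though the exact signature below should be treated as an assumption:

import k2

# Linear symbol (token-ID) sequences for a batch of three utterances.
token_ids = [[3, 5, 5, 9], [7, 2], [4]]

# One call builds a CTC training graph per utterance and returns an FsaVec;
# `modified=False` selects the standard CTC topology and `device` places the
# graphs on the GPU directly (both are assumptions about the API).
ctc_graphs = k2.ctc_graph(token_ids, modified=False, device='cuda')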
Shall we also consider the transition probability contained in the bigram P while constructing the graph for LF-MMI training? (It's not an issue for CTC training.)
Perhaps for LF-MMI it would be best to use our current code so it takes care of that. But last time I checked, graph compilation does not actually take much time since we batch things up.
Max, perhaps you could show us what code you are using for graph compilation, e.g. are you compiling these things individually or as a batch?
Hi Dan, my comparison is based on this code: https://github.com/k2-fsa/snowfall/blob/2dda31e14039a79b77c89bcd3bb96d52cbf60c8a/snowfall/training/ctc_graph.py#L108-L127. I used the code below to avoid the composition operation. I was wondering if there is any reference code for compiling them as a batch. Thanks.
def _compile_one_and_cache_v2(self, text: torch.Tensor) -> k2.Fsa:
    # Build the CTC training graph for one transcript directly,
    # without composing the CTC topology with a label FST.
    text = text.tolist()
    blank_idx = 0
    num_tokens = len(text)
    # Standard CTC trellis: a blank state interleaved before and
    # after every token state, i.e. 2 * num_tokens + 1 states.
    S = 2 * num_tokens + 1
    final = S  # the final state must be the largest-numbered state
    arcs = []
    arcs.append([final])  # final-state line; sorted to the end below
    for s in range(S):
        idx = (s - 1) // 2  # token index; only used for odd (token) states
        word_id = text[idx] if s % 2 else blank_idx
        # Self-loop: repeat the current blank/token.
        arcs.append([s, s, word_id, word_id, 0])
        if s > 0:
            # Forward transition from the previous state.
            arcs.append([s - 1, s, word_id, word_id, 0])
        if s % 2 and s > 1 and word_id != text[idx - 1]:
            # Skip the intervening blank when adjacent tokens differ.
            arcs.append([s - 2, s, word_id, word_id, 0])
    # Both the last token state and the last blank state can end.
    arcs.append([S - 2, final, -1, -1, 0])
    arcs.append([S - 1, final, -1, -1, 0])
    arcs = sorted(arcs, key=lambda arc: arc[0])
    arcs = '\n'.join(' '.join(str(i) for i in arc) for arc in arcs)
    ctc_graph = k2.Fsa.from_str(arcs, False)
    return k2.arc_sort(ctc_graph).to(self.device)
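For context, a hedged sketch of how graphs built this way would typically feed the CTC loss; the surrounding names (compiler, texts, nnet_output, supervision_segments) are illustrative rather than snowfall's actual training loop:

import k2

# Pack the per-utterance graphs into one FsaVec (hypothetical call site).
decoding_graphs = k2.create_fsa_vec(
    [compiler._compile_one_and_cache_v2(t) for t in texts])

# nnet_output: (N, T, C) log-probs; supervision_segments: (N, 3) int32 rows
# of (sequence_index, start_frame, num_frames).
dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision_segments)

# Intersect the graphs with the network output; the negated total
# log-probability is the CTC loss.
lattice = k2.intersect_dense(decoding_graphs, dense_fsa_vec, output_beam=10.0)
tot_scores = lattice.get_tot_scores(log_semiring=True, use_double_scores=True)
ctc_loss = -tot_scores.sum()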
That code is doing the composition on CPU. Could you try https://github.com/k2-fsa/snowfall/blob/2dda31e14039a79b77c89bcd3bb96d52cbf60c8a/snowfall/training/mmi_graph.py#L150-L181, which runs on GPU?
Thanks. I could try composing that way. Actually, my code follows this requirement from the docs: "When treat_epsilons_specially is True, this function works only on CPU. When treat_epsilons_specially is False and both a_fsa and b_fsa are on GPU, then this function works on GPU." So I do k2.compose(self.ctc_topo.to("cuda"), decoding_graph.to("cuda"), treat_epsilons_specially=False). I think it's on GPU according to the doc?
Thanks for doing the comparison, and sure, that's a good idea. Yes, we should introduce a special-purpose function that constructs a batch of CTC graphs from a ragged tensor consisting of the linear symbol sequences for each one. Perhaps @pkufool could work on that?
Sure, I will.
I was wondering if there is any reference code for compiling them as a batch.
I am afraid that has to be done in C++.
So I do k2.compose(self.ctc_topo.to("cuda"), decoding_graph.to("cuda"), treat_epsilons_specially=False). I think it's on GPU according to the doc?
Yes, it is run on GPU. It would be more efficient if you:
(1) move ctc_topo to GPU inside the constructor, i.e., in __init__;
(2) construct the decoding graph on GPU directly, rather than moving it there after construction.
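Put together, the two suggestions amount to something like the following minimal sketch (the class and method names are illustrative, not snowfall's actual compiler):

import k2
import torch

class CtcGraphCompiler:
    def __init__(self, ctc_topo: k2.Fsa, device: torch.device):
        self.device = device
        # (1) Move the topology to the GPU once, in the constructor.
        self.ctc_topo = k2.arc_sort(ctc_topo.to(device))

    def compile(self, label_graph: k2.Fsa) -> k2.Fsa:
        # (2) Keep the label graph on the GPU and compose there;
        # treat_epsilons_specially=False lets the composition run on GPU.
        label_graph = k2.arc_sort(label_graph.to(self.device))
        return k2.compose(self.ctc_topo, label_graph,
                          treat_epsilons_specially=False)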
Thanks for doing the comparison, and sure, that's a good idea. Yes, we should introduce a special-purpose function that constructs a batch of CTC graphs from a ragged tensor consisting of the linear symbol sequences for each one. Perhaps @pkufool could work on that?
@danpovey Do you mean constructing the decoding graphs for the texts in a batch, rather than calling compile_one_and_cache() several times?
Yes, I'm talking about constructing it for a batch at a time; in general, all our FSA functions work on batches (of course, people can use a batch of one if needed). This function will be very fast, so there is no problem re-doing the work on each minibatch.
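Until such a C++ helper exists, the batch-at-a-time idea can be approximated from Python by re-compiling and packing the graphs on every minibatch; a short sketch, assuming a compiler object like the one above:

import k2

def compile_batch(compiler, texts) -> k2.Fsa:
    # Re-build the graphs on each minibatch; construction is cheap enough
    # that no caching is needed (compiler/texts are illustrative names).
    fsas = [compiler._compile_one_and_cache_v2(t) for t in texts]
    # Downstream k2 ops (intersection, pruning, shortest path) then
    # operate on the whole FsaVec at once.
    return k2.create_fsa_vec(fsas)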