CTC training speed question
Hi, for my experiment, the built-in CTC loss (cuDNN CTC) is about 2.5 times faster than k2-ctc. I was wondering whether this is normal, and I would like to make sure my program is correct.
I found that decoding_graph = k2.compose(self.ctc_topo, label_graph, treat_epsilons_specially=False) is the bottleneck, even with build_ctc_topo2 (https://github.com/k2-fsa/snowfall/pull/209). How about constructing the CTC training graph directly from the text, rather than composing the topology FST with the label FST? In my experiment this gives a speed very similar to cuDNN CTC.
Like the picture below:

[image attachment not preserved]
Thanks for doing the comparison, and sure, that's a good idea. Yes, we should introduce a special-purpose function that constructs a batch of CTC graphs from a ragged tensor consisting of the linear symbol sequences for each one. Perhaps @pkufool could work on that?
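A sketch of how such a function might be called; recent k2 versions provide k2.ctc_graph along these lines, though the exact signature below should be treated as an assumption:

import k2

# Linear symbol (token-ID) sequences for a batch of three utterances.
token_ids = [[3, 5, 5, 9], [7, 2], [4]]

# One call builds a CTC training graph per utterance and returns an FsaVec;
# `modified=False` selects the standard CTC topology and `device` places the
# graphs on the GPU directly (both are assumptions about the API).
ctc_graphs = k2.ctc_graph(token_ids, modified=False, device='cuda')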
Shall we also consider the transition probability contained in the bigram P while constructing the graph for LF-MMI training? (It's not an issue for CTC training.)
Perhaps for LF-MMI it would be best to use our current code so it takes care of that. But last time I checked, graph compilation does not actually take much time since we batch things up.
Max, perhaps you could show us what code you are using for graph compilation, e.g. are you compiling these things individually or as a batch?
Hi Dan, my comparison is based on this code: https://github.com/k2-fsa/snowfall/blob/2dda31e14039a79b77c89bcd3bb96d52cbf60c8a/snowfall/training/ctc_graph.py#L108-L127. I used the code below to avoid the composition operation. I was wondering if there is any reference code for compiling them as a batch. Thanks.
def _compile_one_and_cache_v2(self, text: torch.Tensor) -> k2.Fsa:
    # Build the CTC training graph for one transcript directly,
    # without composing the CTC topology with a label FST.
    text = text.tolist()
    blank_idx = 0
    num_tokens = len(text)
    # Standard CTC trellis: a blank state interleaved before and
    # after every token state, i.e. 2 * num_tokens + 1 states.
    S = 2 * num_tokens + 1
    final = S  # the final state must be the largest-numbered state
    arcs = []
    arcs.append([final])  # final-state line; sorted to the end below
    for s in range(S):
        idx = (s - 1) // 2  # token index; only used for odd (token) states
        word_id = text[idx] if s % 2 else blank_idx
        # Self-loop: repeat the current blank/token.
        arcs.append([s, s, word_id, word_id, 0])
        if s > 0:
            # Forward transition from the previous state.
            arcs.append([s - 1, s, word_id, word_id, 0])
        if s % 2 and s > 1 and word_id != text[idx - 1]:
            # Skip the intervening blank when adjacent tokens differ.
            arcs.append([s - 2, s, word_id, word_id, 0])
    # Both the last token state and the last blank state can end.
    arcs.append([S - 2, final, -1, -1, 0])
    arcs.append([S - 1, final, -1, -1, 0])
    arcs = sorted(arcs, key=lambda arc: arc[0])
    arcs = '\n'.join(' '.join(str(i) for i in arc) for arc in arcs)
    ctc_graph = k2.Fsa.from_str(arcs, False)
    return k2.arc_sort(ctc_graph).to(self.device)
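For context, a hedged sketch of how graphs built this way would typically feed the CTC loss; the surrounding names (compiler, texts, nnet_output, supervision_segments) are illustrative rather than snowfall's actual training loop:

import k2

# Pack the per-utterance graphs into one FsaVec (hypothetical call site).
decoding_graphs = k2.create_fsa_vec(
    [compiler._compile_one_and_cache_v2(t) for t in texts])

# nnet_output: (N, T, C) log-probs; supervision_segments: (N, 3) int32 rows
# of (sequence_index, start_frame, num_frames).
dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision_segments)

# Intersect the graphs with the network output; the negated total
# log-probability is the CTC loss.
lattice = k2.intersect_dense(decoding_graphs, dense_fsa_vec, output_beam=10.0)
tot_scores = lattice.get_tot_scores(log_semiring=True, use_double_scores=True)
ctc_loss = -tot_scores.sum()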
That code is doing the composition on CPU. Could you try https://github.com/k2-fsa/snowfall/blob/2dda31e14039a79b77c89bcd3bb96d52cbf60c8a/snowfall/training/mmi_graph.py#L150-L181, which runs on GPU?
Thanks. I could try composing that way. Actually, my code follows this requirement from the docs: "When treat_epsilons_specially is True, this function works only on CPU. When treat_epsilons_specially is False and both a_fsa and b_fsa are on GPU, then this function works on GPU." So I do k2.compose(self.ctc_topo.to("cuda"), decoding_graph.to("cuda"), treat_epsilons_specially=False). I think it's on GPU according to the doc?
Thanks for doing the comparison, and sure, that's a good idea. Yes, we should introduce a special-purpose function that constructs a batch of CTC graphs from a ragged tensor consisting of the linear symbol sequences for each one. Perhaps @pkufool could work on that?
Sure, I will.
I was wondering if there is any reference code for compiling them as a batch.
I am afraid that has to be done in C++.
So I do k2.compose(self.ctc_topo.to("cuda"), decoding_graph.to("cuda"), treat_epsilons_specially=False). I think it's on GPU according to the doc?
Yes, it is run on GPU. It would be more efficient if you:
(1) move ctc_topo to GPU inside the constructor, i.e., in __init__;
(2) construct the decoding graph on GPU directly, rather than moving it there after construction.
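Put together, the two suggestions amount to something like the following minimal sketch (the class and method names are illustrative, not snowfall's actual compiler):

import k2
import torch

class CtcGraphCompiler:
    def __init__(self, ctc_topo: k2.Fsa, device: torch.device):
        self.device = device
        # (1) Move the topology to the GPU once, in the constructor.
        self.ctc_topo = k2.arc_sort(ctc_topo.to(device))

    def compile(self, label_graph: k2.Fsa) -> k2.Fsa:
        # (2) Keep the label graph on the GPU and compose there;
        # treat_epsilons_specially=False lets the composition run on GPU.
        label_graph = k2.arc_sort(label_graph.to(self.device))
        return k2.compose(self.ctc_topo, label_graph,
                          treat_epsilons_specially=False)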
Thanks for doing the comparison, and sure, that's a good idea. Yes, we should introduce a special-purpose function that constructs a batch of CTC graphs from a ragged tensor consisting of the linear symbol sequences for each one. Perhaps @pkufool could work on that?
@danpovey Do you mean constructing the decoding graphs for the texts in a batch, rather than calling compile_one_and_cache() several times?
Yes, I'm talking about constructing it for a batch at a time; in general, all our FSA functions work on batches (of course, people can use a batch of one if needed). This function will be very fast, so there is no problem re-doing the work on each minibatch.
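Until such a C++ helper exists, the batch-at-a-time idea can be approximated from Python by re-compiling and packing the graphs on every minibatch; a short sketch, assuming a compiler object like the one above:

import k2

def compile_batch(compiler, texts) -> k2.Fsa:
    # Re-build the graphs on each minibatch; construction is cheap enough
    # that no caching is needed (compiler/texts are illustrative names).
    fsas = [compiler._compile_one_and_cache_v2(t) for t in texts]
    # Downstream k2 ops (intersection, pruning, shortest path) then
    # operate on the whole FsaVec at once.
    return k2.create_fsa_vec(fsas)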