
Big gap in WER between online and offline CTC decoding

Open chiendb97 opened this issue 1 year ago • 36 comments

I tried offline decoding using hlg_decode.cu and online decoding using online_decode.cu. And here is the result:

  • For model librispeech conformer ctc: offline decoding: 3.49% WER, online decoding: 19.08% WER
  • For our model: offline decoding: ~3% WER, online decoding: ~18% WER (the online WER is much larger than the offline one; both use the same AM output, and online decoding uses chunk size 16)

Could you please explain the difference between offline decoding and online decoding? In addition, could you share your results for the two kinds of decoding? Thanks!

chiendb97 avatar May 11 '23 15:05 chiendb97

There are examples in Sherpa of real-time/streaming/online decoding, I think that might be a better starting point? Normally you need to use a model that has been trained with streaming in mind.

danpovey avatar May 11 '23 15:05 danpovey

There are examples in Sherpa of real-time/streaming/online decoding

Can you please specify which example it is? I did look into sherpa repo but did not find any examples about CTC-based streaming.

Normally you need to use a model that has been trained with streaming in mind.

I used the same AM output for both offline and streaming decoding. I don't think the gap can be that big.

chiendb97 avatar May 11 '23 15:05 chiendb97

Can you please specify which example it is? I did look into sherpa repo but did not find any examples about CTC-based streaming.

Sorry, there is no CTC HLG streaming decoding in Sherpa, only one example in k2/torch/bin (I think it is the online_decode.cu you used).

I used the same AM output for both offline and streaming decoding. I don't think the gap can be that big.

We normally test the streaming decoding method with a streaming model; maybe you can try online_decode.cu with a streaming model. An offline model is not suitable for a streaming decoding method.

pkufool avatar May 12 '23 05:05 pkufool

But @pkufool I think that binary just evaluates the nnet for the entire file and simulates streaming, so surely it should in principle give the same results as the offline decoding if it was given a non-streaming model? (Even though this would not be useful in practice).

danpovey avatar May 13 '23 03:05 danpovey

@pkufool @danpovey How I tested was that I read the audio file and evaluated the nnet output for the entire audio. Then I used that output to simulate streaming as in online_decode.cu and used the final text result to compute the WER. I did the test twice, using the conformer ctc model from icefall and my own conformer ctc model (trained with wenet). However, in both cases the results were not as good as offline decoding. I tried printing out the lattice (lattice.fsa.values) of the online decoder and noticed that the first few entries are quite the same as the offline decoder's, but then they start to differ.
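The simulated-streaming setup described above (run the acoustic model once on the whole utterance, then feed the output to the online decoder chunk by chunk) can be sketched roughly like this. The names `chunk_frames` and `nnet_output` are illustrative stand-ins, not the actual k2 API:

```python
# Toy sketch of simulated streaming: the acoustic model is run once on the
# whole utterance, and the resulting per-frame log-prob matrix is then fed
# to the online decoder in fixed-size chunks (chunk size 16, as in the
# experiment above). The decoding itself is omitted; online_decode.cu does
# the real intersection.

def chunk_frames(nnet_output, chunk_size=16):
    """Split a list of per-frame log-prob vectors into fixed-size chunks."""
    return [nnet_output[i:i + chunk_size]
            for i in range(0, len(nnet_output), chunk_size)]

# 40 frames of a dummy 3-symbol log-prob matrix
frames = [[0.0, -1.0, -2.0] for _ in range(40)]
chunks = chunk_frames(frames)

assert len(chunks) == 3       # 16 + 16 + 8 frames
assert len(chunks[-1]) == 8   # last chunk is shorter
```

If the decoder is correct, feeding `chunks` one at a time should in principle yield the same lattice as decoding `frames` in one pass, which is exactly what this thread is investigating.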

chiendb97 avatar May 13 '23 07:05 chiendb97

hm, how did it differ? @pkufool do you think there is possibly a bug that is affecting him? @chiendb97 what version of k2 are you using? see if a newer version helps.

danpovey avatar May 13 '23 07:05 danpovey

what version of k2 are you using? see if a newer version helps.

I am using the latest version of k2.

chiendb97 avatar May 13 '23 08:05 chiendb97

@pkufool do you think there is possibly a bug that is affecting him?

Yes, I think there could be some bugs. I will look into the code.

pkufool avatar May 15 '23 02:05 pkufool

I am currently experiencing the same issue. Offline decoding is fine but any form of streaming using OnlineDenseIntersecter increases WER by an unreasonable amount with almost all new errors coming from deletions.

svandiekendialpad avatar May 16 '23 20:05 svandiekendialpad

I am currently experiencing the same issue. Offline decoding is fine but any form of streaming using OnlineDenseIntersecter increases WER by an unreasonable amount with almost all new errors coming from deletions.

OK, I am debugging it.

pkufool avatar May 17 '23 02:05 pkufool

Any updates @pkufool?

svandiekendialpad avatar Jun 27 '23 16:06 svandiekendialpad

Any updates @pkufool?

Sorry, I did not fix it that day and it slipped my mind; I will return to it.

pkufool avatar Jun 28 '23 01:06 pkufool

@svandiekendialpad @chiendb97 Does the difference only happen when using --use_ctc_decoding=false (i.e. decoding with an n-gram)?

pkufool avatar Jul 04 '23 02:07 pkufool

Hi @pkufool, I just ran tests again using librispeech conformer ctc, here is the result:

  • Using --use_ctc_decoding=true, I got WER=7.3%.
  • Using offline ctc decoding in ctc_decode.cu, I got WER=2.6%.

So I think there is still a significant difference between the online and offline implementations regardless of n-gram use (though the gap is smaller).

binhtranmcs avatar Jul 04 '23 03:07 binhtranmcs

I can confirm what @binhtranmcs said. It all points to a bug in the online decoding code.

svandiekendialpad avatar Jul 04 '23 19:07 svandiekendialpad

@binhtranmcs I think https://github.com/k2-fsa/k2/pull/1218 solves some problems, but there are still differences between the lattices generated in online and offline mode. I now know it relates to the pruning; I am trying to fix it.

pkufool avatar Jul 11 '23 08:07 pkufool

@danpovey I think one issue is: in offline mode the forward pass always runs before the backward pass (i.e. when we expand the arcs at step t, frames[t] has not been pruned by the backward pass), but in the current online implementation, when we expand at step t (where t is the last frame of the previous chunk), frames[t] has already been pruned by the backward pass in the previous chunk. This is the only difference I found after reading the code carefully.
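The boundary effect described above can be illustrated with a toy example (made-up scores, not the k2 implementation): a state that looks bad at the end of chunk 1 may still lie on the best path once chunk 2's frames are seen, so pruning it at the boundary with a narrow beam removes that path for good.

```python
# Toy illustration of the chunk-boundary pruning effect (made-up numbers).
# Offline, expansion at frame t sees all states within search_beam; online,
# the backward pruning of the previous chunk has already applied the
# narrower output_beam at frame t.

search_beam = 10.0
output_beam = 4.0   # narrower beam used by the backward (lattice) pruning

# forward scores of states active at the boundary frame t
boundary_states = {"A": 0.0, "B": -5.0}   # B is 5 worse than A so far
best = max(boundary_states.values())

# Offline: both states survive (5.0 < search_beam), so B can still be
# expanded into the next frames.
offline_survivors = {s for s, f in boundary_states.items()
                     if best - f < search_beam}

# Online: chunk 1's backward pruning already applied output_beam at t,
# so B (5.0 >= output_beam) is gone before chunk 2 is processed.
online_survivors = {s for s, f in boundary_states.items()
                    if best - f < output_beam}

assert offline_survivors == {"A", "B"}
assert online_survivors == {"A"}   # B was pruned at the chunk boundary
```

This is consistent with the later observation in this thread that setting output_beam equal to search_beam makes the online results match the offline ones.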

pkufool avatar Jul 11 '23 08:07 pkufool

Does the backward pass start with -(forward score) on all active states? That's how it is supposed to work.
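The intended initialization can be shown with a toy numeric sketch (illustrative numbers, not k2 code): if each active state's backward score is set to the negative of its forward score, then forward + backward is zero for every active state, and intermediate pruning with output_beam cannot remove any of them.

```python
# Sketch of the initialization described above: at an intermediate pruning
# point, backward[s] = -forward[s] for every active state s, so the total
# (forward + backward) score is 0 everywhere and no active state falls
# outside the output beam.

forward = {"A": 0.0, "B": -5.0, "C": -9.0}
backward = {s: -f for s, f in forward.items()}

totals = {s: forward[s] + backward[s] for s in forward}
best = max(totals.values())

output_beam = 4.0
pruned = {s for s in totals if best - totals[s] >= output_beam}

assert all(t == 0.0 for t in totals.values())
assert pruned == set()   # no active state is pruned at the boundary
```

If the backward scores were instead left at values computed from a narrow backward pass, states like "B" and "C" would be pruned at the chunk boundary, which would produce exactly the deletion errors reported in this thread.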


danpovey avatar Jul 11 '23 08:07 danpovey

Hi @danpovey, since I want to understand the code, could you point me to some references for the online/offline decoding algorithm implemented here? I am pretty new to this, so it would really help. Thanks in advance.

binhtranmcs avatar Jul 16 '23 09:07 binhtranmcs

I think it is described in my paper about exact lattices, or at least mentioned there: pruned Viterbi beam search with some extensions to store a lattice. The guys have discovered the problem, but I don't know whether they have made the fix public yet.


danpovey avatar Jul 16 '23 10:07 danpovey

@binhtranmcs @svandiekendialpad @chiendb97 I think https://github.com/k2-fsa/k2/pull/1218 can fix this issue, you can try it on your dataset.

pkufool avatar Jul 19 '23 09:07 pkufool

@pkufool, I just tested again with librispeech conformer ctc, using online_decode.cu:

  • With --use_ctc_decoding=true, WER=7.3%.
  • With --use_ctc_decoding=false, WER=12.2%.

The WER for online HLG decoding did decrease (from 18% down to 12%), but it is still not as good as offline decoding (3.49%). I think there are still problems here.

binhtranmcs avatar Jul 19 '23 11:07 binhtranmcs

For me it went up from 33% to 45%, whereas around 14% would be normal. Should I have used allow_partial anywhere? I just left it at its default (true in OnlineDenseIntersecter).

svandiekendialpad avatar Jul 19 '23 23:07 svandiekendialpad

@binhtranmcs @svandiekendialpad OK, I just tested some bad cases, will test the full test datasets.

pkufool avatar Jul 20 '23 01:07 pkufool

Hi @pkufool, are there any updates on this?

binhtranmcs avatar Jul 31 '23 03:07 binhtranmcs

I think #1218 may be relevant to this. Not merged yet but says it is ready.

danpovey avatar Aug 01 '23 02:08 danpovey

I think #1218 may be relevant to this. Not merged yet but says it is ready.

It's a pity that the fixes in #1218 cannot fix all of the issues; I am still debugging it.

pkufool avatar Aug 01 '23 10:08 pkufool

I did some experiments on librispeech test-clean; here are the results. For ctc-decoding (decoding with a CTC topology), after applying the fixes in #1218 I get almost the same WERs for online and offline:

|              | Offline | Online (chunk=10) |
|--------------|---------|-------------------|
| Ctc-decoding | 2.99    | 2.92              |

For HLG decoding (decoding with an HLG), there is still a big difference between online and offline, mainly deletions at the tails of sentences:

|              | Offline | Online (chunk=10) | Online (chunk=30) | Online (chunk=50) | Online (chunk=30), decoding_graph.scores = 0.0 |
|--------------|---------|-------------------|-------------------|-------------------|------------------------------------------------|
| Hlg decoding | 2.77    | 19.06             | 6.93              | 5.13              | 3.02                                           |

I believe this is the issue of pruning at the boundary frames (as I mentioned above). When I set the output_beam (used in backward pruning) to the same value as the search_beam (used in forward pruning), I get the same results:

|              | Offline | Online (chunk=10) | Online (chunk=10), output_beam = search_beam |
|--------------|---------|-------------------|----------------------------------------------|
| Hlg decoding | 2.77    | 19.06             | 2.73                                         |

I need to revisit the implementation carefully to figure out a proper fix for this issue; for now I think you can try using the same output_beam and search_beam.

[edit:] BTW, I added Python test code in #1218 (online_decode.py and hlg_decode.py) which accepts a wav scp; you can then use simple-wer to calculate the WERs.
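For reference, the WER computation that tools like simple-wer perform is word-level edit distance over the reference length. A minimal sketch (not the simple-wer implementation):

```python
# Minimal word error rate: (substitutions + insertions + deletions) divided
# by the number of reference words, computed with a standard edit-distance
# dynamic program. A sketch of what simple-wer-style tools compute.

def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i          # delete all of r[:i]
    for j in range(len(h) + 1):
        dp[0][j] = j          # insert all of h[:j]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / max(len(r), 1)

assert wer("a b c d", "a b c d") == 0.0
assert wer("a b c d", "a x c") == 0.5   # one substitution + one deletion
```

Deletion-heavy errors, like the tail-of-sentence deletions reported above, show up directly in this metric since every missing reference word counts as one error.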

pkufool avatar Aug 03 '23 11:08 pkufool

@pkufool this makes me think that the backward scores have not been initialized correctly. They are supposed to be set to -(the forward score) when we do "intermediate" pruning (i.e. pruning not at the end of the file). If that is done, it should be OK to prune using output_beam. I suspect that something is not working right in this respect: for example, they are not being set to that value, or they are being overwritten somehow, or something to do with a final state is not correct.

danpovey avatar Aug 03 '23 14:08 danpovey

@binhtranmcs @svandiekendialpad @chiendb97 I have updated #1218; I think this time it should fix your issue.

pkufool avatar Aug 07 '23 13:08 pkufool