Big gap in WER between online and offline CTC decoding
I tried offline decoding using hlg_decode.cu and online decoding using online_decode.cu. And here is the result:
- For model librispeech conformer ctc: offline decoding: 3.49% WER, online decoding: 19.08% WER
- For our model: offline decoding: ~3% WER, online decoding: ~18% WER. In both cases the online WER is much larger than the offline WER, even though both use the same acoustic model output; online decoding uses chunk size 16.
Could you please explain the difference between offline decoding and online decoding? Also, could you share your results for the two kinds of decoding? Thanks!
There are examples in Sherpa of real-time/streaming/online decoding, I think that might be a better starting point? Normally you need to use a model that has been trained with streaming in mind.
There are examples in Sherpa of real-time/streaming/online decoding
Can you please specify which example it is? I did look into sherpa repo but did not find any examples about CTC-based streaming.
Normally you need to use a model that has been trained with streaming in mind.
I used the same AM output for both offline and streaming decoding. I don't think the gap can be that big.
Can you please specify which example it is? I did look into sherpa repo but did not find any examples about CTC-based streaming.
Sorry, there is no CTC HLG streaming decoding in Sherpa, only one example in k2/torch/bin (I think it is the online_decode.cu you used).
I used the same AM output for both offline and streaming decoding. I don't think the gap can be that big.
We normally test the streaming decoding method with a streaming model; maybe you can try online_decode.cu with a streaming model. An offline model is not suitable for a streaming decoding method.
But @pkufool I think that binary just evaluates the nnet for the entire file and simulates streaming, so surely it should in principle give the same results as the offline decoding if it was given a non-streaming model? (Even though this would not be useful in practice).
@pkufool @danpovey How I tested was that I read the audio file and evaluated the nnet output for the entire audio. Then I used that output to simulate streaming as in online_decode.cu and used the final text result to compute the WER. I did the test twice, using the conformer ctc model from icefall and my own conformer ctc model (trained with wenet). However, in both cases the results were not as good as offline decoding. I tried to print out the lattice (lattice.fsa.values) from the online decoder and noticed that the first few lattices are almost the same as those from the offline decoder, but then they start to differ.
hm, how did it differ? @pkufool do you think there is possibly a bug that is affecting him? @chiendb97 what version of k2 are you using? see if a newer version helps.
what version of k2 are you using? see if a newer version helps.
I am using the latest version of k2.
@pkufool do you think there is possibly a bug that is affecting him?
Yes, I think there could be some bugs. I will look into the code.
I am currently experiencing the same issue. Offline decoding is fine, but any form of streaming using OnlineDenseIntersecter increases WER by an unreasonable amount, with almost all new errors coming from deletions.
I am currently experiencing the same issue. Offline decoding is fine but any form of streaming using OnlineDenseIntersecter increases WER by an unreasonable amount with almost all new errors coming from deletions.
OK, I am debugging it.
Any updates @pkufool?
Any updates @pkufool?
Sorry, I did not fix it that day and forgot about it; I will return to it.
@svandiekendialpad @chiendb97 Does the difference only happen when using --use_ctc_decoding=false (i.e. decoding with an n-gram)?
Hi @pkufool, I just ran tests again using librispeech conformer ctc, here is the result:
- Using `--use_ctc_decoding=true`, I got WER=7.3%.
- Using offline CTC decoding in `ctc_decode.cu`, I got WER=2.6%.
So I think there is still a significant difference between online and offline implementations regardless of using n-gram (though the gap is smaller).
I can confirm what @binhtranmcs said. It all points to a bug in the online decoding code.
@binhtranmcs I think https://github.com/k2-fsa/k2/pull/1218 solves some problems, but there are still differences between the lattices generated by online and offline mode. Now I know it relates to the pruning, and I am trying to fix it.
@danpovey I think one issue is: in offline mode the forward pass always runs before the backward pass (i.e. when we expand the arcs at step t, frames[t] has not been pruned by the backward pass), but in the current online implementation, when we expand at step t (where t is the last frame of the previous chunk), frames[t] has already been pruned by the backward pass in the previous chunk. This is the only difference I found after reading the code carefully.
Does the backward pass start with -(forward score) on all active states? That's how it is supposed to work.
Hi @danpovey, since I want to understand the code, could you please point me to some references for the online/offline decoding algorithm implemented here? I am pretty new to this, so it would really help a lot. Thanks in advance.
I think it is described, or at least mentioned, in my paper about exact lattices: pruned Viterbi beam search with some extensions to store a lattice. The guys have discovered the problem, but I don't know if they have made the fix public yet.
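For readers new to this, a pruned Viterbi beam search can be sketched in a few lines. This is a toy illustration of the general idea only (it tracks just the best score per state and keeps no lattice); all names are made up for the example and it is not k2's implementation.

```python
import math

def pruned_viterbi(arcs, start, log_probs, beam):
    """Toy pruned Viterbi beam search.

    arcs: list of (src_state, dst_state, symbol) transitions.
    log_probs: log_probs[t][symbol] is the acoustic score at frame t.
    beam: at each frame, keep only states within `beam` of the best score.
    Returns {state: best score} for states surviving the last frame.
    """
    active = {start: 0.0}
    for t in range(len(log_probs)):
        nxt = {}
        for src, score in active.items():
            for s, dst, sym in arcs:
                if s != src:
                    continue
                new = score + log_probs[t][sym]
                if new > nxt.get(dst, -math.inf):
                    nxt[dst] = new
        if not nxt:
            return {}
        best = max(nxt.values())
        # Forward pruning: drop states falling outside the beam.
        active = {s: v for s, v in nxt.items() if v >= best - beam}
    return active

# Tiny graph: state 0 loops on symbol 0, moves to state 1 on symbol 1.
arcs = [(0, 0, 0), (0, 1, 1), (1, 1, 1)]
log_probs = [[0.0, -5.0], [-5.0, 0.0]]
print(pruned_viterbi(arcs, 0, log_probs, 3.0))
```

The exact-lattice decoder additionally records the surviving arcs per frame and runs a backward pass to prune the stored lattice, which is where the online/offline discrepancy discussed in this thread comes in.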
@binhtranmcs @svandiekendialpad @chiendb97 I think https://github.com/k2-fsa/k2/pull/1218 can fix this issue, you can try it on your dataset.
@pkufool, I just tested again with librispeech conformer ctc, using `online_decode.cu`:
- With `--use_ctc_decoding=true`, WER=7.3%.
- With `--use_ctc_decoding=false`, WER=12.2%.
WER for online HLG decoding did decrease (from 18% down to 12%), but it is not as good as offline decoding (3.49%). I think there are still problems here.
For me it went up from 33% to 45%, when 14% should be normal. Should I have used `allow_partial` anywhere? I just left it at its default (true in OnlineDenseIntersecter).
@binhtranmcs @svandiekendialpad OK, I just tested some bad cases, will test the full test datasets.
Hi @pkufool, are there any updates on this?
I think #1218 may be relevant to this. Not merged yet but says it is ready.
I think #1218 may be relevant to this. Not merged yet but says it is ready.
It's a pity that the fixes in #1218 cannot fix all the issues; I am still debugging it.
I did some experiments on librispeech test-clean; here are the results. For ctc-decoding (decoding with a CTC topology), after applying the fixes in #1218 I can get almost the same WERs for online and offline.
|  | Offline | Online (chunk=10) |
| --- | --- | --- |
| CTC decoding | 2.99 | 2.92 |
For HLG decoding (decoding with an HLG), there is still a big difference between online and offline, mainly deletions at the tails of sentences.
|  | Offline | Online (chunk=10) | Online (chunk=30) | Online (chunk=50) | Online (chunk=30), decoding_graph.scores = 0.0 |
| --- | --- | --- | --- | --- | --- |
| HLG decoding | 2.77 | 19.06 | 6.93 | 5.13 | 3.02 |
I believe this is the issue of pruning at the boundary frames (as I mentioned above). When I set the `output_beam` (used in backward pruning) the same as the `search_beam` (used in forward pruning), I can get the same results.
|  | Offline | Online (chunk=10) | Online (chunk=10), output-beam=search-beam |
| --- | --- | --- | --- |
| HLG decoding | 2.77 | 19.06 | 2.73 |
I need to revisit the implementation carefully to figure out the fix for this issue; for now I think you can try using the same `output_beam` and `search_beam`.
[edit:] BTW, I added Python test code in #1218, `online_decode.py` and `hlg_decode.py`, which accept a wav scp; you can then use simple-wer to calculate the WERs.
@pkufool this makes me think that the backward scores have not been initialized correctly. They are supposed to be set to -(the forward score) when we do "intermediate" pruning (i.e. pruning not at the end of the file). If that is done, it should be OK to prune using "output_beam". I suspect that something is not working right in this respect: for example, they are not being set to that value, or they are being overwritten somehow, or something to do with a final-state is not correct.
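The point about intermediate backward scores can be shown with a small numeric sketch (illustration only, not k2 code): a state survives backward pruning iff forward + backward >= best_total - output_beam. If backward is initialized to -(forward score) at a chunk boundary, every active state's total is 0, so none is pruned there, matching the offline behavior where the future is still unknown. Any other backward estimate at the boundary can prune states the offline pass would keep.

```python
def pruned(forward, backward, output_beam):
    """Return per-state flags: True means the state is pruned away."""
    totals = [f + b for f, b in zip(forward, backward)]
    best = max(totals)
    return [t < best - output_beam for t in totals]

forward = [0.0, -6.0, -9.0]  # forward scores of active boundary states
output_beam = 4.0

# Correct intermediate initialization: backward = -forward, so every
# total is 0 and no boundary state is pruned.
assert pruned(forward, [-f for f in forward], output_beam) == [False, False, False]

# A premature backward estimate (say, all zeros) prunes states whose
# forward score is more than output_beam behind the best.
assert pruned(forward, [0.0, 0.0, 0.0], output_beam) == [False, True, True]
```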
@binhtranmcs @svandiekendialpad @chiendb97 I updated #1218; I think this time it should be able to fix your issue.