Zipformer CTC HLG Decode to Words
Hi. I have a phoneme-based Zipformer model, and I have successfully integrated an HLG component that takes the phonemes and converts them into a sequence of words. I've tried the icefall script here and it worked great.
For instance:
raw phoneme prediction: hɝɹɛdəmbɹɛlləɪzdʒʌstθbɛsst
+ HL: high err read am brigh ella i s just the best
+ HLG: her red umbrella is just the best
I am interested in doing HLG decoding in sherpa-onnx and saw that support for it was added in this PR.
I tried running my model via the example Python script, but found that it still emits only the phoneme stream even after HLG decoding. E.g., I'm getting this as the output:
həɹɛdʌmbɹɛləɪzdʒʌstθbɛst
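For reference, the recognizer is configured roughly like this (shown as a C++ sketch; the field names follow my reading of that PR and may be approximate):

```cpp
#include "sherpa-onnx/csrc/online-recognizer.h"

int main() {
  // Field names approximate, based on the HLG-decoding PR.
  sherpa_onnx::OnlineRecognizerConfig config;
  config.model_config.zipformer2_ctc.model = "model.onnx";  // my phoneme CTC model
  config.model_config.tokens = "tokens.txt";                // phoneme token table
  config.ctc_fst_decoder_config.graph = "HLG.fst";          // HLG decoding graph
  config.ctc_fst_decoder_config.max_active = 3000;

  sherpa_onnx::OnlineRecognizer recognizer(config);
  // ... create a stream, feed audio, and decode as in the examples ...
  return 0;
}
```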
I assume there is a slight difference between the icefall `onnx_pretrained_ctc_HLG_streaming.py` script and the sherpa-onnx code. I see that in icefall it decodes `osymbols_out` through `word_table`:
```python
ok, best_path = decoder.get_best_path()
(
    ok,
    isymbols_out,
    osymbols_out,
    total_weight,
) = kaldifst.get_linear_symbol_sequence(best_path)
if not ok:
    logging.info(f"Failed to get linear symbol sequence for {args.sound_file}")
    return

hyps = " ".join([word_table[i] for i in osymbols_out]).lower()
```
which correctly decodes the words. (In an HLG graph the input labels are tokens, here phonemes, and the output labels are words, which is why reading the output labels through `word_table` yields the word sequence.) The sherpa-onnx source code, however, decodes `isymbols_out` through `token_table`, and therefore outputs the phonemes (not the words):
```cpp
bool ok = decoder->GetBestPath(&fst_out);
if (ok) {
  std::vector<int32_t> isymbols_out;
  std::vector<int32_t> osymbols_out_unused;
  ok = fst::GetLinearSymbolSequence(fst_out, &isymbols_out,
                                    &osymbols_out_unused, nullptr);

  std::vector<int64_t> tokens;
  tokens.reserve(isymbols_out.size());

  std::vector<int32_t> timestamps;
  timestamps.reserve(isymbols_out.size());

  std::ostringstream os;
  int32_t prev_id = -1;
  int32_t num_trailing_blanks = 0;
  int32_t f = 0;  // frame number

  for (auto i : isymbols_out) {
    ....
  }
  ....
```
I was wondering if we can get API support for `osymbols_out`, which in my case represents the decoded words. I think this would benefit other users who have a similar phoneme-based ASR model and would like to decode back to words. BPE-based models might not have this issue, since decoding BPE pieces gives you back the words anyway.
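To make the request concrete, here is a minimal sketch of what a word-level branch could look like, mirroring the icefall logic above. The `word_table` and surrounding names are hypothetical, not the actual sherpa-onnx API:

```cpp
// Hypothetical sketch, not the actual sherpa-onnx code: a word-level
// variant of the snippet above. Assumes word_table was loaded once from
// words.txt, e.g.
//   std::unique_ptr<fst::SymbolTable> word_table{
//       fst::SymbolTable::ReadText("words.txt")};
bool ok = decoder->GetBestPath(&fst_out);
if (ok) {
  std::vector<int32_t> isymbols_out_unused;
  std::vector<int32_t> osymbols_out;
  ok = fst::GetLinearSymbolSequence(fst_out, &isymbols_out_unused,
                                    &osymbols_out, nullptr);
  if (ok) {
    std::string text;
    for (auto w : osymbols_out) {
      if (w == 0) continue;  // label 0 is epsilon in OpenFst; skip it
      if (!text.empty()) text += ' ';
      text += word_table->Find(w);  // map word ID -> word string
    }
    // text is now e.g. "her red umbrella is just the best"
  }
}
```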
Thanks in advance.
Would you like to contribute?
The current HLG decoding code handles only BPE models.
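To illustrate what "handles only BPE models" means here: joining the input-label tokens reconstructs words only because BPE pieces mark word boundaries with a leading "▁". A self-contained sketch of that assumption (illustrative, not the actual sherpa-onnx code):

```cpp
#include <string>
#include <vector>

// Why joining input-label tokens works for BPE but not for phonemes:
// BPE pieces mark a word boundary with a leading "▁" (U+2581), so
// concatenating the pieces and turning that marker into a space
// recovers the words. Phoneme tokens carry no boundary marker, so the
// same join yields one unsegmented string, as in the output above.
std::string JoinBpePieces(const std::vector<std::string> &pieces) {
  std::string text;
  for (const auto &p : pieces) {
    text += p;
  }

  const std::string kBoundary = "\xE2\x96\x81";  // UTF-8 bytes of U+2581
  std::string out;
  for (std::size_t i = 0; i < text.size();) {
    if (text.compare(i, kBoundary.size(), kBoundary) == 0) {
      if (!out.empty()) out += ' ';
      i += kBoundary.size();
    } else {
      out += text[i];
      ++i;
    }
  }
  return out;
}

// JoinBpePieces({"▁her", "▁red", "▁um", "brella"})
//   -> "her red umbrella"
// With phonemes there is no "▁", so the join stays concatenated.
```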
@csukuangfj I probably don't have the capability to contribute this feature, sorry. It seems like there are heaps of things to go through, e.g. having to support `words.txt`, where to store the `osymbols_out` vs. `isymbols_out` results, etc.
I think it's best for the sherpa-onnx team to work on this for the best result 😅
Hi @csukuangfj, would it be possible to implement this sometime soon? 🙏 I think it could be a good feature to have.
Is it possible for you to share a model that can be used for testing?
@csukuangfj, sorry, but I can't share the model since it's proprietary.
Is it possible to instead use the demo model from this PR? I understand that it's still going to be BPE -> BPE, but perhaps we can still use its `ctc-*.onnx` model, `words.txt`, and `HLG.fst` to test. Also, we can cross-check against the results from Icefall.
Ok, then please help test it once it is implemented. I will try to add it this week or next week.
Sure thing. Thank you so much, @csukuangfj!