Zipformer CTC HLG Decode to Words

Open · w11wo opened this issue 2 months ago · 7 comments

Hi. I have a phoneme-based Zipformer model, and I have successfully integrated an HLG component which takes the phonemes and converts them into a sequence of words. I've tried the icefall script here and it worked great.

For instance:

raw phoneme prediction: hɝɹɛdəmbɹɛlləɪzdʒʌstθbɛsst
+ HL: high err read am brigh ella i s just the best
+ HLG: her red umbrella is just the best

I am interested in doing HLG decoding in sherpa-onnx and saw that it has been supported through this PR.

I tried running my model via the example Python script, but found that it still only decoded the phoneme stream even after HLG decoding. E.g., I'm getting this as the output:

həɹɛdʌmbɹɛləɪzdʒʌstθbɛst

I assume there is a slight difference between the icefall onnx_pretrained_ctc_HLG_streaming.py script and the sherpa-onnx code.

I see that in icefall, it decodes osymbols_out through the word_table:

    ok, best_path = decoder.get_best_path()

    (
        ok,
        isymbols_out,
        osymbols_out,
        total_weight,
    ) = kaldifst.get_linear_symbol_sequence(best_path)

    if not ok:
        logging.info(f"Failed to get linear symbol sequence for {args.sound_file}")
        return

    hyps = " ".join([word_table[i] for i in osymbols_out]).lower()

which correctly decodes the words. But in the sherpa-onnx source code, it instead decodes isymbols_out through the token_table, and therefore outputs phonemes (not words):

    bool ok = decoder->GetBestPath(&fst_out);
    if (ok) {
      std::vector<int32_t> isymbols_out;
      std::vector<int32_t> osymbols_out_unused;
      ok = fst::GetLinearSymbolSequence(fst_out, &isymbols_out,
                                        &osymbols_out_unused, nullptr);
      std::vector<int64_t> tokens;
      tokens.reserve(isymbols_out.size());

      std::vector<int32_t> timestamps;
      timestamps.reserve(isymbols_out.size());

      std::ostringstream os;
      int32_t prev_id = -1;
      int32_t num_trailing_blanks = 0;
      int32_t f = 0;  // frame number

      for (auto i : isymbols_out) {
        // ...
      }

      // ...

I was wondering if we could get API support for osymbols_out, which in my case represents the decoded words. I think this could be beneficial for other users who have a similar phoneme-based ASR model and would like to decode back to words. BPE-based models might not have this issue, though, since decoding BPE tokens gives you back the words anyway.
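
For illustration, here's a rough sketch of what I have in mind. This is not the actual sherpa-onnx API: the helper name DecodeWords, the word_table parameter, and the exact includes are my assumptions; it just mirrors what the icefall script does with word_table:

    #include <sstream>
    #include <string>
    #include <vector>

    #include "fst/symbol-table.h"     // fst::SymbolTable (include paths assumed)
    #include "fstext/fstext-utils.h"  // fst::GetLinearSymbolSequence

    // Hypothetical helper: map the OUTPUT labels (osymbols) of the best
    // path through a word table loaded from words.txt, instead of mapping
    // the input labels through token_table as the current code does.
    std::string DecodeWords(const fst::VectorFst<fst::StdArc> &best_path,
                            const fst::SymbolTable &word_table) {
      std::vector<int32_t> isymbols_out;
      std::vector<int32_t> osymbols_out;
      bool ok = fst::GetLinearSymbolSequence(best_path, &isymbols_out,
                                             &osymbols_out, nullptr);
      std::ostringstream os;
      if (!ok) return os.str();

      std::string sep;
      for (int32_t w : osymbols_out) {
        if (w == 0) continue;  // skip <eps> (label 0), just in case
        os << sep << word_table.Find(w);
        sep = " ";
      }
      return os.str();  // e.g. "her red umbrella is just the best"
    }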

Thanks in advance.

w11wo · Apr 24 '24 08:04

Would you like to contribute?

The current HLG decoding code handles only BPE models.
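
Supporting that would mean, among other things, loading words.txt. A minimal sketch, assuming the OpenFst SymbolTable API that sherpa-onnx already builds against:

    #include <memory>

    #include "fst/symbol-table.h"

    // Sketch only: words.txt is the output-symbol table of HLG; it could
    // be loaded the way icefall does and kept alongside the token table.
    std::unique_ptr<fst::SymbolTable> word_table(
        fst::SymbolTable::ReadText("words.txt"));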

csukuangfj · Apr 24 '24 13:04

@csukuangfj

I probably don't have the capability to contribute this feature, sorry. It seems like there's a heap of stuff to go through, e.g. having to support words.txt, deciding where to store osymbols_out vs. isymbols_out results, etc.

I think it's best for the sherpa-onnx team to work on this for the best result 😅

w11wo · Apr 25 '24 03:04