Zipformer CTC HLG Decode to Words

Open · w11wo opened this issue 2 months ago · 7 comments

Hi. I have a phoneme-based Zipformer model, and I have successfully integrated an HLG component which takes the phonemes and converts them into a sequence of words. I've tried the icefall script here and it worked great.

For instance:

raw phoneme prediction: hɝɹɛdəmbɹɛlləɪzdʒʌstθbɛsst
+ HL: high err read am brigh ella i s just the best
+ HLG: her red umbrella is just the best

I am interested in doing HLG decoding in sherpa-onnx and saw that it has been supported through this PR.

I tried running my model via the example Python script, but found that it still only decoded the phoneme stream even after HLG decoding. E.g., I'm getting this as the output:

həɹɛdʌmbɹɛləɪzdʒʌstθbɛst

I assume there is a slight difference between the icefall onnx_pretrained_ctc_HLG_streaming.py script and the sherpa-onnx code.

I see that in icefall, it decodes osymbols_out through the word_table:

    ok, best_path = decoder.get_best_path()

    (
        ok,
        isymbols_out,
        osymbols_out,
        total_weight,
    ) = kaldifst.get_linear_symbol_sequence(best_path)

    if not ok:
        logging.info(f"Failed to get linear symbol sequence for {args.sound_file}")
        return

    hyps = " ".join([word_table[i] for i in osymbols_out]).lower()

which correctly decodes the words. But in the sherpa-onnx source code, it instead decodes isymbols_out through the token_table, and therefore outputs phonemes (not words):

    bool ok = decoder->GetBestPath(&fst_out);
    if (ok) {
      std::vector<int32_t> isymbols_out;
      std::vector<int32_t> osymbols_out_unused;
      ok = fst::GetLinearSymbolSequence(fst_out, &isymbols_out,
                                        &osymbols_out_unused, nullptr);
      std::vector<int64_t> tokens;
      tokens.reserve(isymbols_out.size());

      std::vector<int32_t> timestamps;
      timestamps.reserve(isymbols_out.size());

      std::ostringstream os;
      int32_t prev_id = -1;
      int32_t num_trailing_blanks = 0;
      int32_t f = 0;  // frame number

      for (auto i : isymbols_out) {
        // ...
      }

      // ...

I was wondering if we could get API support for osymbols_out, which in my case represents the decoded words. I think this could be beneficial for other users who have a similar phoneme-based ASR model and would like to decode back to words. BPE-based models might not have this issue, though, since decoding BPE tokens gives you back the words anyway.
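
For illustration, here's a rough sketch of what I have in mind. This is not the actual sherpa-onnx API: the helper name DecodeWords, the word_table parameter, and the exact includes are my assumptions; it just mirrors what the icefall script does with word_table:

    #include <sstream>
    #include <string>
    #include <vector>

    #include "fst/symbol-table.h"     // fst::SymbolTable (include paths assumed)
    #include "fstext/fstext-utils.h"  // fst::GetLinearSymbolSequence

    // Hypothetical helper: map the OUTPUT labels (osymbols) of the best
    // path through a word table loaded from words.txt, instead of mapping
    // the input labels through token_table as the current code does.
    std::string DecodeWords(const fst::VectorFst<fst::StdArc> &best_path,
                            const fst::SymbolTable &word_table) {
      std::vector<int32_t> isymbols_out;
      std::vector<int32_t> osymbols_out;
      bool ok = fst::GetLinearSymbolSequence(best_path, &isymbols_out,
                                             &osymbols_out, nullptr);
      std::ostringstream os;
      if (!ok) return os.str();

      std::string sep;
      for (int32_t w : osymbols_out) {
        if (w == 0) continue;  // skip <eps> (label 0), just in case
        os << sep << word_table.Find(w);
        sep = " ";
      }
      return os.str();  // e.g. "her red umbrella is just the best"
    }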

Thanks in advance.

w11wo · Apr 24 '24 08:04

Would you like to contribute?

The current HLG decoding code handles only BPE models.
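
Supporting that would mean, among other things, loading words.txt. A minimal sketch, assuming the OpenFst SymbolTable API that sherpa-onnx already builds against:

    #include <memory>

    #include "fst/symbol-table.h"

    // Sketch only: words.txt is the output-symbol table of HLG; it could
    // be loaded the way icefall does and kept alongside the token table.
    std::unique_ptr<fst::SymbolTable> word_table(
        fst::SymbolTable::ReadText("words.txt"));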

csukuangfj · Apr 24 '24 13:04

@csukuangfj

I probably don't have the capability to contribute this feature, sorry. It seems like there's a heap of stuff to go through, e.g. having to support words.txt, deciding where to store osymbols_out vs. isymbols_out results, etc.

I think it's best for the sherpa-onnx team to work on this for the best result 😅

w11wo · Apr 25 '24 03:04