
Hotwords encoding for phonemes

Open w11wo opened this issue 1 year ago • 9 comments

Hi. I have a phoneme-based Zipformer model.

Before this PR, I was able to apply hotwords encoding to phoneme sequences, e.g. ɪ z/dʒ ʌ s t/b ɛ s t, following the older implementation of, e.g., Chinese character hotwords encoding. But now I've noticed that the Chinese character hotwords encoding has changed from 深 度 学 习 (whitespace between chars) to 深度学习 (no whitespace). I assume the string parser now simply iterates over the non-whitespace characters in the string.

This, however, breaks my use case: a phoneme sequence containing digraphs, e.g. dʒ ʌ s t, will be incorrectly split into d ʒ ʌ s t. The issue is that my model's vocab supports digraphs and therefore requires the old implementation.

Is it possible to add another modeling unit besides the currently supported ones (cjk, BPE, cjk+BPE)? Maybe instead of iterating over every non-whitespace character, split on whitespace first? This new modeling unit could hopefully support other use cases similar to mine.
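To illustrate the difference between the two behaviors described above, here is a minimal Python sketch (hypothetical helper names; this is not the actual sherpa-onnx parser, just a model of the two splitting strategies):

```python
def tokenize_per_char(hotword: str) -> list[str]:
    # Newer behavior: iterate over every non-whitespace character,
    # so the digraph "dʒ" is broken into "d" and "ʒ".
    return [ch for ch in hotword if not ch.isspace()]


def tokenize_by_whitespace(hotword: str) -> list[str]:
    # Older behavior: split on whitespace, keeping multi-character
    # units such as digraphs intact.
    return hotword.split()


phonemes = "dʒ ʌ s t"
print(tokenize_per_char(phonemes))       # ['d', 'ʒ', 'ʌ', 's', 't']  -- digraph broken
print(tokenize_by_whitespace(phonemes))  # ['dʒ', 'ʌ', 's', 't']      -- digraph preserved
```

The whitespace-first split is what a `do-not-tokenize`-style option would need to preserve for phoneme vocabularies with digraphs.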

Massive thanks for all the work and help thus far!

w11wo avatar Jun 06 '24 10:06 w11wo

@pkufool Could you have a look?

csukuangfj avatar Jun 06 '24 10:06 csukuangfj

Emm, maybe we can add an option like do-not-tokenize; I think that should fix your issue.

pkufool avatar Jun 07 '24 04:06 pkufool

For now, I think you can use the older version, v1.9.24.

pkufool avatar Jun 07 '24 04:06 pkufool

@pkufool Yes, the do-not-tokenize option sounds good.

I can stick to older versions for now, but I wanted to try the customizable per-word hotwords scores, which are available only in the latest releases, hence the need for this new feature.

w11wo avatar Jun 07 '24 04:06 w11wo

@w11wo OK, will make a PR.

pkufool avatar Jun 07 '24 09:06 pkufool

Hi @pkufool, is there an update on the PR?

w11wo avatar Jun 18 '24 05:06 w11wo

> Hi @pkufool, is there an update on the PR?

There is an on-going PR https://github.com/k2-fsa/sherpa-onnx/pull/1039

pkufool avatar Jun 21 '24 08:06 pkufool

Thank you so much @pkufool. Looking forward to it getting merged.

w11wo avatar Jun 21 '24 08:06 w11wo

Hi @pkufool, I'm encountering the same issue as @w11wo. I was wondering if PR #1039, which enables pre-tokenized hotwords, will be merged into the master branch? Thank you! 🙏

DavidSamuell avatar Feb 10 '25 08:02 DavidSamuell