How to convert the .scorer LM model to a character-based .kenlm?
Hi All,
Currently, the speech recognition demo supports the vocabulary-based `.kenlm` model, which is generated by `scorer_to_kenlm.py`. However, `scorer_to_kenlm.py` seems to support only the vocabulary-based mode. This means the sample does not support character output with an LM if users cannot convert a character-based LM.
As for the CTC decoder, I think it already supports character output, as this part of the code detects which mode the LM uses.
Therefore, I think the key to enabling character-based support is the LM. To make it support character output, is there any method to convert this `.scorer` file to a character-based `.kenlm` file, either based on `scorer_to_kenlm.py` or with another script?
@AlexeyKruglov could you please comment on this?
Hi. Thank you for your contribution to OMZ.
Conversion of "character-based" models is currently not supported because it was not tested and debugged. File parse_trie_v6.py
contains a check that intentionally fails for "character-based" models because of this (find if is_utf8_mode:
in that file). ("Character-based" mode is in fact not character-based, but byte-based in UTF-8 encoding, that is actually in this mode LM alphabet contains sub-character symbols -- that's why it was called "utf8_mode" in the converter.)
Hi @AlexeyKruglov, thanks for your comment.
Yes, I noticed it in `parse_trie_v6.py`, and I am trying to enable `is_utf8_mode` based on `parse_trie_v6.py` to add Chinese support. However, I am still studying the `.kenlm` file format and trying to convert the `.scorer` to `.kenlm`.
If you have any recommendations on this, please let me know. Thank you.
More details. (I will call the "character-based" mode the "UTF8-byte mode" below.)
- The conversion utility (`scorer_to_kenlm.py`) seems to work in UTF8-byte mode with three changes: 1) removing that check with `if is_utf8_mode:`, 2) using the `--no-drop-space` option, and 3) passing a proper `alphabet` argument to the `trie_v6_extract_vocabulary()` function from `scorer_to_kenlm.py`. Like this:

```diff
     parser.add_argument('--trie-offset', type=int, default=None,
                         help="TRIE section offset (optional)")
     parser.add_argument('--no-drop-space', action='store_true',
                         help="Don't remove space at the end of each vocabulary word")
+    parser.add_argument('--alphabet-utf8', action='store_true',
+                        help="Use alphabet of 255 non-0 8-bit chars for UTF-8 mode")
     return parser.parse_args()


 def main():
     args = parse_args()
+    alphabet = None if not args.alphabet_utf8 else [bytes((i,)) for i in range(1, 256)]
     data_scorer = args.input.read_bytes()
     data_scorer, data_trie, trie_offset = scorer_cut_trie_v6(data_scorer, trie_offset=args.trie_offset)
-    vocabulary, metadata = trie_v6_extract_vocabulary(data_trie, base_offset=trie_offset)
+    vocabulary, metadata = trie_v6_extract_vocabulary(data_trie, base_offset=trie_offset, alphabet=alphabet)
     data_scorer, vocab_offset = kenlm_v5_insert_vocabulary(data_scorer, vocabulary,
                                                            drop_final_spaces=not args.no_drop_space)
```
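As a small illustration (not part of the script itself), the alphabet built for UTF-8 mode above is simply all 255 non-zero byte values, each as a one-byte `bytes` object:

```python
# Same expression as in the diff above: 255 one-byte symbols 0x01..0xFF.
alphabet = [bytes((i,)) for i in range(1, 256)]
print(len(alphabet), alphabet[:3])  # 255 [b'\x01', b'\x02', b'\x03']
```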
Currently the converter actually hardcodes the alphabet. A better way would be to read the alphabet from the input files. But Mozilla DeepSpeech doesn't store the alphabet in the `.scorer` file; instead, it is stored inside the `.pbmm` file (or in the `.pb` file after `pbmm_to_pb.py`) in the `metadata_alphabet` field.
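For example, something like this could read it from the `.pb` (an untested sketch; it assumes `metadata_alphabet` is stored as a string constant node in the frozen graph, and the file path is just an example):

```python
# Sketch: read the alphabet from a DeepSpeech frozen graph (.pb).
import tensorflow as tf

graph_def = tf.compat.v1.GraphDef()
with open('output_graph.pb', 'rb') as f:  # example path
    graph_def.ParseFromString(f.read())

# Find the constant node holding the alphabet and take its string value.
alphabet_bytes = next(
    node.attr['value'].tensor.string_val[0]
    for node in graph_def.node
    if node.name == 'metadata_alphabet'
)
print(alphabet_bytes.decode('utf-8'))
```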
Just converting the scorer to the old 0.6.x (kenlm) format is probably not enough. While the C++ backend should probably already support the UTF8-byte mode, the Python side may need some changes. I'll try to list the probable changes below.
- One problem is properly passing the correct alphabet from the outer (demo app) to the inner (SWIG bindings, C++ code, etc.) wrapper layers. Currently the alphabet type is `list(str)`, but in UTF8-byte mode it should be `list(bytes)`, as in the example above. Some kind of control logic needs to be implemented to activate this mode. Also, the change in data type may require some changes in the code (it must support both `str` and `bytes` alphabets).
- Another problem is that the C++ backend returns strings as arrays of numeric indices into the alphabet, and the Python side converts them into a readable string. Currently it decodes them into `str` like this (inside `alphabet.py`, in the class `CtcdecoderAlphabet`):

```python
def decode(self, keys):
    return ''.join(self.characters[key] for key in keys)
```

In UTF8-byte mode, it needs to concatenate `bytes` (8-bit strings) instead (by replacing `''` with `bytes()`, `b''`, or even `type(self.characters[0])()`), and then decode UTF-8 with something like `result.decode('utf-8')` (but with proper options to handle possible invalid UTF-8 sequences). This can be done in a new, separate alphabet class, for example. By the way, the internals of the alphabet class are accessed in the `ctcnumpy_beam_search_decoder.py` file; find `alphabet.characters + ['']` there. That is, the `characters` field/property of the alphabet class must be exposed.
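For instance, such a new alphabet class might look roughly like this (a rough sketch; the class name and error handling are my assumptions, not OMZ code):

```python
# Hypothetical byte-based alphabet class for UTF8-byte mode.
class CtcdecoderAlphabetUtf8Bytes:
    def __init__(self):
        # 255 one-byte symbols, matching the converter's alphabet above.
        self.characters = [bytes((i,)) for i in range(1, 256)]

    def decode(self, keys):
        raw = b''.join(self.characters[key] for key in keys)
        # errors='replace' handles possible invalid UTF-8 from the decoder.
        return raw.decode('utf-8', errors='replace')
```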
Thanks @AlexeyKruglov. Below are my replies to each point.
- Yes, I also patched `parse_trie_v6.py` recently, and my patch is almost the same as the one you shared. I also added this patch to `parse_trie_v6.py` to disable the vocabulary-based mode; with it, the CTC decoder will treat the `.kenlm` file as character-based:

```diff
     vocab_offset = len(data_kenlm)
-    data_kenlm = [data_kenlm[:with_vocab_offset], b'\1', data_kenlm[with_vocab_offset + 1:]]
+    data_kenlm = [data_kenlm[:with_vocab_offset], b'\0', data_kenlm[with_vocab_offset + 1:]]
     data_kenlm.append(convert_vocabulary_to_kenlm_format(vocabulary, drop_final_spaces=drop_final_spaces))
```
However, the results still look quite strange; see below.
- w/o LM:

```
5.361588001251221 e5b88fe799bee4ba94e58d81e4baba
5.381840229034424 e5b88fe799bee4ba94e58d81e4ba83ba
5.4454755783081055 e59c8fe799bee4ba94e58d81e4baba
```

I think the converted IR is correct, as the result is the same as the original DeepSpeech with the Chinese model (`v0.9.3`). However, I cannot confirm the performance of this IR, as the Chinese model is trained on their internal dataset.
- w/ LM:

```
839.2965698242188 b8
839.3804931640625 9c
839.5372314453125 85
```

Based on this difference, I am guessing the converted `.kenlm` is incorrect, but I am not sure; I am still checking it (see the quick decoding check at the end of this reply). To clarify the difference between DeepSpeech `v0.8.2` and `v0.9.3`, I also ran the test with the English model. Those results are the same, so I think the `scorer` format does not differ much between these two releases.
- Yes, but I am thinking that keeping it as `list(str)` may be better, as the labels input of the ctcdecoder is a `std::vector<std::string>`:

```cpp
std::vector<std::pair<float, Output>> ctc_beam_search_decoder(
    const std::vector<std::vector<float>> &probs_seq,
    const std::vector<std::string> &vocabulary,
    size_t beam_size,
    float cutoff_prob,
    size_t cutoff_top_n,
    size_t blank_id,
    int log_input,
    ScorerBase *ext_scorer) {
```
- Yes, I will do that after the output is correct.
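P.S. A quick way to sanity-check the raw transcripts above (just an illustration) is to decode the hex strings as UTF-8 and see whether they form valid sequences:

```python
# The w/o-LM transcript decodes as valid UTF-8 (readable Chinese),
# while the w/ LM outputs like 'b8' are bare continuation bytes.
print(bytes.fromhex('e5b88fe799bee4ba94e58d81e4baba').decode('utf-8', errors='replace'))
print(bytes.fromhex('b8').decode('utf-8', errors='replace'))  # -> U+FFFD
```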