
How to convert the .scorer LM model to a character-based .kenlm?

FengYen-Chang opened this issue 3 years ago • 5 comments

Hi All,

Currently, the speech recognition sample supports vocabulary-based .kenlm models generated by scorer_to_kenlm.py. However, scorer_to_kenlm.py seems to support only the vocabulary-based format, which means the sample cannot produce character output with an LM unless users can convert a character-based LM.

As for the CTC decoder, I believe it already supports character output (see this part of the code), since it detects which kind of LM it is given.

Therefore, I think the key to enabling character-based support is the LM. Is there any method to convert the .scorer file to a character-based .kenlm file, either based on scorer_to_kenlm.py or with another script?

FengYen-Chang avatar Mar 25 '21 15:03 FengYen-Chang

@AlexeyKruglov could you please comment on this?

vladimir-dudnik avatar Mar 26 '21 08:03 vladimir-dudnik

Hi. Thank you for your contribution to OMZ.

Conversion of "character-based" models is currently not supported because it was not tested and debugged. File parse_trie_v6.py contains a check that intentionally fails for "character-based" models because of this (find if is_utf8_mode: in that file). ("Character-based" mode is in fact not character-based, but byte-based in UTF-8 encoding, that is actually in this mode LM alphabet contains sub-character symbols -- that's why it was called "utf8_mode" in the converter.)

AlexeyKruglov avatar Mar 29 '21 14:03 AlexeyKruglov

Hi @AlexeyKruglov Thanks for your comment.

Yes, I noticed that check in parse_trie_v6.py, and I am trying to enable is_utf8_mode based on parse_trie_v6.py to add Chinese support. However, I am still studying the .kenlm file format and trying to convert the .scorer to .kenlm. If you have any recommendations, please let me know. Thank you.

FengYen-Chang avatar Mar 29 '21 15:03 FengYen-Chang

More details. (I will call the "character-based" mode the "UTF8-byte mode" below.)

  1. The conversion utility (scorer_to_kenlm.py) seems to work in UTF8-byte mode with three changes: 1) removing the check with if is_utf8_mode:, 2) using the --no-drop-space option, and 3) passing a proper alphabet argument to the trie_v6_extract_vocabulary() function in scorer_to_kenlm.py. Like this:

         parser.add_argument('--trie-offset', type=int, default=None, help="TRIE section offset (optional)")
         parser.add_argument('--no-drop-space', action='store_true',
                             help="Don't remove space at the end of each vocabulary word")
    +    parser.add_argument('--alphabet-utf8', action='store_true',
    +                        help="Use alphabet of 255 non-0 8-bit chars for UTF-8 mode")
         return parser.parse_args()
    
    
     def main():
         args = parse_args()
    +    alphabet = None if not args.alphabet_utf8 else [bytes((i,)) for i in range(1, 256)]
    
         data_scorer = args.input.read_bytes()
    
         data_scorer, data_trie, trie_offset = scorer_cut_trie_v6(data_scorer, trie_offset=args.trie_offset)
    -    vocabulary, metadata = trie_v6_extract_vocabulary(data_trie, base_offset=trie_offset)
    +    vocabulary, metadata = trie_v6_extract_vocabulary(data_trie, base_offset=trie_offset, alphabet=alphabet)
         data_scorer, vocab_offset = kenlm_v5_insert_vocabulary(data_scorer, vocabulary,
                                                                drop_final_spaces=not args.no_drop_space)
    

    Currently the converter actually hardcodes the alphabet. A better way would be to read the alphabet from the input files, but Mozilla DeepSpeech doesn't store the alphabet in the .scorer file; instead, it is stored inside the .pbmm file (or in the .pb file after pbmm_to_pb.py) in the metadata_alphabet field. A rough sketch of extracting it is shown below.
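    For reference, a rough sketch of pulling those raw bytes out of the .pb file, assuming the frozen graph stores the serialized alphabet in a string constant node named metadata_alphabet (the file path is only an example, and parsing the serialized alphabet format itself is a separate step):

        # Hypothetical sketch: inspect the frozen graph for the metadata_alphabet node.
        import tensorflow as tf

        graph_def = tf.compat.v1.GraphDef()
        with open('output_graph.pb', 'rb') as f:  # example path to the DeepSpeech .pb
            graph_def.ParseFromString(f.read())

        for node in graph_def.node:
            if node.name == 'metadata_alphabet':
                # Const string tensors keep their payload in the string_val field.
                raw_alphabet = node.attr['value'].tensor.string_val[0]
                print(repr(raw_alphabet))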

    Just converting the scorer to the old 0.6.x (kenlm) format is probably not enough. While the C++ backend should probably already support the UTF8-byte mode, the Python side may need some changes. I'll try to list the probable changes below.

  2. One problem is properly passing the correct alphabet from the outer layer (the demo app) to the inner wrapper layers (SWIG bindings, C++ code, etc.). Currently the alphabet type is list(str), but in UTF8-byte mode it should be list(bytes), as in the example above. Some kind of control logic needs to be implemented to activate this mode. Also, the change in data type may require some changes in the code, since it must support both str and bytes alphabets (a rough sketch follows).
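    A minimal sketch of what that control logic might look like on the demo side (the function and argument names here are hypothetical, not existing demo code):

        def build_alphabet(utf8_mode, alphabet_characters=None):
            # UTF8-byte mode: 255 single-byte symbols covering all non-zero byte
            # values, matching the --alphabet-utf8 alphabet used by the converter above.
            if utf8_mode:
                return [bytes((i,)) for i in range(1, 256)]
            # Regular mode: one str symbol per alphabet entry, as today.
            return list(alphabet_characters)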

  3. Another problem is that the C++ backend returns strings as arrays of numeric indices into the alphabet, and the Python side converts them into a readable string. Currently it decodes into str like this (inside alphabet.py, in class CtcdecoderAlphabet):

        def decode(self, keys):
            return ''.join(self.characters[key] for key in keys)
    

    In UTF8-byte mode, it needs to concatenate bytes (8-bit strings) instead (by replacing '' with bytes(), b'', or even type(self.characters[0])()), and then decode the result as UTF-8 with something like result.decode('utf-8') (with proper options to handle possible invalid UTF-8 sequences). This can be done in a new, separate alphabet class, for example; a sketch of such a class follows. By the way, the internals of the alphabet class are accessed in the ctcnumpy_beam_search_decoder.py file (look for alphabet.characters + [''] there), so the characters field/property of the alphabet class must be exposed.
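    A possible sketch of such a class (the class name and the errors= policy are assumptions; it mirrors the decode() shown above but works on bytes):

        class CtcdecoderAlphabetUtf8Bytes:
            def __init__(self):
                # 255 single-byte symbols: index k corresponds to byte value k + 1.
                self.characters = [bytes((i,)) for i in range(1, 256)]

            def decode(self, keys):
                # Concatenate the raw bytes first, then decode the result as UTF-8.
                # errors='replace' keeps decoding robust to invalid byte sequences.
                raw = b''.join(self.characters[key] for key in keys)
                return raw.decode('utf-8', errors='replace')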

AlexeyKruglov avatar Mar 30 '21 13:03 AlexeyKruglov

Thanks @AlexeyKruglov

Below are my replies to each point.

  1. Yes, I also patched parse_trie_v6.py recently, and my patch is quite similar to the one you shared. I also added the patch below to parse_trie_v6.py to disable the vocabulary-based mode, so that the CTC decoder treats the .kenlm file as character-based:

         vocab_offset = len(data_kenlm)
    -    data_kenlm = [data_kenlm[:with_vocab_offset], b'\1', data_kenlm[with_vocab_offset + 1:]]
    +    data_kenlm = [data_kenlm[:with_vocab_offset], b'\0', data_kenlm[with_vocab_offset + 1:]]
         data_kenlm.append(convert_vocabulary_to_kenlm_format(vocabulary, drop_final_spaces=drop_final_spaces))
    

    However, the results are still quite strange, as shown below.

  • w/o LM

    5.361588001251221	e5b88fe799bee4ba94e58d81e4baba
    5.381840229034424	e5b88fe799bee4ba94e58d81e4ba83ba
    5.4454755783081055	e59c8fe799bee4ba94e58d81e4baba
    

    I think the converted IR is correct, as the result is the same as the original DeepSpeech output with the Chinese model (v0.9.3). However, I cannot confirm the accuracy of this IR, since the Chinese model was trained on their internal dataset.

  • w/ LM

    839.2965698242188	b8
    839.3804931640625	9c
    839.5372314453125	85
    

    Based on this difference, I am guessing the converted .kenlm is incorrect, but I am not sure -- I am still checking it.

    To rule out differences between DeepSpeech v0.8.2 and v0.9.3, I also ran the same test with the English model, and the results were identical, so I don't think the scorer format differs much between these two releases.

  2. Yes, but I am thinking it may be better to keep it as list(str), since the labels input of the ctcdecoder is std::vector<std::string>:

    std::vector<std::pair<float, Output>> ctc_beam_search_decoder(
        const std::vector<std::vector<float>> &probs_seq,
        const std::vector<std::string> &vocabulary,
        size_t beam_size,
        float cutoff_prob,
        size_t cutoff_top_n,
        size_t blank_id,
        int log_input,
        ScorerBase *ext_scorer) {
    
  3. Yes, I will do that once the output is correct.

FengYen-Chang avatar Mar 31 '21 08:03 FengYen-Chang