Merging ARPA LMs
Hi, could you please tell me whether I can merge a few large LMs in ARPA format using KenLM? I looked through the existing issues but couldn't find an answer:
- In #146 you said it can be done, but the options for interpolate say the models "must be in KenLM intermediate format". I tried it on ARPA and on binary models from build_binary, but it fails trying to read a file with a ".kenlm_intermediate" extension, and I can't see any way to pass ARPA LMs from the code.
- And in #62 you said you were going to kill off the intermediate format, but that was two years ago.
So could you please clarify this? I haven't found any other good tool that automatically fits log-linear merge weights on a corpus. Thank you for your project!
Interpolation was developed around the time neural networks took over, so it has rough edges. Currently the interpolation tool only accepts the intermediate format, and there isn't an ARPA->intermediate converter, though one could be written.
The intermediate format is relatively simple: separate files for each order, containing one record per n-gram. Each record is an array of 32-bit vocab ids, a 32-bit float log10 probability, and a 32-bit float log10 backoff (except the highest order, which has no backoff). Files are sorted in suffix order. Unknown must be id 0. There is also a small piece of metadata about the order, which you can see in the examples generated by lmplz. The vocabulary is a separate file of null-delimited strings in vocabulary-id order.
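To make the record layout concrete, here is a minimal sketch of packing and unpacking one record along the lines described above. This is an assumption based on the description in this thread, not code from KenLM itself; in particular the little-endian byte order, the helper names, and the omission of the metadata and vocabulary files are all illustrative.

```python
import struct

def pack_record(vocab_ids, log10_prob, log10_backoff=None):
    """Pack one n-gram record: n x uint32 vocab ids, a float32 log10
    probability, and (for non-highest orders) a float32 log10 backoff.
    Little-endian is an assumption; KenLM presumably uses native order."""
    fmt = "<" + "I" * len(vocab_ids) + "f"
    fields = list(vocab_ids) + [log10_prob]
    if log10_backoff is not None:  # highest order carries no backoff
        fmt += "f"
        fields.append(log10_backoff)
    return struct.pack(fmt, *fields)

def unpack_record(data, order, has_backoff=True):
    """Inverse of pack_record for a single record of the given order."""
    fmt = "<" + "I" * order + "f" + ("f" if has_backoff else "")
    fields = struct.unpack(fmt, data)
    ids = list(fields[:order])
    prob = fields[order]
    backoff = fields[order + 1] if has_backoff else None
    return ids, prob, backoff

# Example: a bigram record (two vocab ids, prob, backoff) is 16 bytes.
rec = pack_record([3, 7], -1.5, -0.25)
ids, prob, backoff = unpack_record(rec, order=2)
```

An ARPA->intermediate converter would then stream each n-gram section of the ARPA file, map tokens to vocab ids (with unknown as id 0), emit one such record per n-gram into the per-order file, and suffix-sort the result.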
Did somebody look into this and implement such a tool? I am also very interested in interpolating with an existing ARPA LM.
Is there a tool for converting .arpa file to intermediate file that you suggest?