
Merging ARPA lms

Open Break-Neck opened this issue 7 years ago • 3 comments

Hi, could you please tell me whether I can merge a few large LMs in ARPA format using KenLM? I looked through the existing issues but couldn't find an answer:

  • In #146 you said it can be done, but the options for interpolate say "which must be in KenLM intermediate format". I tried it on ARPA files and on binary files from build_binary, but it fails while trying to read a file with a ".kenlm_intermediate" extension, and I can't see any way to pass ARPA LMs from the code.
  • And in #62 you said that you were going to kill off the intermediate format, but that was 2 years ago.

So could you please clarify this? I haven't found any other good tool for automatically fitting log-linear merge weights on a corpus. Thank you for your project!

Break-Neck avatar Nov 13 '18 16:11 Break-Neck

Interpolation was developed around the time neural networks took over, so it has rough edges. Currently the interpolation tool only knows how to read the intermediate format, and there isn't an ARPA->intermediate tool, but one could make one.
The intermediate format is relatively simple. There is a separate file for each order, containing records for the n-grams of that order. Each record is an array of 32-bit vocab ids, a 32-bit float log10 probability, and a 32-bit float log10 backoff (except that the highest order doesn't have a backoff). Files are sorted in suffix order. Unknown must be id 0. There is also a small piece of metadata about the order, which you can see in the examples generated by lmplz. The vocabulary is a separate file with the strings in id order, null-delimited.
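
To make the record layout above concrete, here is a minimal Python sketch of packing n-gram records as described. It is not a working converter, and several details are assumptions not stated in the reply: little-endian byte order, alphabetical id assignment for every word other than `<unk>`, and "suffix order" read as comparing the last word first and then moving left. The helper names (`build_vocab`, `vocab_bytes`, `pack_record`, `encode_order`) are hypothetical. The metadata and the file naming are left out on purpose; those should be checked against files produced by lmplz.

```python
import struct


def build_vocab(words):
    """Map words to ids with <unk> fixed at id 0.

    Assumption: the remaining words are numbered alphabetically; the
    reply only says that unknown must be id 0.
    """
    rest = sorted(set(words) - {"<unk>"})
    ordered = ["<unk>"] + rest
    return {w: i for i, w in enumerate(ordered)}, ordered


def vocab_bytes(ordered):
    """Serialize the vocabulary as null-delimited strings in id order."""
    return b"".join(w.encode("utf-8") + b"\0" for w in ordered)


def pack_record(ids, log10_prob, log10_backoff=None):
    """Pack one record: 32-bit ids, float32 prob, optional float32 backoff.

    Assumption: little-endian layout ("<"). Pass log10_backoff=None for
    the highest order, which carries no backoff.
    """
    fmt = "<%dI" % len(ids) + ("f" if log10_backoff is None else "ff")
    values = list(ids) + [log10_prob]
    if log10_backoff is not None:
        values.append(log10_backoff)
    return struct.pack(fmt, *values)


def encode_order(entries, has_backoff):
    """Sort (ids, prob, backoff) entries of one order and concatenate them.

    Assumption: suffix order means comparing the last word id first,
    then the one before it, and so on.
    """
    ordered = sorted(entries, key=lambda e: tuple(reversed(e[0])))
    return b"".join(
        pack_record(ids, p, b if has_backoff else None) for ids, p, b in ordered
    )


if __name__ == "__main__":
    vocab, ordered = build_vocab(["<unk>", "hello", "world"])
    bigrams = [
        ((vocab["hello"], vocab["world"]), -0.5, -0.25),
        ((vocab["world"], vocab["hello"]), -1.0, -0.5),
    ]
    payload = encode_order(bigrams, has_backoff=True)
    print(len(vocab_bytes(ordered)), len(payload))
```

ARPA files already store log10 probabilities and backoffs, so a converter along these lines would mostly need to parse each `\N-grams:` section, map words to ids, and write one sorted file per order.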

kpu avatar Nov 15 '18 11:11 kpu

Did anybody look into this and implement such a tool? I am also very much interested in interpolating with an existing ARPA LM.

sarahberanek avatar Mar 17 '21 19:03 sarahberanek

Is there a tool you would suggest for converting an .arpa file to the intermediate format?

khoanguyenvietmanh avatar Aug 06 '21 14:08 khoanguyenvietmanh