Interpolate two LMs created with KenLM, one in binary format and one in ARPA format, with weight 0.5 for each LM
I have tried the following commands to convert both language models to the intermediate format needed to interpolate them:
bin/lmplz -o 3 --intermediate set1.intermediate <lm.binary --skip_symbols
bin/lmplz -o 3 --intermediate set2.intermediate <data.arpa --skip_symbols
bin/interpolate -m set{1,2}.intermediate -w 0.5 0.5 >model.arpa
But this produces an ARPA LM with hashed words, which seems to be caused by the binary LM. Is there any way to do the interpolation between two different formats?
The output looks like this:
-7.998156
0 -7.995031 5æfkÕROc¬ÇJáЯ:0mJWIB#N2Ú?/CÞ|pMFÖõš!uÃôq0thÜv7×fŒŸÔa+z¥Ãp[ÖD£3ò~i8Íâ_ JBO -0.0000030237939 -7.995031 =õô*M -0.0000030237939 -7.995031 æUüÿgño®ó4þY¿AÇ¿ùø[êø7Âx{ -0.0000030237939 -7.995031 ždp(µ€ÎŽpZàëKü^|wIÁKö. -0.0000030237939 -7.995031 j0mzâFÅ¢$ÊÈ!e0$2²A
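For reference, my understanding is that lmplz reads raw training text from stdin rather than an existing model, so the intended pipeline probably starts from the original corpora. A minimal sketch, assuming corpus1.txt and corpus2.txt (hypothetical names) are the raw texts the two LMs were trained on:

# build intermediate files directly from the raw training text
bin/lmplz -o 3 --intermediate set1.intermediate < corpus1.txt
bin/lmplz -o 3 --intermediate set2.intermediate < corpus2.txt
# interpolate the two intermediates with equal weights
bin/interpolate -m set{1,2}.intermediate -w 0.5 0.5 > model.arpa

Since I want to interpolate the already-built models rather than retrain from text, the question above still stands.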
I have run into a different problem doing a similar thing.
bin/lmplz -o 3 --intermediate set1.intermediate < model1.arpa --skip_symbols
bin/lmplz -o 3 --intermediate set2.intermediate < model2.arpa --skip_symbols --discount_fallback
bin/interpolate -m set{1,2}.intermediate -w 0.9 0.1 > mix_model.arpa
and the resulting mix_model.arpa looks like this:
\data\
ngram 1=644829
ngram 2=12670052
ngram 3=25758022
\1-grams:
-54.518303 <unk> 0
-52.52658 -0.0171115 -13.50311
-52.52658 -0.00951482 -13.50311
-53.48088 -0.0324709 -5.2374773
-52.52658 -0.0749187 -13.50311
-53.48088 -0.144883 -5.2374773
-40.502262 PAUSED -16.844946
-53.48088 -0.0970763 -5.2374773
-52.52658 -0.0147354 -13.50311
-52.52658 -0.0316612 -13.50311
-52.52658 -0.0027085 -13.50311
-53.48088 -0.0751489 -5.2374773
-53.48088 -4.1396 -2.1953142e-9
-53.48088 -0.00477613 -5.2374773
-53.48088 -0.0946923 -5.2374773
-53.48088 -0.0940779 -5.2374773
-53.48088 -0.0175576 -5.2374773
-53.48088 -0.0906019 -5.2374773
-53.48088 -0.00636295 -5.2374773
-53.48088 -0.00901661 -5.2374773
-53.48088 -0.241633 -5.2374773
-52.52658 -0.0709945 -13.50311
-53.48088 -0.0280209 -5.2374773
-52.52658 -0.101508 -13.50311
-53.48088 -0.0970096 -5.2374773
-53.48088 -0.0207892 -5.2374773
-39.809536 LOANING -9.177942
-53.48088 -0.0223462 -5.2374773
-53.48088 -5.2487 -6.8420647e-9
-53.48088 -0.0525024 -5.2374773
-53.48088 -0.0678001 -5.2374773
-52.52658 -0.0822201 -13.50311
-52.52658 -0.0873803 -13.50311
-53.48088 -2.0888 -1.8957937e-9
-53.48088 -0.0926938 -5.2374773
-52.52658 -0.143922 -13.50311
-53.48088 -6.5022 -4.0860257e-7
-53.48088 -0.215156 -5.2374773
-51.61946 -0.104799 -12.890841
-53.48088 -2.7527 -2.0234303e-9
-53.48088 -0.0728444 -5.2374773
-52.52658 -0.0054465 -13.50311
-44.464344 BERNICE -15.216694
-53.48088 -0.0780277 -5.2374773
-51.61946 -0.0782491 -12.890841
...
What are these lines that contain only numbers? When I do the same thing with SRILM using the command below, I get a meaningful ARPA model.
ngram -lm model1.arpa -mix-lm model2.arpa -lambda 0.1 -write-lm mix_model.arpa
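As far as I understand, both tools should be computing the same linear mixture of the conditional probabilities (mixing in probability space, even though ARPA files store log10 values):

p_mix(w | h) = lambda * p1(w | h) + (1 - lambda) * p2(w | h)

(One caveat: if I read the SRILM manual correctly, -lambda is the weight of the main model, so -lambda 0.1 weights model1 at 0.1, whereas my KenLM command weights model1 at 0.9. That should change the probabilities, but not explain the output below.)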
Also worth mentioning: the mix_model.arpa produced by KenLM is significantly larger than model1.arpa and model2.arpa combined:
model1.arpa = 216.5 MB
model2.arpa = 78.8 MB
mix_model.arpa = 1.3 GB
The model mixed with SRILM is mix_model.arpa = 223.9 MB.
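To compare the two mixtures beyond file size, a perplexity check on held-out text might be informative. Just a sketch, assuming test.txt is a held-out sample and the two mixtures are saved under distinct (hypothetical) names:

# score held-out text with the KenLM mixture using KenLM's query tool
bin/query kenlm_mix.arpa < test.txt
# score the same text with the SRILM mixture
ngram -lm srilm_mix.arpa -ppl test.txt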
@kpu, any idea what is going on here?