Interpolate two LMs created with KenLM, one in binary format and one in ARPA format, with weight 0.5 for each LM
I have tried the following commands to convert both language models to the intermediate format needed to interpolate them:
bin/lmplz -o 3 --intermediate set1.intermediate <lm.binary --skip_symbols
bin/lmplz -o 3 --intermediate set2.intermediate <data.arpa --skip_symbols
bin/interpolate -m set{1,2}.intermediate -w 0.5 0.5 >model.arpa
But this produces an ARPA LM with hashed words, which seems to be caused by the binary LM. Is there any way to do the interpolation between two different formats?
The output looks like this:
-7.998156
0 -7.995031 5æfkÕROc¬ÇJáЯ:0mJWIB#N2Ú?/CÞ|pMFÖõš!uÃôq0thÜv7×fŒŸÔa+z¥Ãp[ÖD£3ò~i8Íâ_ JBO -0.0000030237939 -7.995031 =õô*M -0.0000030237939 -7.995031 æUüÿgño®ó4þY¿AÇ¿ùø[êø7Âx{ -0.0000030237939 -7.995031 ždp(µ€ÎŽpZàëKü^|wIÁKö. -0.0000030237939 -7.995031 j0mzâFÅ¢$ÊÈ!e0$2²A
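For reference, my understanding is that lmplz reads raw training text from stdin rather than an existing model, so the intended pipeline probably starts from the original corpora. A minimal sketch, assuming corpus1.txt and corpus2.txt (hypothetical names) are the raw texts the two LMs were trained on:

# build intermediate files directly from the raw training text
bin/lmplz -o 3 --intermediate set1.intermediate < corpus1.txt
bin/lmplz -o 3 --intermediate set2.intermediate < corpus2.txt
# interpolate the two intermediates with equal weights
bin/interpolate -m set{1,2}.intermediate -w 0.5 0.5 > model.arpa

Since I want to interpolate the already-built models rather than retrain from text, the question above still stands.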
I have run into a different problem doing a similar thing.
bin/lmplz -o 3 --intermediate set1.intermediate < model1.arpa --skip_symbols
bin/lmplz -o 3 --intermediate set2.intermediate < model2.arpa --skip_symbols --discount_fallback
bin/interpolate -m set{1,2}.intermediate -w 0.9 0.1 > mix_model.arpa
and the resulting mix_model.arpa looks like this:
\data\
ngram 1=644829
ngram 2=12670052
ngram 3=25758022
\1-grams:
-54.518303 <unk> 0
-52.52658 -0.0171115 -13.50311
-52.52658 -0.00951482 -13.50311
-53.48088 -0.0324709 -5.2374773
-52.52658 -0.0749187 -13.50311
-53.48088 -0.144883 -5.2374773
-40.502262 PAUSED -16.844946
-53.48088 -0.0970763 -5.2374773
-52.52658 -0.0147354 -13.50311
-52.52658 -0.0316612 -13.50311
-52.52658 -0.0027085 -13.50311
-53.48088 -0.0751489 -5.2374773
-53.48088 -4.1396 -2.1953142e-9
-53.48088 -0.00477613 -5.2374773
-53.48088 -0.0946923 -5.2374773
-53.48088 -0.0940779 -5.2374773
-53.48088 -0.0175576 -5.2374773
-53.48088 -0.0906019 -5.2374773
-53.48088 -0.00636295 -5.2374773
-53.48088 -0.00901661 -5.2374773
-53.48088 -0.241633 -5.2374773
-52.52658 -0.0709945 -13.50311
-53.48088 -0.0280209 -5.2374773
-52.52658 -0.101508 -13.50311
-53.48088 -0.0970096 -5.2374773
-53.48088 -0.0207892 -5.2374773
-39.809536 LOANING -9.177942
-53.48088 -0.0223462 -5.2374773
-53.48088 -5.2487 -6.8420647e-9
-53.48088 -0.0525024 -5.2374773
-53.48088 -0.0678001 -5.2374773
-52.52658 -0.0822201 -13.50311
-52.52658 -0.0873803 -13.50311
-53.48088 -2.0888 -1.8957937e-9
-53.48088 -0.0926938 -5.2374773
-52.52658 -0.143922 -13.50311
-53.48088 -6.5022 -4.0860257e-7
-53.48088 -0.215156 -5.2374773
-51.61946 -0.104799 -12.890841
-53.48088 -2.7527 -2.0234303e-9
-53.48088 -0.0728444 -5.2374773
-52.52658 -0.0054465 -13.50311
-44.464344 BERNICE -15.216694
-53.48088 -0.0780277 -5.2374773
-51.61946 -0.0782491 -12.890841
...
What are these lines that contain only numbers? When I do the same thing with SRILM using the command below, I get a meaningful ARPA model.
ngram -lm model1.arpa -mix-lm model2.arpa -lambda 0.1 -write-lm mix_model.arpa
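As far as I understand, both tools should be computing the same linear mixture of the conditional probabilities (mixing in probability space, even though ARPA files store log10 values):

p_mix(w | h) = lambda * p1(w | h) + (1 - lambda) * p2(w | h)

(One caveat: if I read the SRILM manual correctly, -lambda is the weight of the main model, so -lambda 0.1 weights model1 at 0.1, whereas my KenLM command weights model1 at 0.9. That should change the probabilities, but not explain the output below.)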
Also worth mentioning: the mix_model.arpa produced by KenLM is significantly larger than model1.arpa and model2.arpa combined:
model1.arpa = 216.5 MB
model2.arpa = 78.8 MB
mix_model.arpa = 1.3 GB
The model mixed with SRILM is mix_model.arpa = 223.9 MB.
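To compare the two mixtures beyond file size, a perplexity check on held-out text might be informative. Just a sketch, assuming test.txt is a held-out sample and the two mixtures are saved under distinct (hypothetical) names:

# score held-out text with the KenLM mixture using KenLM's query tool
bin/query kenlm_mix.arpa < test.txt
# score the same text with the SRILM mixture
ngram -lm srilm_mix.arpa -ppl test.txt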
@kpu, any idea what is going on here?