kenlm Filtering: the format of the target vocabulary

Hi, thanks for this tool. I have a very large language model and want to filter it according to a target vocabulary. Is there a specific format for the vocabulary?
If I have a test set, how do I match it against the phrase table to produce a target vocabulary? I'm also confused by the usage of the filter, "bin/filter mode [context] [phrase] [raw|arpa] [threads:m] [batch_size:m] (vocab|model):input_file output_file": how do I provide the vocab file and the model file to the filter? Looking forward to your answer, thanks.

Nov 16 '18 09:11 Amber819

Run the filter program. It will print a help message with more command line documentation.
bin/filter vocab model:in.arpa out.arpa <vocabulary.txt
where vocabulary.txt is vocabulary words separated by space, horizontal tab, carriage return, or newline. It may contain duplicates.

Nov 16 '18 10:11 kpu
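
For the test-set question above, here is a minimal sketch (not from the thread) of one way to build such a vocabulary file. It assumes the test set is already tokenized the same way as the language model's training data. The file names test_set.txt and vocabulary.txt and the toy sentences are placeholders.

```shell
#!/bin/sh
# Placeholder file names; point TEST_SET at your own tokenized test set.
TEST_SET=${TEST_SET:-test_set.txt}
VOCAB=${VOCAB:-vocabulary.txt}

# Toy test set so the sketch runs on its own; skipped if the file already exists.
[ -f "$TEST_SET" ] || printf 'the cat sat\non the mat\n' > "$TEST_SET"

# Put every whitespace-separated token on its own line, drop empty lines,
# and deduplicate. Duplicates are allowed by the filter, so sort -u only
# keeps the file small.
tr -s '[:space:]' '\n' < "$TEST_SET" | sed '/^$/d' | LC_ALL=C sort -u > "$VOCAB"
```

The resulting file has one token per line, which satisfies the whitespace-delimited format described above.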

@kpu thanks a lot. Another question: my language model is stored in the trie data structure. How can I convert it to an ARPA file (needed by the filter)? I saw the solution in https://github.com/kpu/kenlm/issues/121, but I still don't know how to compile "dump_trie_main.cc" with bjam. Could you provide the command lines?

Nov 19 '18 07:11 Amber819

Using: bin/filter vocab:vocab.csv model:in.arpa out.arpa does not seem to work. I also tried the command mentioned above, but that does not work either. What is the correct usage? @kpu

Mar 02 '20 10:03 praneetmehta

Use only one of vocab: or model: (the other is on stdin). Also, it's not a csv, it's whitespace-delimited tokens.

Mar 02 '20 22:03 kpu
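
A sketch of what that advice could look like in practice, offered as an illustration and not taken from the replies. It assumes the CSV contains only words, and it assumes `single` is the right mode for filtering against one vocabulary (the filter's help text lists copy, single, multiple, and union as modes). All paths and file names are placeholders.

```shell
#!/bin/sh
# Placeholder path to the filter binary; override with FILTER=/path/to/filter.
FILTER=${FILTER:-bin/filter}

# Toy CSV for illustration. Commas become spaces so that the vocabulary
# is whitespace-delimited tokens.
printf 'cat,mat\nsat,the\n' > vocab.csv
tr ',' ' ' < vocab.csv > vocab.txt

if [ -x "$FILTER" ] && [ -f in.arpa ]; then
  # Model named as a file, vocabulary on stdin.
  "$FILTER" single model:in.arpa out.arpa < vocab.txt
else
  # Dry run when the binary or model is absent: print the two equivalent forms.
  echo "$FILTER single model:in.arpa out.arpa < vocab.txt"
  echo "$FILTER single vocab:vocab.txt out.arpa < in.arpa"
fi
```

Only one of the two inputs carries a vocab: or model: prefix in each form; the other arrives on stdin.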

cat vocab.txt | ../kenlm/build/bin/filter copy model:kenlm_out/input_model.arpa kenlm_out/vocab_filtered_model.arpa
@kpu is this the correct usage then? TIA :)

Mar 03 '20 09:03 praneetmehta

@kpu Thanks for your great work. Now I have trouble with bin/filter. I used bin/filter vocab model:in.arpa out.arpa <vocabulary.txt, but it still prints the usage message to stderr: `Usage: bin/filter mode [context] [phrase] [raw|arpa] [threads:m] [batch_size:m] (vocab|model):input_file output_file

copy mode just copies, but makes the format nicer for e.g. irstlm's broken parser.
single mode treats the entire input as a single sentence.
multiple mode filters to multiple sentences in parallel. Each sentence is on a separate line. A separate file is created for each sentence by appending the 0-indexed line number to the output file name.
union mode produces one filtered model that is the union of models created by multiple mode.

context means only the context (all but last word) has to pass the filter, but the entire n-gram is output.

phrase means that the vocabulary is actually tab-delimited phrases and that the phrases can generate the n-gram when assembled in arbitrary order and clipped. Currently works with multiple or union mode.

The file format is set by [raw|arpa] with default arpa: raw means space-separated tokens, optionally followed by a tab and arbitrary text. This is useful for ngram count files. arpa means the ARPA file format for n-gram language models.

threads:m sets m threads (default: conccurrency detected by boost)
batch_size:m sets the batch size for threading. Expect memory usage from this of 2*threads*batch_size n-grams.

There are two inputs: vocabulary and model. Either may be given as a file while the other is on stdin. Specify the type given as a file using vocab: or model: before the file name.

For ARPA format, the output must be seekable. For raw format, it can be a stream i.e. /dev/stdout`
I would appreciate it if you could provide a full command line :)

Jul 28 '20 07:07 DRosemei

I have the same question. What is the right command line for bin/filter? The commands given previously don't work. Thanks in advance.

Jul 21 '21 12:07 RuslanSel
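
Going by the help message quoted above, the first argument has to be one of the listed modes (copy, single, multiple, union), and "vocab" is not among them, which may be why the earlier command only printed the usage text. The following is a hedged sketch of a full command line in that argument order, not a confirmed answer from the maintainer. It uses union mode with one sentence per line; the thread count, paths, and file names are illustrative placeholders.

```shell
#!/bin/sh
# Placeholder path to the filter binary; override with FILTER=/path/to/filter.
FILTER=${FILTER:-bin/filter}
MODE=union      # one of: copy, single, multiple, union (from the help text)
THREADS=4       # illustrative value only

# Toy input: one sentence per line, as multiple and union mode expect.
printf 'the cat sat\non the mat\n' > sentences.txt

# Argument order follows the usage line:
#   mode [raw|arpa] [threads:m] (vocab|model):input_file output_file
# The model is named as a file here, so the sentences go on stdin.
set -- "$MODE" arpa "threads:$THREADS" model:in.arpa out.arpa

if [ -x "$FILTER" ] && [ -f in.arpa ]; then
  "$FILTER" "$@" < sentences.txt
else
  # Dry run when the binary or model is absent: just show the command.
  echo "$FILTER $* < sentences.txt"
fi
```

The output is a regular file (out.arpa) because the help text says ARPA output must be seekable.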
