
Expose vocab in Python API

Open igormq opened this issue 5 years ago • 4 comments

For ASR, we need access to the vocab in order to build a string trie and limit the beam search output.

Can you expose the vocabulary in the Python API?

igormq avatar Jul 23 '19 17:07 igormq
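
For readers unfamiliar with the use case above, here is a minimal sketch of how a vocabulary list could be turned into a character trie that restricts which characters a beam search may emit next. The names `build_trie`, `allowed_next_chars`, and `END` are hypothetical and not part of kenlm; the sketch assumes the vocabulary is already available as a plain list of strings.

```python
# Sentinel key marking "a complete word ends at this node".
# An empty string can never collide with a single-character key.
END = ""


def build_trie(words):
    """Build a nested-dict character trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def allowed_next_chars(trie, prefix):
    """Return the characters that can extend `prefix` toward a known word."""
    node = trie
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return set()  # prefix is not in the vocabulary at all
    return {key for key in node if key != END}


def is_word(trie, text):
    """Return True if `text` is a complete word in the trie."""
    node = trie
    for ch in text:
        node = node.get(ch)
        if node is None:
            return False
    return END in node


trie = build_trie(["cat", "car", "dog"])
print(sorted(allowed_next_chars(trie, "ca")))
```

During decoding, a hypothesis whose current partial word has an empty `allowed_next_chars` set could be pruned, which is one way to limit the beam search output to in-vocabulary words.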

I hacked around this by reading the arpa file.

kcarnold avatar Mar 03 '20 22:03 kcarnold

Hi @kcarnold.

> I hacked around this by reading the arpa file.

Can you publish your code for doing that? Thanks.

piegu avatar Feb 10 '22 17:02 piegu

https://github.com/kcarnold/suggestion/blob/master/suggestion/lang_model.py#L15

kcarnold avatar Feb 10 '22 18:02 kcarnold
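
For reference, the ARPA workaround can be sketched roughly as follows. This is not the code from the link above, only an illustration: it assumes a standard ARPA file in which the `\1-grams:` section lists one word per line as `log-prob <whitespace> word [<whitespace> backoff]`. The function name `read_arpa_vocab` and the sample data are hypothetical.

```python
import io


def read_arpa_vocab(lines):
    """Collect the words listed in the \\1-grams: section of an ARPA file.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    vocab = []
    in_unigrams = False
    for line in lines:
        line = line.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if not in_unigrams or not line:
            continue
        if line.startswith("\\"):
            break  # reached \2-grams: or \end\, so the unigrams are done
        fields = line.split()
        if len(fields) >= 2:
            vocab.append(fields[1])  # fields[0] is the log probability
    return vocab


# Tiny made-up ARPA file used only to exercise the function.
SAMPLE_ARPA = """\\data\\
ngram 1=4
ngram 2=1

\\1-grams:
-1.0\t<unk>\t0
-0.9\t<s>\t-0.3
-0.8\t</s>
-0.7\thello\t-0.2

\\2-grams:
-0.5\t<s> hello

\\end\\
"""

vocab = read_arpa_vocab(io.StringIO(SAMPLE_ARPA))
print(vocab)
```

With a real model, the same function could be called on `open(path, encoding="utf-8")`, or on `gzip.open(path, "rt", encoding="utf-8")` for a compressed file. Note that this only works when the ARPA file is available; a binary kenlm model cannot be read this way.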

Here's a small program I wrote to extract the vocabulary (in C++):

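// Prints every word in a KenLM model's vocabulary, one per line.
#include <iostream>
#include "model.hh"  // KenLM header; may need to be "lm/model.hh" depending on your include path

using namespace lm;
using namespace lm::ngram;

// KenLM calls Add() once per vocabulary entry while the model is loading.
class PrintWords : public EnumerateVocab {
 public:
  void Add(WordIndex /*index*/, const StringPiece &str) override {
    std::cout << str << '\n';  // '\n' avoids flushing stdout after every word
  }
};

int main(int argc, char **argv) {
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " <lm>\n";
    return 1;
  }

  const char *file = argv[1];

  PrintWords enum_vocab;
  Config config;
  config.enumerate_vocab = &enum_vocab;  // register the callback before loading
  Model model(file, config);             // loading the model triggers the enumeration
  return 0;
}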

danijel3 avatar Nov 07 '23 20:11 danijel3