kenlm
Expose vocab in Python API
For ASR, we need access to the vocab in order to build a string trie and limit the beam search output.
Could you expose the vocabulary in the Python API?
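For context, here is a rough sketch of the kind of string trie meant here, assuming the vocabulary is already available as a list of words. The `VocabTrie` class and its method names are hypothetical and not part of kenlm; a decoder could call `has_prefix` to prune hypotheses whose partial word cannot be completed to any vocabulary entry.

```python
class VocabTrie:
    """Character-level prefix trie over a fixed vocabulary (illustrative only)."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        # End-of-word marker; the key None cannot collide with a character.
        node[None] = True

    def _walk(self, text):
        """Return the node reached by following text, or None if there is no such path."""
        node = self.root
        for ch in text:
            node = node.get(ch)
            if node is None:
                return None
        return node

    def has_prefix(self, prefix):
        """True if at least one vocabulary word starts with prefix."""
        return self._walk(prefix) is not None

    def __contains__(self, word):
        """True only if word is a complete vocabulary entry."""
        node = self._walk(word)
        return node is not None and None in node
```

During beam search, a partial word such as "ca" would be kept if `trie.has_prefix("ca")` holds and dropped otherwise.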
I hacked around this by reading the arpa file.
Hi @kcarnold.
> I hacked around this by reading the arpa file.
Can you publish your code for doing that? Thanks.
https://github.com/kcarnold/suggestion/blob/master/suggestion/lang_model.py#L15
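For readers who cannot follow the link, the general idea can be sketched as below. This is not the linked code, only a minimal illustration that assumes a text ARPA file with the usual `\1-grams:` section, where each entry is a log probability, the word, and an optional backoff weight. The function name `read_arpa_vocab` is hypothetical.

```python
def read_arpa_vocab(lines):
    """Collect the words listed in the \\1-grams: section of an ARPA file.

    lines is any iterable of text lines, for example an open file object.
    """
    vocab = []
    in_unigrams = False
    for raw in lines:
        line = raw.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if not in_unigrams:
            continue  # still in the \data\ header
        if line.startswith("\\"):
            break  # reached the next section (\2-grams:) or \end\
        if not line:
            continue  # blank separator line
        # Unigram entry: log10 probability, word, optional backoff weight.
        fields = line.split()
        if len(fields) >= 2:
            vocab.append(fields[1])
    return vocab
```

It could be used as `with open("model.arpa", encoding="utf-8") as f: vocab = read_arpa_vocab(f)`, where the path is a placeholder. The result includes special tokens such as `<unk>`, `<s>` and `</s>` if the file lists them, and the approach does not work on kenlm's binary format, which is not plain text.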
Here's a small program I wrote to extract the vocabulary (in C++):
#include <iostream>
#include "model.hh"

using namespace lm;
using namespace lm::ngram;

// Callback that prints every vocabulary word it is given.
class PrintWords : public EnumerateVocab {
 public:
  void Add(WordIndex index, const StringPiece &str) override {
    std::cout << str << std::endl;
  }
};

int main(int argc, char **argv) {
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " <lm>\n";
    return 1;
  }
  char *file = argv[1];
  PrintWords enum_vocab;
  Config config;
  // Register the callback before the model is loaded.
  config.enumerate_vocab = &enum_vocab;
  // Loading the model calls Add() for the vocabulary words.
  Model model(file, config);
  return 0;
}