
Improve language model lookup speed

Open foxik opened this issue 10 years ago • 0 comments

If we sort out #10, that will leave us with the following complexity of language model lookups

  • 38.3/(100-47.6) = 73.1% of time spent in language model lookup in the diacritics_h2mor.conf configuration
  • 30.6/(100-44.2) = 54.8% of time spent in language model lookup in the spellchecking_h2mor.conf configuration
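The percentages above follow from simple arithmetic (the lookup time divided by the total time remaining after the #10 overhead is removed); a quick sanity check, with an illustrative helper name:

```cpp
#include <cmath>

// Share of the remaining runtime spent in LM lookup, once the overhead
// addressed by #10 (removed_pct of total time) is gone.
inline double lm_share(double lookup_pct, double removed_pct) {
  return lookup_pct / (100.0 - removed_pct);
}

// diacritics_h2mor.conf:    lm_share(38.3, 47.6) ≈ 0.731
// spellchecking_h2mor.conf: lm_share(30.6, 44.2) ≈ 0.548
```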

One way to go is to use #3 and hope KenLM is faster.

Alternatively, we could improve the existing LM implementation:

  • one possible improvement is to allow searching the LM using a "state" plus "the next word in the sequence", instead of supplying all N words every time (i.e., the sequence being searched would be identified by a state; supplying another word would then yield a new state together with information about the last n-gram)
  • what I like better is to change the LM representation entirely (this would also cover the previous point). Instead of a tree structure, we would have a hash table for every order (i.e., one for unigrams, another for bigrams, etc.). The hash tables would store only hashes, not the n-grams themselves, i.e., a hash position H would contain information about all n-grams with hash H merged together. Although this merges some of the n-grams, it would probably have little impact on accuracy. Searching in this structure would be really fast (and the structure can be quite large, as it contains only an array of n-gram statistics, so the number of collisions could be quite small). Also, the "state" of the current sentence can simply be the hash of the last n-gram, and by using an appropriate hash we can update it to the hash of the n-gram resulting from appending a word to the current n-gram in constant time. I would guess that such an implementation could be something like 5-10 times faster than the current one (and probably also smaller than the current LM models). [Edited] The above does not deal with non-existing n-grams, but these can be handled in several ways (a global Bloom filter; a small Bloom filter for every hash table; or perfect hashing that also stores the hashes)
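A minimal sketch of the second idea, under illustrative assumptions: a polynomial rolling hash and std::unordered_map stand in for whatever hash function and open-addressing table a real implementation would use, and all names (NGramStats, HashedLM, LMState) are hypothetical, not Korektor's:

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

// Merged statistics for all n-grams that share one hash value.
struct NGramStats { float log_prob = 0.f; float backoff = 0.f; };

constexpr uint64_t kBase = 0x100000001B3ull;  // arbitrary odd multiplier

// State = hash of the last up-to-N words; the words themselves are kept
// only so the oldest one can be dropped when the window slides.
struct LMState {
  std::deque<uint64_t> words;
  uint64_t hash = 0;
};

struct HashedLM {
  unsigned max_order;
  // tables[k] holds (k+1)-gram statistics, keyed by hash only.
  std::vector<std::unordered_map<uint64_t, NGramStats>> tables;

  explicit HashedLM(unsigned n) : max_order(n), tables(n) {}

  static uint64_t pow_base(unsigned e) {  // kBase^e (mod 2^64)
    uint64_t p = 1;
    while (e--) p *= kBase;
    return p;
  }

  // Slide the window by one word:
  //   new_hash = (old_hash - oldest * kBase^{N-1}) * kBase + word
  // Constant time for a fixed N (precompute kBase^{N-1} to avoid pow_base).
  void append(LMState& s, uint64_t word_id) const {
    if (s.words.size() == max_order) {
      s.hash -= s.words.front() * pow_base(max_order - 1);
      s.words.pop_front();
    }
    s.hash = s.hash * kBase + word_id;
    s.words.push_back(word_id);
  }

  // Look up the statistics of the n-gram the state currently covers.
  const NGramStats* lookup(const LMState& s) const {
    if (s.words.empty()) return nullptr;
    auto& t = tables[s.words.size() - 1];
    auto it = t.find(s.hash);
    return it == t.end() ? nullptr : &it->second;
  }
};
```

Note that sliding the window gives the same hash as building the trigram from scratch, which is what makes the "state" update constant-time; the non-existing-n-gram problem mentioned above would still need a Bloom filter or stored hashes on top of this.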

But using KenLM is definitely easier, more manageable and probably just as fast :-)

foxik · Apr 29 '15 15:04