
replace working on strings

renepickhardt opened this issue on Apr 05, 2014 · 0 comments

This entire issue is TBD.

We should create an index from words (tokens) to integers and then work only on sequences of integers.
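
For illustration, a minimal sketch of such an index in Java (class and method names are made up here, not taken from the toolkit):

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a word-to-id index; names are illustrative, not the toolkit's API. */
public class WordIndex {
    private final Map<String, Integer> wordToId = new HashMap<>();

    /** Returns the id of word, assigning the next free id on first occurrence. */
    public int idOf(String word) {
        Integer id = wordToId.get(word);
        if (id == null) {
            id = wordToId.size();
            wordToId.put(word, id);
        }
        return id;
    }

    /** Encodes a tokenized sentence as a sequence of integer ids. */
    public int[] encode(String[] tokens) {
        int[] ids = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++)
            ids[i] = idOf(tokens[i]);
        return ids;
    }
}
```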

If we assume a 64-bit long, we could encode 1.8 * 10^19 distinct words, which is way more than needed. Alternatively we could encode a whole 3-gram in one register using 21 bits per token, still leaving room for 2^21 ≈ 2.1 million distinct tokens.
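
A hedged sketch of what the 21-bit packing could look like (the shift layout is one possible choice, not a fixed design):

```java
/** Sketch: pack three 21-bit token ids into one 64-bit long
 *  (3 * 21 = 63 bits, one bit to spare). Names are illustrative. */
public final class Trigrams {
    static final int BITS = 21;
    static final long MASK = (1L << BITS) - 1; // ids 0 .. 2_097_151

    static long pack(int w1, int w2, int w3) {
        return ((long) w1 & MASK) << (2 * BITS)
             | ((long) w2 & MASK) << BITS
             | ((long) w3 & MASK);
    }

    /** position is 0, 1 or 2, counted from the left. */
    static int unpack(long trigram, int position) {
        return (int) ((trigram >>> ((2 - position) * BITS)) & MASK);
    }
}
```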

The English Wikipedia contains about 11 million distinct tokens, but most of them occur only once and could probably be discarded, which would bring the vocabulary closer to the 2^21 budget. The other option would be to use 32 bits per token and store a 2-gram in a 64-bit register.
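
The 32-bit variant is even simpler, since each token id occupies exactly half a long (again just a sketch):

```java
/** Sketch of the 32-bit alternative: two full int ids in one 64-bit long. */
static long packBigram(int w1, int w2) {
    return (long) w1 << 32 | w2 & 0xFFFFFFFFL;
}

static int first(long bigram)  { return (int) (bigram >>> 32); }
static int second(long bigram) { return (int) bigram; }
```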

Some statistics for the English Wikipedia:

  • 108 million 2-grams
  • 379 million 3-grams
  • 702 million 4-grams
  • 914 million 5-grams

Assuming one long per 3-gram (plus, presumably, a 4-byte count, i.e. 12 bytes per entry), we would need about 4.5 GB to store all 379 million 3-grams of the English Wikipedia, compared to the 7.9 GB the current string representation takes.

The question is whether we can also do the computations faster this way, and I would assume we can.
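
To check that assumption one could, for example, compare sorting packed trigrams as longs against sorting the equivalent strings; a crude sketch (not a rigorous benchmark, and the sizes are arbitrary):

```java
import java.util.Arrays;
import java.util.Random;

public class SortComparison {
    public static void main(String[] args) {
        int n = 2_000_000;
        Random rnd = new Random(42);
        long[] packed = new long[n];
        String[] strings = new String[n];
        for (int i = 0; i < n; i++) {
            int w1 = rnd.nextInt(1 << 21);
            int w2 = rnd.nextInt(1 << 21);
            int w3 = rnd.nextInt(1 << 21);
            packed[i] = (long) w1 << 42 | (long) w2 << 21 | w3;
            strings[i] = w1 + " " + w2 + " " + w3;
        }

        long t0 = System.nanoTime();
        Arrays.sort(packed);   // one long comparison per pair
        long t1 = System.nanoTime();
        Arrays.sort(strings);  // character-by-character comparison, extra pointer chasing
        long t2 = System.nanoTime();

        System.out.printf("longs: %d ms, strings: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```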

One would also have to look at existing binary formats and frameworks, e.g.:

  • http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html
  • http://svr-www.eng.cam.ac.uk/~prc14/toolkit_documentation.html
  • http://www.statmt.org/wmt07/pdf/WMT12.pdf
