Rene Pickhardt
The output of language models and n-grams should follow standard formats, e.g. the weighted finite state transducer (WFST) format or the ARPA format. Currently I am not sure whether more formats exist. They...
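To make the ARPA idea concrete, here is a minimal sketch of a writer for the common ARPA text layout (a `\data\` header with per-order counts, one `\N-grams:` section per order, and a closing `\end\`). The class `ArpaSketch`, its nested `Entry` type, and the six-decimal number format are assumptions made for illustration only; they are not part of the toolkit.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the toolkit: renders n-gram entries as ARPA text.
class ArpaSketch {

    // One n-gram: its words, its log10 probability, and an optional log10 back-off weight.
    static final class Entry {
        final String[] words;
        final double logProb;
        final Double backoff; // null when the entry has no back-off weight

        Entry(String[] words, double logProb, Double backoff) {
            this.words = words;
            this.logProb = logProb;
            this.backoff = backoff;
        }
    }

    // orders.get(i) holds all entries of order i + 1 (unigrams first).
    static String format(List<List<Entry>> orders) {
        StringBuilder sb = new StringBuilder("\\data\\\n");
        // Header: one count line per n-gram order.
        for (int n = 0; n < orders.size(); n++) {
            sb.append("ngram ").append(n + 1).append('=')
              .append(orders.get(n).size()).append('\n');
        }
        // Body: one section per order, fields separated by tabs.
        for (int n = 0; n < orders.size(); n++) {
            sb.append('\n').append('\\').append(n + 1).append("-grams:\n");
            for (Entry e : orders.get(n)) {
                sb.append(String.format(Locale.ROOT, "%.6f", e.logProb))
                  .append('\t').append(String.join(" ", e.words));
                if (e.backoff != null) {
                    sb.append('\t').append(String.format(Locale.ROOT, "%.6f", e.backoff));
                }
                sb.append('\n');
            }
        }
        sb.append("\n\\end\\\n");
        return sb.toString();
    }
}
```

Calling `ArpaSketch.format(...)` with a list of unigram entries and a list of bigram entries returns the whole file content as one string, which could then be written to disk.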
We could prune all rare sequences and words. Also, we don't have this token which others have. It would be interesting to play around with this, and also to see...
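A minimal sketch of count-based pruning, assuming n-gram counts are held in a map from the sequence string to its frequency. The class name `CountPruner` and the `minCount` threshold are hypothetical; the toolkit's real data structures may look quite different.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the toolkit: drops sequences seen fewer than minCount times.
class CountPruner {

    // Returns a new map containing only the entries whose count is at least minCount.
    // The input map is left unchanged.
    static Map<String, Long> prune(Map<String, Long> counts, long minCount) {
        Map<String, Long> kept = new HashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

Returning a copy keeps the original counts available, which makes it easy to compare models built with different thresholds.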
We could consider moving away from java.util.HashMap. This could speed up the aggregator as well as the Kneser-Ney smoother. trove4j seems to be an option: - http://trove4j.sourceforge.net/html/benchmarks.shtml...
This should help others to use the software. I expect the pipeline to be similar even if parts of the software get rewritten. The current processing pipeline on GitHub is...
At least for the public API we need clear Javadoc, so that people can use the classes and know how to use them intuitively.
The install script should: - ask the user how much main memory they want to spend - offer them the possibility to download and install the Stanford part-of-speech...
It might be interesting to index the n-grams using FSTs or trie-based solutions already in this toolkit. This is something that we should discuss, since this seems like a rather...
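As a starting point for that discussion, here is a minimal sketch of a trie-based n-gram index in which each node corresponds to one token and stores the count of the sequence ending there. The class `NGramTrie` and its methods are hypothetical, and a `HashMap` per node is used only for brevity; a real implementation would likely need a more compact node layout.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not part of the toolkit: stores n-gram counts in a token-level trie.
class NGramTrie {

    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        long count; // count of the n-gram that ends at this node
    }

    private final Node root = new Node();

    // Adds count to the n-gram, creating trie nodes along the path as needed.
    void add(String[] ngram, long count) {
        Node node = root;
        for (String token : ngram) {
            node = node.children.computeIfAbsent(token, k -> new Node());
        }
        node.count += count;
    }

    // Returns the stored count of the n-gram, or 0 if it was never added.
    long count(String[] ngram) {
        Node node = root;
        for (String token : ngram) {
            node = node.children.get(token);
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }
}
```

N-grams that share a prefix share the nodes for that prefix, which is the main reason a trie can be smaller than one flat map entry per sequence.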
At some point we have to think about handling large data sets such as web crawls.
This entire issue is TBD. We should create an index from words (tokens) to integers and work only on sequences of integers. If we assume 64-bit longs we can...
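The token-to-integer index could be sketched as follows. The class `TokenIndex` and its method names are hypothetical, ids are assigned in order of first appearance, and the sketch uses `int` ids only; it does not attempt the 64-bit encoding mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the toolkit: maps tokens to dense integer ids and back.
class TokenIndex {

    private final Map<String, Integer> idByToken = new HashMap<>();
    private final List<String> tokenById = new ArrayList<>();

    // Returns the id of the token, assigning the next free id the first time it is seen.
    int idOf(String token) {
        Integer id = idByToken.get(token);
        if (id == null) {
            id = tokenById.size();
            idByToken.put(token, id);
            tokenById.add(token);
        }
        return id;
    }

    // Returns the token for an id previously handed out by idOf.
    String tokenOf(int id) {
        return tokenById.get(id);
    }

    // Converts a token sequence into the corresponding id sequence.
    int[] encode(String[] tokens) {
        int[] ids = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            ids[i] = idOf(tokens[i]);
        }
        return ids;
    }

    // Number of distinct tokens seen so far.
    int size() {
        return tokenById.size();
    }
}
```

With such an index, counting and smoothing could operate on `int[]` sequences, and strings would only be needed when reading input and writing output.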
A sample data set and an install script (fitting the data set) should be provided. If the software does not have the correct data sets set up, more meaningful and...