Rene Pickhardt issues

Results: 85 issues of Rene Pickhardt

Standard Formats

The output of language models and n-grams should follow standard formats, e.g. the weighted finite-state transducer (WFST) format or the ARPA format. Currently I am not sure if more formats exist. they...

enhancement
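To make the ARPA option concrete, here is a minimal sketch of rendering n-gram entries in the usual ARPA layout (a `\data\` header with per-order counts, one `\N-grams:` section per order, and a closing `\end\`). The class `ArpaSketch`, its `Entry` type and the method `toArpa` are hypothetical names chosen for illustration; they are not part of the toolkit.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the toolkit: renders n-gram entries as ARPA text.
final class ArpaSketch {

    /** One n-gram: log10 probability, space-separated tokens, optional log10 backoff weight. */
    static final class Entry {
        final double logProb;
        final String ngram;
        final Double backoff; // null when the entry has no backoff weight

        Entry(double logProb, String ngram, Double backoff) {
            this.logProb = logProb;
            this.ngram = ngram;
            this.backoff = backoff;
        }
    }

    /** Renders entries grouped by order (index 0 holds the unigrams) as ARPA text. */
    static String toArpa(List<List<Entry>> byOrder) {
        StringBuilder sb = new StringBuilder("\\data\\\n");
        // Header: number of entries per n-gram order.
        for (int n = 0; n < byOrder.size(); n++) {
            sb.append("ngram ").append(n + 1).append('=')
              .append(byOrder.get(n).size()).append('\n');
        }
        // One section per order; fields are tab-separated.
        for (int n = 0; n < byOrder.size(); n++) {
            sb.append('\n').append('\\').append(n + 1).append("-grams:\n");
            for (Entry e : byOrder.get(n)) {
                sb.append(String.format(Locale.ROOT, "%.4f", e.logProb))
                  .append('\t').append(e.ngram);
                if (e.backoff != null) {
                    sb.append('\t').append(String.format(Locale.ROOT, "%.4f", e.backoff));
                }
                sb.append('\n');
            }
        }
        return sb.append("\n\\end\\\n").toString();
    }
}
```

The four-decimal precision is an arbitrary choice for the sketch; a real exporter would likely make it configurable.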

pruning of words and n-grams

We could prune all rare sequences and words. Also, we don't have this token which others have. It would be interesting to play around with this, also to see...

enhancement
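One simple way to experiment with pruning is a count threshold: every word or n-gram seen fewer than a given number of times is dropped. The sketch below assumes counts are held in a `Map<String, Long>`; the class `PruningSketch`, the method `prune` and the parameter `minCount` are hypothetical and only illustrate the idea.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the toolkit: count-threshold pruning.
final class PruningSketch {

    /** Returns a new map holding only the entries whose count is at least minCount. */
    static Map<String, Long> prune(Map<String, Long> counts, long minCount) {
        Map<String, Long> kept = new LinkedHashMap<>(); // keeps the input iteration order
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

Returning a new map leaves the original counts untouched, which makes it easy to compare several thresholds on the same data.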

going away from java.util.HashMap

We could consider moving away from java.util.HashMap. This could speed up the aggregator as well as the Kneser-Ney smoother. trove4j seems to be an option: - http://trove4j.sourceforge.net/html/benchmarks.shtml...

enhancement

Flow diagram

This should help others to use the software. I expect the pipeline to be similar even if parts of the software get rewritten. The current processing pipeline on GitHub is...

Javadoc

At least for the public API we need clear Javadoc so that people can use the classes intuitively.

Install Script

The install script should ask the user how much main memory they want to spend and offer them the possibility to download and install the Stanford part of speech...

indexing the ngrams

It might be interesting to index the n-grams already in this toolkit, using FSTs or trie-based solutions. This is something that we should discuss since this seems like a rather...

enhancement
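As a starting point for that discussion, a trie-based index can be sketched in a few lines: each node stands for one token, and the path from the root spells out an n-gram. The class `NgramTrieSketch` and its methods are hypothetical, and the per-node `HashMap` is only the simplest choice, not a tuned data structure.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not part of the toolkit: a token-level trie storing n-gram counts.
final class NgramTrieSketch {

    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        long count; // count of the n-gram that ends at this node
    }

    private final Node root = new Node();

    /** Adds count to the n-gram given as a token sequence, creating nodes as needed. */
    void add(String[] tokens, long count) {
        Node node = root;
        for (String token : tokens) {
            node = node.children.computeIfAbsent(token, k -> new Node());
        }
        node.count += count;
    }

    /** Returns the stored count of the n-gram, or 0 if it was never added. */
    long count(String[] tokens) {
        Node node = root;
        for (String token : tokens) {
            node = node.children.get(token);
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }
}
```

N-grams that share a prefix share nodes, so all continuations of a given history sit under a single node.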

distribute calculation via map reduce cluster

At some point we have to think about handling large data sets like web crawls.

enhancement

replace working on strings

This entire issue is TBD. We should create an index from words (tokens) to integers and work only on sequences of integers. If we assume 64-bit long we can...

enhancement
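The word-to-integer index could look roughly like the sketch below, which assigns ids in order of first appearance and converts token sequences to `int` arrays. The class `VocabularySketch` and its methods are hypothetical names, and using `int` ids is an assumption made for brevity; the sketch does not cover the 64-bit idea mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of the toolkit: bidirectional token <-> integer index.
final class VocabularySketch {

    private final Map<String, Integer> idByToken = new HashMap<>();
    private final List<String> tokenById = new ArrayList<>();

    /** Returns the id of the token, assigning the next free id on first sight. */
    int idOf(String token) {
        Integer id = idByToken.get(token);
        if (id == null) {
            id = tokenById.size();
            idByToken.put(token, id);
            tokenById.add(token);
        }
        return id;
    }

    /** Looks up the token for a previously assigned id. */
    String tokenOf(int id) {
        return tokenById.get(id);
    }

    /** Encodes a token sequence as a sequence of integer ids. */
    int[] encode(String[] tokens) {
        int[] ids = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            ids[i] = idOf(tokens[i]);
        }
        return ids;
    }
}
```

Once sequences are encoded this way, counting and smoothing can compare integers instead of strings, and the strings are only needed again when results are written out.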

be able to run the toolkit out of the box

A sample data set and an install script (fitting the data set) should be provided. If the software does not have the correct data sets set up, more meaningful and...