public_domains

Consider improving performance

nemobis opened this issue 3 years ago · 2 comments

Perhaps it's overkill to change the ranking method, but I tested this on a 180 MB text file (probably not a good idea anyway) and it still hadn't finished after some 20 hours of CPU time.

For comparison, an off-the-shelf BigramCollocationFinder.from_words(tokens).nbest(BigramAssocMeasures().pmi, 10000) takes about 5 minutes on the same machine and corpus, and it's probably easy to do better. (I cobbled together an example at https://framagit.org/nemobis/bots/-/blob/master/ngram_tlds.py .)
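For reference, the off-the-shelf NLTK approach is roughly the following sketch (the corpus path, the whitespace tokenization and the frequency filter here are just illustrative assumptions, not exactly what ngram_tlds.py does):

```python
# Rough sketch of the NLTK-based ranking mentioned above.
# Assumptions: the corpus lives in corpus.txt and a naive whitespace
# split is good enough for tokenization (it usually isn't for dirty input).
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

with open("corpus.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # optional: drop very rare bigrams before ranking

# Rank bigrams by pointwise mutual information and keep the top 10000.
top_bigrams = finder.nbest(BigramAssocMeasures().pmi, 10000)
print(top_bigrams[:20])
```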

nemobis · Nov 08 '22 07:11

Interesting @nemobis! Have you compared the results of each approach at all?

edsu · Nov 28 '22 19:11

I tried, but I was working on Italian (more difficult) and the input I used was too dirty for the simplistic BigramCollocationFinder above, so I cut my losses and threw everything away. :) The suggestions weren't outrageously bad, but the tokenization may need some tweaking.
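The kind of tokenization tweak I mean would be something like this (an illustrative sketch, not what I actually ran): keep only alphabetic tokens so punctuation and markup debris don't dominate the PMI ranking.

```python
import re

def tokenize(text):
    # Keep only runs of letters (including accented ones), lowercased,
    # so numbers, punctuation and markup don't end up in the bigrams.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
```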

nemobis · Nov 28 '22 19:11