Use Guava HashMultiset

Open ChrisHennickAtGoogle opened this issue 8 years ago • 2 comments

Guava's HashMultiset class would make text preprocessing much faster. I'd suggest converting the raw tokens from LanguageTool to a HashMultiset<String> before any further processing, then using the entrySet() method to process each distinct token only once during normalization, filtering, etc.

ChrisHennickAtGoogle avatar Jun 26 '16 03:06 ChrisHennickAtGoogle
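
For readers following along, here is a minimal sketch of the idea in the suggestion above: count the raw tokens first, then normalize and filter each distinct token once, carrying its count along. It is not kumo's actual code. To keep it free of dependencies it uses a plain JDK `LinkedHashMap` as a stand-in for Guava's `HashMultiset<String>`; with Guava, the first pass would be `HashMultiset.create(rawTokens)` and the second would iterate `entrySet()` using `getElement()` and `getCount()`. The class name and the `normalize` and `keep` helpers are hypothetical placeholders.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class DistinctTokenCounter {

    // Hypothetical stand-in for a normalization step: trim and lower-case.
    static String normalize(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    // Hypothetical stand-in for a filtering step: drop one-character tokens.
    static boolean keep(String token) {
        return token.length() > 1;
    }

    static Map<String, Integer> countNormalized(List<String> rawTokens) {
        // Pass 1: collapse the raw tokens into (token -> occurrences).
        // This map plays the role of the multiset.
        Map<String, Integer> rawCounts = new LinkedHashMap<>();
        for (String token : rawTokens) {
            rawCounts.merge(token, 1, Integer::sum);
        }

        // Pass 2: normalize and filter each distinct raw token exactly once.
        // Distinct raw tokens can normalize to the same string, so counts are summed.
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : rawCounts.entrySet()) {
            String normalized = normalize(entry.getKey());
            if (keep(normalized)) {
                result.merge(normalized, entry.getValue(), Integer::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Cloud", "cloud", "a", "word", "Cloud", "word ");
        System.out.println(countNormalized(tokens)); // prints {cloud=3, word=2}
    }
}
```

The saving comes from the second pass: its cost scales with the number of distinct tokens rather than the total token count, which matters when the input repeats a small vocabulary many times.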

I intentionally started with the mindset of not pulling in too many dependencies. But if people are interested in performance (beyond better data structures/algorithms), I'd probably hook in GS Collections (now Eclipse Collections) :)

kennycason avatar Jun 26 '16 23:06 kennycason

Coming back to this, I now realize I misunderstood your initial intent. I agree that the normalizer could probably just operate on the already-tokenized text. The current string copying/processing in Normalizer is a bit overkill.

kennycason avatar Dec 01 '17 03:12 kennycason