ankusa
TextHash should not assume everything is in English
TextHash currently assumes all input is in English and will therefore atomize, stem and skip stopwords. As I'm using Ankusa for another language, I'd prefer to skip these steps. They cost processing time, cause unnecessary GC, skew the results and can have unintended consequences: who says "-ing" isn't relevant information in other languages? What if "-" should not be replaced by " " but by ""? What if I have highly relevant words that are < 3 chars or include a number? Example: "G7-Summit".
Ankusa should not attempt to manipulate the input if I'm passing TextHash an array instead of a string. Currently it will attempt all of the above with each item of an array, instead of assuming that the input is already properly tokenized and therefore calling add_word(word) directly.
If you'd like i18n support, then there are a few changes in the project that would have to be made, for instance adding the ability to have other lists of stopwords as well as per-language versions of TextHash. This is not something I'll have time for in the near future, but if you'd like to submit a pull request (after talking through some ideas) I'd be happy to merge. Changing the API of TextHash (for instance, to not process elements in an array as you suggest) is not something that I would consider because it would break too much existing code in the projects that use this library.
As for i18n support, I don't have the necessary knowledge to add it to the gem. Regarding the API there might not even be a need for breaking it, as currently TextHash#initialize expects up to two arguments, but all three instances of it being called only pass the first argument.
What if the second (currently unused) argument in #initialize was used to define whether the input should be tokenized, stemmed and stopword-filtered? It can default to true (not breaking anything), but when set to false, TextHash could add the input directly without modifying it.
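To make the idea concrete, here is a rough, self-contained sketch. It is not Ankusa's actual code: the class name SketchTextHash, the preprocess argument, the tiny stopword list and the placeholder stemmer are all made up for illustration, and the filtering rules (replace "-" with " ", drop words under 3 chars or containing a digit) are only my reading of the behaviour described above. I also assume that with the flag set to false each array element (or a whole string) is stored as a single token.

```ruby
# Hypothetical sketch only; not Ankusa's TextHash.
class SketchTextHash < Hash
  # Tiny illustrative stopword list (assumption, not Ankusa's list).
  STOPWORDS = %w[the and for with that this].freeze

  attr_reader :word_count

  # preprocess = true  -> English-style atomize / filter / stem (the default)
  # preprocess = false -> input is treated as already tokenized
  def initialize(text = nil, preprocess = true)
    super(0) # missing keys count as 0
    @word_count = 0
    @preprocess = preprocess
    add_text(text) unless text.nil?
  end

  def add_text(text)
    if text.is_a?(Array)
      text.each { |item| add_text(item) }
    elsif @preprocess
      atomize(text).each { |w| add_word(naive_stem(w)) if valid_word?(w) }
    else
      add_word(text) # stored untouched
    end
    self
  end

  def add_word(word)
    @word_count += 1
    self[word.to_sym] += 1
  end

  private

  # Lowercase, turn "-" into " ", drop other punctuation, split on whitespace.
  def atomize(text)
    text.downcase.tr('-', ' ').gsub(/[^a-z0-9\s]/, '').split
  end

  # Skip short words, words containing digits, and stopwords.
  def valid_word?(word)
    word.length >= 3 && word !~ /\d/ && !STOPWORDS.include?(word)
  end

  # Placeholder for a real stemmer: only strips a trailing "ing".
  def naive_stem(word)
    word.sub(/ing\z/, '')
  end
end

SketchTextHash.new("G7-Summit").keys            # => [:summit]
SketchTextHash.new(["G7-Summit"], false).keys   # => [:"G7-Summit"]
```

With the default, "G7" is lost because it is shorter than 3 chars and contains a digit. With false, the token survives unchanged. Callers that pass only one argument would see no difference.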
It may not be used internally, but I know for certain that second argument is being used by other projects and libraries.
(Sent by email in reply to maia's comment of Oct 13, 2014, 10:18.)
That just opened a can of worms. :)