Add some symbols to punctuation in strip_punctuation
Hi,
Since " and ' are considered punctuation in English, I thought it would be a good idea to add these characters to the strip_punctuation! function in the preprocessing module. I don't know whether there is a reason for leaving them out of the regex, but I needed them in a project of mine, so here is a patch in case it is useful to others too.
Best, Remusao
This is tricky. Unlike other punctuation, single quote marks often occur within tokens, so stripping them causes a lot of problems. We should see what other systems do.
I agree. Why not let the user choose? Or simply strip ' and " at the beginning and end of the string instead of everywhere? That would preserve tokens that contain these symbols. In my case I mainly wanted to avoid tokens like "toto
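A minimal sketch of the second option, assuming a recent Julia (1.x) and reading "the string" as each whitespace-separated token. The names QUOTE_CHARS, strip_outer_quotes and strip_outer_quotes_in_text are hypothetical and not part of the package:

```julia
# Hypothetical helpers (not part of the package): remove quote marks only at
# the edges of a token, leaving apostrophes inside the token untouched.
const QUOTE_CHARS = ['"', '\'']

# Base.strip accepts a collection of characters to trim from both ends.
strip_outer_quotes(token::AbstractString) = String(strip(token, QUOTE_CHARS))

# Apply the edge-stripping to every whitespace-separated token of a text.
function strip_outer_quotes_in_text(text::AbstractString)
    return join((strip_outer_quotes(t) for t in split(text)), " ")
end
```

With this, "toto becomes toto while don't is left alone. One caveat: a trailing apostrophe that is part of the word (as in dogs') is also removed, so this is a compromise rather than a full answer.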
Let's see what R's tm and Python's NLTK do, then make a decision.
Would it also be possible to add "[" and "]" to this regex? I had some problems with the remove_words! function because my corpus contained such brackets and the closing ] was missed. Perhaps a cleaner fix would be to update remove_words! so that it escapes regex syntax in each word. Something like:
regexSigns = split("[]{}*()", "")
for sign in regexSigns
    # Prefix each literal metacharacter with a backslash (Julia 1.x Pair syntax)
    word = replace(word, sign => string("\\", sign))
end
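The same idea can be sketched as a single-pass helper that also covers the remaining common regex metacharacters. This assumes Julia 1.x and assumes remove_words! builds a regex from each word, which I have not checked. The names REGEX_SPECIALS and escape_regex are hypothetical:

```julia
# Hypothetical helper (not part of the package): backslash-escape every
# character that has a special meaning in a regex, in one pass over the word.
const REGEX_SPECIALS = Set("\\^\$.|?*+()[]{}")

function escape_regex(word::AbstractString)
    io = IOBuffer()
    for c in word
        # Emit a backslash before each metacharacter, then the character itself.
        c in REGEX_SPECIALS && print(io, '\\')
        print(io, c)
    end
    return String(take!(io))
end
```

For example, Regex(escape_regex("a[1]")) then matches the literal text a[1] instead of failing on the unbalanced-looking brackets. Escaping in one pass avoids the risk of escaping a backslash that an earlier replacement has just inserted.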