TextAnalysis.jl

Add some symbols to punctuation in strip_punctuation

Open remusao opened this issue 12 years ago • 4 comments

Hi,

Since " and ' are considered punctuation in English, I thought it would be a good idea to add these characters to the function strip_punctuation! in the preprocessing module. I don't know if there is a reason for not including them in the regex, but I needed them in a project of mine, so here is a patch in case it is useful for others too.

Best, Remusao

remusao avatar Nov 12 '13 11:11 remusao

This is tricky. Unlike other punctuation, single quote marks often occur within tokens, so stripping them causes a lot of problems. We should see what other systems do.

johnmyleswhite avatar Nov 12 '13 15:11 johnmyleswhite

I agree. Why not let the user choose? Or simply strip ' and " at the beginning and end of the string instead of everywhere? That would preserve tokens containing these symbols. In my case, I mainly wanted to avoid tokens like "toto
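
To illustrate the edge-only idea, here is a minimal sketch that applies it per whitespace-separated token rather than to the whole string. The function name `strip_edge_quotes` is hypothetical and not part of TextAnalysis.jl, and the sketch assumes it is acceptable to collapse runs of whitespace into single spaces.

```julia
# Hypothetical helper, not part of TextAnalysis.jl: removes ' and " only
# at the start and end of each token, leaving inner quotes untouched.
function strip_edge_quotes(text::AbstractString)
    tokens = split(text)  # split on whitespace
    cleaned = [strip(token, ['\'', '"']) for token in tokens]
    return join(cleaned, " ")
end

# The apostrophe inside "don't" is kept; the quotes at token edges are removed.
println(strip_edge_quotes("\"toto said don't 'stop'\""))
```

A token made only of quote characters becomes an empty string here, so a real implementation would probably want to drop such tokens as well.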

remusao avatar Nov 12 '13 15:11 remusao

Let's see what R's tm and Python's NLTK do, then make a decision.

johnmyleswhite avatar Nov 12 '13 16:11 johnmyleswhite

Would it also be possible to add "[" and "]" to this regex? I had some problems with the remove_words! function because there were such brackets in my corpus and the closing ] was missed. Perhaps it would be cleaner to update remove_words! so that it escapes regex syntax in each word. Something like:

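# Escape regex metacharacters so that `word` is matched literally.
# Uses the Pair form of replace, which current Julia versions require.
regexSigns = split("[]{}*()", "")
for sign in regexSigns
    word = replace(word, sign => string("\\", sign))
end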
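
An alternative to escaping each metacharacter by hand is PCRE's `\Q...\E` quoting, which Julia's `Regex` accepts because it is built on PCRE. The sketch below is only an illustration of that approach: the name `literal_word_regex` is hypothetical, and how remove_words! actually builds its pattern is not shown in this thread.

```julia
# Hypothetical helper, not part of TextAnalysis.jl: builds a regex that
# matches `word` literally by wrapping it in PCRE's \Q...\E quoting.
function literal_word_regex(word::AbstractString)
    # The quoted section would end early if the word itself contained "\E".
    occursin("\\E", word) && error("word contains \\E; escape it manually")
    return Regex(string("\\Q", word, "\\E"))
end

# Brackets in the word are treated as plain characters, not a character class.
cleaned = replace("see [1] and [2]", literal_word_regex("[1]") => "")
println(cleaned)
```

This covers every metacharacter at once, including ones missing from the list above such as `.`, `+`, `?`, `|`, `^`, and `$`.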

karl-kurzke avatar Feb 08 '14 21:02 karl-kurzke