sentometrics
A possible way to deal with elisions
The current tokenization leaves French elisions attached to their words, which prevents some sentiment words from being identified when computing sentiment. For example, "l'abandon" is not identified as negative, whereas "abandon" is a negative word in the French Loughran-McDonald lexicon.
This pull request adds an argument to `compute_sentiment()`, defaulting to `TRUE`, that simply removes a number of elision patterns at the beginning of each word. I'm not certain how this affects other languages, but I don't see how to make a language-specific filter with the current implementation.
See the test file for an example.
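The elision-stripping idea described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper name `remove_elisions` and the exact list of elision prefixes are assumptions based on the description.

```r
# Hypothetical sketch of the elision-removal step described in this PR.
# The function name and the prefix list are assumptions, not the PR's code.
remove_elisions <- function(tokens) {
  # Common French elision prefixes: l', d', j', m', n', s', t', c', qu'.
  # Match both the straight apostrophe (') and the typographic one (U+2019).
  pattern <- "^(l|d|j|m|n|s|t|c|qu)['\u2019]"
  gsub(pattern, "", tokens, ignore.case = TRUE)
}

remove_elisions(c("l'abandon", "qu'il", "d'accord", "chat"))
# -> "abandon" "il" "accord" "chat"
```

Applied before lexicon matching, this lets "l'abandon" match the lexicon entry "abandon"; the open question raised above is whether stripping these prefixes is safe for non-French documents.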
Nice addition, well documented, and a good unit test! Some feedback:
- Prefer to change `remove_elisions` to `do.removeElisions` (consistent with the naming of logicals, cf. `do.ignoreZeros`).
- Because you're not sure about the impact on other languages, and to not break existing examples or scripts, it might be smarter to let the new argument default to `FALSE`? Your choice.
- You'll also have to add the new argument to the `ctr_agg()` function.
- You can also add yourself as a contributor in the DESCRIPTION file, and change the version to 1.1.0.
Once the changes are pushed, we can merge.