A possible way to deal with elisions

Open odelmarcelle opened this issue 3 years ago • 1 comment

The current tokenization leaves French elisions attached to the words that follow them. As a result, some sentiment words are missed when computing sentiment. For example, "l'abandon" is not identified as negative, even though "abandon" is a negative word in the French LoughranMcDonald lexicon.

This pull request adds an argument to compute_sentiment, defaulting to TRUE, that removes a number of elision patterns from the beginning of each word. I'm not certain how this affects other languages, but I don't see how to make the filter language-specific with the current implementation.

See the test file for an example.
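The test file is not reproduced in this thread. As a rough, language-neutral illustration of the idea (not the package's R code), here is a minimal Python sketch. The `ELISION` pattern, the toy `NEGATIVE_WORDS` set, and the `tokenize` and `count_negative` helpers are all hypothetical names made up for this example. The actual list of elision patterns in the pull request may differ.

```python
import re

# Hypothetical pattern: common French elided forms (l', d', j', qu', ...) at the
# start of a token, with either a straight or a typographic apostrophe.
ELISION = re.compile(r"^(?:qu|[cdjlmnst])['\u2019]", flags=re.IGNORECASE)

# Toy stand-in for a negative word list; not the real lexicon.
NEGATIVE_WORDS = {"abandon"}


def tokenize(text, remove_elisions=False):
    """Split on whitespace, lowercase, and optionally strip leading elisions."""
    tokens = text.lower().split()
    if remove_elisions:
        tokens = [ELISION.sub("", tok) for tok in tokens]
    return tokens


def count_negative(text, remove_elisions=False):
    """Count tokens that appear in the toy negative word list."""
    return sum(tok in NEGATIVE_WORDS for tok in tokenize(text, remove_elisions))


sentence = "Le conseil regrette l'abandon du projet"
print(count_negative(sentence))                        # 0: "l'abandon" is not in the list
print(count_negative(sentence, remove_elisions=True))  # 1: "abandon" is matched
```

Because the pattern is anchored at the start of the token, an English contraction such as "don't" is left alone, but a token such as "c'mon" would be reduced to "mon". That is one way a filter like this could affect text in other languages.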

odelmarcelle, Dec 03 '21 13:12

Nice addition, well documented & good unit test! Some feedback:

  • Prefer to change remove_elisions to do.removeElisions (consistent with naming of logicals, cf. do.ignoreZeros).
  • Because you're not sure about the impact on other languages, and to avoid breaking existing examples or scripts, it might be smarter to let the new argument default to FALSE? Your choice.
  • You'll also have to add the new argument to the ctr_agg() function.
  • You can also add yourself as a contributor in the DESCRIPTION file, and change the version to 1.1.0.

Once the changes are pushed, we can merge.

sborms, Jan 01 '22 14:01