
enabling existing international nrc lexicon in get_sentiments()

Open LeWaHe opened this issue 4 years ago • 3 comments

Hi, I love learning tidytext, but I was a bit surprised to see that get_sentiments() does not allow using the non-English translations included in the Nov 2017 nrc lexicon v0.92 xlsx file used by tidytext. In that file, the English words are in column A, their translations into dozens of languages are in columns B to DA, and columns DB to DK list the polarity and sentiment scores for each word. It would be amazing to add an argument that defines which language (column) of the nrc lexicon to use, e.g. lang = "French". Thanks, Leonard

LeWaHe, Apr 30 '20 14:04

The NRC-Emotion-Lexicon.zip file that is currently downloaded via the function in the textdata package does include the .xlsx file you mention. Using these translations is within the permission we have from the lexicon creators, although of course translated sentiment lexicons can be less reliable.

@EmilHvitfeldt do you want to consider this in textdata?

juliasilge, Apr 30 '20 16:04

I'm on it!

EmilHvitfeldt, Apr 30 '20 16:04

Thank you for your answers. It is great to know that using the translations is within the permissions from the lexicon creators. I concur that a translated lexicon can be less reliable than a natively created one. However, (i) for analyses comparing corpora across different languages, a single lexicon would be more reliable than a patchwork of different lexicons, and (ii) many languages spoken by millions of people still lack reliable native lexicons. Thanks

LeWaHe, Apr 30 '20 16:04
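
For readers who want to see what the request in the opening post amounts to, here is a minimal sketch in R. It is not the tidytext or textdata implementation. It assumes a wide table shaped like the spreadsheet described above (one column per language, followed by 0/1 columns per sentiment) and uses a tiny made-up stand-in, `nrc_wide`, so that it runs without the real file. The helper name `nrc_for_language`, the column names, and the example rows are all hypothetical; the real spreadsheet headers may differ.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the wide spreadsheet layout described in the issue.
# Column names and rows are illustrative assumptions, not the real file.
nrc_wide <- tibble(
  English  = c("abandon", "happy"),
  French   = c("abandonner", "heureux"),
  Positive = c(0, 1),
  Negative = c(1, 0),
  Fear     = c(1, 0),
  Joy      = c(0, 1)
)

# Hypothetical helper (not part of tidytext or textdata): pick one
# language column and return a long word/sentiment table, similar in
# shape to what get_sentiments("nrc") returns for English.
nrc_for_language <- function(wide, lang = "English",
                             sentiments = c("Positive", "Negative",
                                            "Fear", "Joy")) {
  wide %>%
    # keep the chosen language column (renamed to `word`) and the flags
    select(word = all_of(lang), all_of(sentiments)) %>%
    # one row per word/sentiment pair
    pivot_longer(all_of(sentiments),
                 names_to = "sentiment", values_to = "flag") %>%
    # keep only the sentiments that are flagged for the word
    filter(flag == 1) %>%
    transmute(word, sentiment = tolower(sentiment))
}

nrc_french <- nrc_for_language(nrc_wide, lang = "French")
nrc_french
```

The resulting table could then be joined to tokenized text with `dplyr::inner_join()` on `word`, in the same way the English lexicon is typically used.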