sentometrics
sentometrics copied to clipboard
Danger when using `tokens` from an un-ordered corpus
I like tokenizing text myself before using compute_sentiment(). My usual workflow is to start from a quanteda::corpus, from which I create both a sento_corpus and a quanteda::tokens object.
I just realized that since as.sento_corpus() re-orders the quanteda::corpus, the order of the sento_corpus and the tokens object no longer match. This leads to sentiment being allocated to the wrong texts.
I realize that in an ideal world, the safest way would be to use as.list(tokens(x)) when calling compute_sentiment(). But I feel this error is very difficult to notice, as there is no warning, and I can see situations where you would handle tokenization separately from the sento_corpus object.
Reproducible example:
> library(quanteda)
> library(sentometrics)
> e <- data.frame(text = c("good good good", "bad bad bad bad bad"), date = c("2000-01-26", "2000-01-03"))
>
> corp <- corpus(e)
> st <- as.sento_corpus(corp)
We detected no features, so we added a dummy feature 'dummyFeature'.
> lex <- sento_lexicons(list_lexicons["GI_en"])
>
> compute_sentiment(st, lex)
id date word_count GI_en--dummyFeature
1: text2 2000-01-03 5 -1
2: text1 2000-01-26 3 1
> compute_sentiment(st, lex, tokens = as.list(tokens(corp)))
id date word_count GI_en--dummyFeature
1: text2 2000-01-03 3 1
2: text1 2000-01-26 5 -1
I thought the wrong ordering was fixed in this commit 6efe36742309282a35c52780ceaed6df73c8fe04, and version 0.8.2. The function as.sento_corpus() refers to sento_corpus() for the ordering.
What version are you using?
What is the output when you do compute_sentiment(corp, lex)?
The issue is not a wrong order - the re-ordering by sento_corpus() is correct. The danger comes from the fact that the initial corpus is un-ordered, and so is the tokens object constructed from it. Carelessly passing this tokens object to compute_sentiment() creates the issue.
I believe there should be some sort of warning or check that prevents using the tokens argument when the order does not match.
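One way such a check could work (a sketch only; check_tokens_order() is a hypothetical helper, not part of sentometrics) is to compare the names that as.list(tokens(...)) carries over with the document IDs of the corpus:

```r
library(quanteda)

# Hypothetical safeguard: warn when the names of the tokens list do not
# match the document IDs (and hence the order) of the corpus 'x'.
check_tokens_order <- function(x, tokens) {
  ids <- quanteda::docnames(x)
  if (is.null(names(tokens)) || !identical(names(tokens), ids)) {
    warning("The 'tokens' input does not match the document order of 'x'; ",
            "sentiment may be allocated to the wrong texts.")
  }
  invisible(NULL)
}
```

Since as.list(tokens(...)) keeps the document names, comparing names against docnames(x) is cheap relative to re-tokenizing.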
Version is 0.8.4, and here are some other outputs
> compute_sentiment(st, lex, tokens = as.list(tokens(corp)))
id date word_count GI_en--dummyFeature
1: text2 2000-01-03 3 1
2: text1 2000-01-26 5 -1
> compute_sentiment(st, lex, tokens = as.list(tokens(st)))
id date word_count GI_en--dummyFeature
1: text2 2000-01-03 5 -1
2: text1 2000-01-26 3 1
> compute_sentiment(corp, lex)
id word_count GI_en
1: text1 3 1
2: text2 5 -1
Alright, got it, that's good news at least! Outputs make sense.
In the documentation of compute_sentiment(), the tokens argument already carries this note: "... Make sure the tokens are constructed from (the texts from) the x argument, are unigrams, and preferably set to lowercase, otherwise, results may be spurious and errors could occur. ...". It hints to the user that it is their own responsibility.
What do you suggest to efficiently compare the corpus input with the tokens input?
Possible clean solution: whenever tokens is not NULL, print a message() saying "Make sure the tokens are constructed from (the texts from) the x argument!". Not sure there is more we can do.
I think a printed message could help, especially given that sento_corpus() re-orders the corpus.
Alternatively, tokens could expect a named list whose names are the texts' IDs.
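With named input, realignment becomes a simple subset by ID. A rough sketch of what the caller (or compute_sentiment() internally) could do, assuming the sento_corpus keeps quanteda-style docnames:

```r
library(quanteda)

tok <- as.list(tokens(corp))   # names are the original document IDs
tok <- tok[docnames(st)]       # reorder to match the sento_corpus order
# compute_sentiment(st, lex, tokens = tok)  # both inputs now line up
```

This keeps tokenization fully under the user's control while making the ordering explicit.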
Sounds good. I prefer the first option, it’s the least invasive one, although the second one is a bit safer. Feel free to adapt and file a pull request, otherwise we’ll take this up later.