
Sentiment analysis with tidymodels and #TidyTuesday Animal Crossing reviews | Julia Silge

utterances-bot opened this issue 3 years ago • 12 comments


A lot has been happening in the tidymodels ecosystem lately! There are many possible projects we on the tidymodels team could focus on next; we are interested in gathering community feedback to inform our priorities.

https://juliasilge.com/blog/animal-crossing/

utterances-bot avatar May 27 '21 05:05 utterances-bot

@juliasilge, this is the second piece of work I have watched from you and I am very impressed. Thank you so much for all the great work. I already used your linear SVM model from the TV shows vs. movies video. After watching this video, I noticed I could develop this kind of model too, where I have Healthy (positive) vs. Sick (negative) labels. My question is: as far as I understood, you basically use a sort of logistic regression for classification, whereas I was expecting to see some use of sentiment analysis packages such as AFINN, bing, or nrc to score the texts and then cross-validate the scores against the grades in the data. Could you please explain the drawbacks of the standard sentiment packages that made you develop a lasso regression for scoring (with penalties)? Again, I truly appreciate you sharing your extensive knowledge with us.

BehnamCA avatar May 27 '21 05:05 BehnamCA

@BehnamCA In this blog post, we build a model to learn which tokens are predictive of text being scored positively or negatively; this is a great approach when you have labeled text data and almost always better than using sentiment lexicons. If you have unlabeled text data and want to estimate the affect/sentiment content of the text, sentiment lexicons can be a good option to use.
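To make the lexicon option concrete, here is a minimal sketch of scoring unlabeled text with the AFINN lexicon via tidytext; the `reviews` tibble and its contents are made-up examples, not data from the blog post.

```r
library(dplyr)
library(tidytext)

# Hypothetical unlabeled reviews
reviews <- tibble(
  id = 1:2,
  text = c("what a delightful and relaxing game",
           "a tedious and frustrating grind")
)

# Tokenize, join each word to its AFINN value, and sum per review
reviews %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(id) %>%
  summarize(sentiment = sum(value))
```

This estimates affect without any labels, but it can only see words that appear in the lexicon; a supervised model trained on labeled grades, as in the post, learns which tokens actually predict the outcome.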

juliasilge avatar May 27 '21 22:05 juliasilge

@juliasilge, thanks a lot. It was very helpful as always.

BehnamCA avatar May 28 '21 03:05 BehnamCA

I have learnt a lot from your tutorials. Why should we center and scale the data for this model? I tried searching on Google, but I still can't find the answer. Could you explain this?

nguyenlovesrpy avatar Sep 01 '21 08:09 nguyenlovesrpy

In our book we have a set of recommended preprocessing steps for different models. For glmnet in particular, check out the "Preprocessing requirements" in the parsnip docs to see what happens there.
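As a sketch of what that preprocessing looks like in a recipe, the step below centers and scales numeric predictors before a regularized glmnet model; glmnet penalizes coefficients, so predictors need to be on the same scale for the penalty to treat them fairly. The `train_data` data frame and `grade` outcome are placeholders, not the blog post's actual objects.

```r
library(tidymodels)

# Hypothetical training data with an outcome column `grade`
rec <- recipe(grade ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors())  # center and scale

lasso_spec <- logistic_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet")

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lasso_spec)
```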

juliasilge avatar Sep 01 '21 15:09 juliasilge

Although text mining is not my major field, it is always very helpful to watch your model fitting process. One question about this screencast: you created the test dataset but did not use it when you called the last_fit() function. What's the reason for that? Did the workflow handle it already?

conlelevn avatar May 13 '22 04:05 conlelevn

@conlelevn The last_fit() function takes the split as its input, which contains the info on both the training and testing data sets. This function will fit one final time to the training data and evaluate one final time on the testing data.
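A short sketch of that pattern, with hypothetical `reviews` data and a workflow `wf` assumed to be defined earlier:

```r
library(tidymodels)

# The split object carries both partitions
split <- initial_split(reviews, strata = grade)

# last_fit() trains on training(split) and evaluates on testing(split)
final_res <- last_fit(wf, split)

collect_metrics(final_res)  # test-set metrics
```

You never call `testing(split)` yourself; `last_fit()` pulls both partitions out of the split object, which is why the explicitly created test set looks unused.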

juliasilge avatar May 13 '22 04:05 juliasilge

Thank you, Julia, for your blogs; I learn a lot from you. I have a question: why did you say the histogram of the number of words in each review is a weird distribution?

mohamedelhilaltek avatar Aug 12 '23 06:08 mohamedelhilaltek

@mohamedelhilaltek I believe you are referring to when I was looking at this plot:

[plot: histogram of the number of words per review]

Notice that sharp drop/dip around 100 words or so? Almost certainly that can't be a "real", natural distribution for the number of words people use in their reviews; instead it is an artifact of how this data was collected.
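A quick way to reproduce that histogram, assuming a `user_reviews` data frame with a `text` column as in the #TidyTuesday dataset (the binwidth here is an arbitrary choice for illustration):

```r
library(dplyr)
library(ggplot2)
library(tidytext)

user_reviews %>%
  mutate(review_id = row_number()) %>%
  unnest_tokens(word, text) %>%          # one row per word
  count(review_id, name = "n_words") %>% # words per review
  ggplot(aes(n_words)) +
  geom_histogram(binwidth = 10)
```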

juliasilge avatar Aug 12 '23 21:08 juliasilge

Hi @juliasilge, sorry for asking. Is it possible that the package vip has changed some specifications from 2020 to 2023? I have done exactly the same steps that you have done, but even though the predictions are comparable, the Importance values are really different. Yours go 0 to 15 and mine are 0 to 0.5. In fairness, I don't really know what you do when you use vi, so I am following along without really knowing what I'm doing in that part, but the difference in these numbers is astonishing and I can't figure out why. Any help to get to the bottom of this would be greatly appreciated.

Thank you again for all your hard work with these videos and for the interesting tutorials that you have posted on YouTube. I've gained a wealth of knowledge thanks to you.

acarpignani avatar Nov 30 '23 09:11 acarpignani

@acarpignani No need to apologize! I think the vip documentation is pretty good at explaining what it's doing for different models. In the case here, we are using model-specific variable importance (i.e. getting the variable importance from the structure of the model itself), and you can read about what that means for different kinds of models.

I'm using glmnet here and it looks like there have definitely been some fixes for glmnet since I wrote this blog post. The differences are probably due to that. 👍
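For a glmnet model, model-specific importance is based on the magnitude of the (regularized) coefficients, so the absolute numbers depend on the penalty value and the package version. A hedged sketch, assuming `final_res` is the object returned by `last_fit()` and that the chosen penalty is 0.01 (both assumptions, not values from the post):

```r
library(tidymodels)
library(vip)

final_res %>%
  extract_fit_parsnip() %>%      # pull out the underlying glmnet fit
  vi(lambda = 0.01) %>%          # coefficients at the chosen penalty
  slice_max(abs(Importance), n = 20)
```

Because the importance values are coefficient magnitudes rather than a normalized score, two versions of the package (or two penalty values) can produce very different scales while still ranking the same tokens highly.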

juliasilge avatar Nov 30 '23 17:11 juliasilge

Thank you ever so much, @juliasilge. It must be because of that. It's a shame, because now the importance bar charts give me far fewer variables, most of them being zero. But I got the overall idea of this tutorial. By the way, I wholly adore the way you reshape the data sets to make the data set you need. I wish I were even 1% as good as you at that 😍

acarpignani avatar Dec 01 '23 12:12 acarpignani