
Text Features Library

Objective: Provide a comprehensive list of tokenizers, features, and general NLP things used for text analysis, with examples. The initial focus is on features used for Twitter data and sentiment analysis.

Packages Used: nltk, spacy, pandas, scikit-learn

Notebooks

Features

(roughly in order of increasing complexity)

  • [Count of ALL CAPS words](notebooks/ALL CAPS.ipynb) (sketched below)
  • [Count of Bible Verses](notebooks/Bible Verses.ipynb)
  • [Word n-grams (Bag of Words)](notebooks/Word n-grams - Bag Of Words - BOW.ipynb)
  • [Character n-grams](notebooks/Character n-grams.ipynb)
  • [Hashtag Counts](notebooks/Hashtag Counts - Bag of Hashtags.ipynb)
  • [Hashtag Bag of Words](notebooks/Hashtag BOW.ipynb)
  • [Named Entity Counts](notebooks/Named Entities Count - Bag of Named Entities.ipynb)
  • [Word n-grams - BOW - Sequence Split in Half](notebooks/Word n-grams - BOW - Sequence Split in Half.ipynb) ☨
  • [Brown Word Cluster assignments](notebooks/TweetNLP - Brown Word Clusters.ipynb)
  • [Bing Liu Lexicon Derived Features](notebooks/Bing Liu Lexicon Features.ipynb)
  • [NRC Hashtag Sentiments (unigrams)](notebooks/NRC Hashtag Sentiments - unigrams.ipynb)
  • [NRC Hashtag Sentiments (bigrams)](notebooks/NRC Hashtag Sentiments - bigrams.ipynb)
  • [NRC Emotion Lexicon Features](notebooks/NRC Emotion Lexicon Features.ipynb)
  • [MaxDiff Twitter Sentiment Lexicon - unigrams and bigrams](notebooks/MaxDiff Twitter Sentiment Lexicon - unigrams and bigrams.ipynb) ☨
  • [Sentiment 140 - unigrams](notebooks/Sentiment140 - unigrams.ipynb) ☨
  • [Sentiment 140 - bigrams](notebooks/Sentiment140 - bigrams.ipynb) ☨

☨ - most recently updated
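
As a minimal sketch of the simplest of these features, the snippet below counts ALL CAPS words with a regex and builds a word n-gram bag of words with scikit-learn's CountVectorizer. The two tweets are made-up examples; the notebooks run this on STS-Gold.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

tweets = pd.Series([
    "I LOVE this SO MUCH :)",
    "worst. day. ever.",
])

# Count of ALL CAPS tokens (two or more consecutive uppercase letters)
all_caps_count = tweets.apply(lambda t: len(re.findall(r"\b[A-Z]{2,}\b", t)))
print(all_caps_count.tolist())  # [3, 0]

# Word n-grams (bag of words): unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
bow = vectorizer.fit_transform(tweets)  # sparse document-term count matrix
print(bow.shape)
print(vectorizer.get_feature_names_out())
```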

Notes on Features

  • Each feature is demonstrated on the STS-Gold dataset (see Datasets below).
  • Each feature is evaluated on accuracy using 5-fold cross-validation with MultinomialNB (Multinomial Naive Bayes), BernoulliNB (Bernoulli Naive Bayes), and SVC (Support Vector Classifier), compared against always predicting the most frequent class (DummyClassifier); a sketch of this setup follows these notes
    • these are the models I most often see used for text classification tasks
  • The lexicons used for some advanced features have links within the notebook as well as in the Lexicons table below.
    • They are not hosted in this repository to respect the wishes and licenses of their creators.
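
A minimal sketch of that evaluation loop, assuming the tweets and labels are already loaded into plain Python lists (the toy texts and labels below are placeholders for the STS-Gold data):

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Placeholder data; in the notebooks this comes from STS-Gold.
texts = ["great day", "love this", "so happy", "awesome news", "what a win",
         "awful service", "never again", "so sad", "terrible idea", "what a loss"]
labels = ["pos"] * 5 + ["neg"] * 5

models = {
    "most_frequent (baseline)": DummyClassifier(strategy="most_frequent"),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "SVC": SVC(),
}

for name, model in models.items():
    # Bag-of-words features feeding each classifier; swap in any other feature extractor here.
    pipeline = make_pipeline(CountVectorizer(), model)
    scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```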

Tokenizers

  • NLTK's casual tokenizer (TweetTokenizer)
    • Good for tweet tokenization; takes hashtags, mentions, and emoji into account (see the comparison sketch after this list)
  • NLTK's word_tokenize (default is TreebankWordTokenizer)
    • Good for simplicity's sake; the de facto industry-standard Penn Treebank tokenization
  • Christopher Potts Sentiment-Aware Tokenizer (HappyFunTokenizer)
    • Takes all tokens from NLTK's TweetTokenizer and adds/changes:
      • HTML character entity resolution
      • HTML tag tokenization (i.e., tags like <b>)
      • Does not split contractions
    • Has a negation-approximating method for additional semantic features
    • Updated to work with Python 3
  • spaCy's default tokenizer (uses Penn Treebank)
    • Good for faster tokenization compared to NLTK's word_tokenize
  • TweetNLP Tokenizer (python port)
    • Haven't used it; I imagine it's similar to the Sentiment-Aware tokenizer and TweetTokenizer
    • Additional POS-tagging features are available on the website
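
A quick comparison of these tokenizers on a sample tweet (a rough sketch; word_tokenize needs the NLTK punkt data, and the negation marking at the end is NLTK's Potts-style mark_negation helper, used here as a stand-in for the HappyFunTokenizer's own method):

```python
import nltk
import spacy
from nltk.sentiment.util import mark_negation
from nltk.tokenize import TweetTokenizer, word_tokenize

nltk.download("punkt", quiet=True)  # word_tokenize needs the punkt data (punkt_tab on newer NLTK)

tweet = "@user I don't LOVE #mondays... :-( http://example.com"

# NLTK casual tokenizer: keeps mentions, hashtags, emoticons, and URLs intact
print(TweetTokenizer(reduce_len=True).tokenize(tweet))

# Penn Treebank-style tokenization: splits contractions, breaks up URLs and emoticons
print(word_tokenize(tweet))

# spaCy's default tokenizer via a blank English pipeline (no model download needed)
nlp = spacy.blank("en")
print([token.text for token in nlp(tweet)])

# Negation scope marking: appends _NEG to tokens between a negation and punctuation
print(mark_negation(TweetTokenizer().tokenize(tweet)))
```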

Datasets

Sentiment Classification

Source: Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold

| Dataset | Description | Download Link | Sentiment Transformation* |
| --- | --- | --- | --- |
| Stanford Twitter Sentiment Test Set (STS-Test/Sentiment140) | The Stanford Twitter sentiment corpus, introduced by Go et al., consists of two different sets, training and test. The training set contains 1.6 million tweets automatically labelled as positive or negative based on emoticons. For example, a tweet is labelled as positive if it contains :), :-), : ), :D, or =) and is labelled as negative if it contains :(, :-(, or : (. The test set (STS-Test), on the other hand, is manually annotated and contains 177 negative, 182 positive and 139 neutral tweets. | download link | 0 = negative, 2 = neutral, 4 = positive |
| Sanders-Twitter | This free dataset is for training and testing sentiment analysis algorithms. It consists of 5,513 hand-classified tweets, each classified with respect to one of four different topics. The download script was not fully functioning as of Dec 28, 2015; do some creative googling to find a fuller dataset. | download script | None |
| Sentiment Strength Twitter Dataset (SS-Tweet) | This dataset consists of 4,242 tweets manually labelled with their positive and negative sentiment strengths, i.e., a negative strength is a number between -1 (not negative) and -5 (extremely negative), and a positive strength is a number between 1 (not positive) and 5 (extremely positive). The dataset was constructed by (source) to evaluate SentiStrength, a lexicon-based method for sentiment strength detection. | download link | Positive if positive strength / negative strength > 1.5, negative otherwise; neutral if abs(positive / negative) = 1 |
| Health Care Reform (HCR) | The Health Care Reform (HCR) dataset was built by crawling tweets containing the hashtag “#hcr” (health care reform) in March 2010. A subset of this corpus was manually annotated by the authors with 5 labels (positive, negative, neutral, irrelevant, unsure (other)) and split into training (839 tweets), development (838 tweets) and test (839 tweets) sets. | download link | None |
| Obama-McCain Debate (OMD) | The Obama-McCain Debate (OMD) dataset was constructed from 3,238 tweets crawled during the first U.S. presidential TV debate in September 2008. Sentiment labels were acquired for these tweets using Amazon Mechanical Turk, where each tweet was rated by at least three annotators as positive, negative, mixed, or other. | download link | Labels on which two-thirds of the annotators agree become the final labels of the tweets |
| STS-Gold | The STS-Gold dataset contains 13 negative, 27 positive and 18 neutral entities as well as 1,402 negative, 632 positive and 77 neutral tweets (details in Section 3 of the source paper). | download link | 0 = negative, 4 = positive |

* transformations applied to convert various sentiment scores to sentiment labels
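
For concreteness, these transformations are simple label mappings or threshold rules; a small pandas sketch (the column names here are placeholders, not the datasets' actual headers):

```python
import pandas as pd

# STS-Test / STS-Gold style: numeric polarity codes map directly to labels
sts = pd.DataFrame({"polarity": [0, 4, 2, 0, 4]})
sts["label"] = sts["polarity"].map({0: "negative", 2: "neutral", 4: "positive"})

# SS-Tweet style: compare positive strength to the absolute negative strength
ss = pd.DataFrame({"pos_strength": [4, 1, 2], "neg_strength": [-1, -5, -2]})
ratio = ss["pos_strength"] / ss["neg_strength"].abs()
ss["label"] = ratio.apply(
    lambda r: "neutral" if r == 1 else ("positive" if r > 1.5 else "negative")
)

print(sts)
print(ss)
```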

Lexicons

Lexicons are typically used for feature creation for sentiment classification tasks. Source: NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets

| Lexicon | Description | Download Link | Features Created |
| --- | --- | --- | --- |
| NRC Emotion Lexicon | Both a Word-Emotion and a Word-Sentiment Association Lexicon. | NRC Word-Emotion Association Lexicon | |
| MPQA Subjectivity Lexicon | The Subjectivity Lexicon (list of subjectivity clues) that is part of OpinionFinder is also available for separate download. These clues were compiled from several sources (see the enclosed README). | Subjectivity Lexicon | - |
| Bing Liu Sentiment Lexicon | A list of positive and negative opinion words or sentiment words for English (around 6,800 words). This list was compiled over many years, starting from our first paper (Hu and Liu, KDD-2004). | Opinion Lexicon | |
| NRC Hashtag Sentiment Lexicon | These lexicons were used to generate winning submissions for the sentiment analysis shared tasks of SemEval-2013 Task 2 and SemEval-2014 Task 9. | NRC Hashtag Sentiment Lexicon | |
| Sentiment140 Lexicon | Lexicon generated from the Sentiment140 dataset. | Sentiment140 Lexicon | |
| MPQA Effect Lexicon | Lexicon indicating whether a WordNet synset has a positive, negative, or no effect on entities. | +/- Effect Lexicon | ~ |
| MaxDiff Twitter Sentiment Lexicon | The lexicon provides real-valued scores for the strength of association of terms with positive sentiment. The sentiment annotations were done manually through Mechanical Turk using the MaxDiff method of annotation. | Max Diff Twitter Sentiment Lexicon | |
| AFINN | AFINN is a list of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words were manually labelled by Finn Årup Nielsen in 2009-2011. | AFINN Python Implementation | - |
| SentiWordNet | SentiWordNet is a lexical resource for opinion mining. SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity. | SentiWordNet | - |
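
As a rough sketch of how these lexicons typically become features (here, Bing Liu-style positive/negative word counts and their difference; the word sets below are hard-coded stand-ins for the downloaded lexicon files, which are not redistributed in this repo):

```python
from nltk.tokenize import TweetTokenizer

# Stand-ins for the Bing Liu Opinion Lexicon word lists; load the real files after downloading.
positive_words = {"love", "great", "awesome", "thankful"}
negative_words = {"hate", "awful", "worst", "sad"}

def lexicon_features(tweet: str) -> dict:
    tokens = [t.lower() for t in TweetTokenizer().tokenize(tweet)]
    pos = sum(token in positive_words for token in tokens)
    neg = sum(token in negative_words for token in tokens)
    return {"pos_count": pos, "neg_count": neg, "net_score": pos - neg}

print(lexicon_features("I hate mondays but I LOVE coffee"))
# {'pos_count': 1, 'neg_count': 1, 'net_score': 0}
```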

Literature / Papers

  • NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets
    • Winners of the SemEval 2013/2014 Twitter Sentiment Analysis tasks describe their methodology
  • NRC-Canada-2014: Recent Improvements in the Sentiment Analysis of Tweets
    • Winners of the SemEval 2013/2014 Twitter tasks describe additional improvements to sentiment analysis from handling negation contexts (*n't, not, no, never, etc.)
  • Harnessing Twitter ‘Big Data’ for Automatic Emotion Identification
    • Some interesting feature ideas, like splitting a tweet in half and building two BOWs (sketched below)
    • From the paper: we also hypothesize that the words located towards the end of a tweet are more important than other words, because people usually summarize or highlight their points in the end. For example, “I hate it when stuff like that happens,.. ;/ thank god it worked out.<3 #thankful.”. Although “hate” appears in the first half of the tweet, the overall emotion is dominated by “thank” in the latter half. We encoded the position information into a feature by attaching a number (i.e., 1 or 2) to each n-gram to indicate whether it is in the first half or the second half of the tweet
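
A rough sketch of that position encoding (the tag format and tokenizer are my own choices, not the paper's exact setup):

```python
from nltk.tokenize import TweetTokenizer

def positional_unigrams(tweet: str) -> list:
    """Tag each token with _1 or _2 depending on which half of the tweet it falls in."""
    tokens = TweetTokenizer().tokenize(tweet)
    midpoint = len(tokens) // 2
    return [f"{tok}_{1 if i < midpoint else 2}" for i, tok in enumerate(tokens)]

print(positional_unigrams("I hate it when stuff like that happens thank god it worked out"))
# ['I_1', 'hate_1', ..., 'thank_2', 'god_2', 'it_2', 'worked_2', 'out_2']
```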
  • On Assessing the Sentiment of General Tweets
    • From the paper: we include a new lexico-syntactic binary feature <w>-POS(w) where ‘w’ is a sentiment expressing word found in sentiment lexicons and POS(w) is its part-of-speech as observed in an input tweet. For such words, based on the dependency parse [5] of a tweet, we also introduce another lexico-syntactic binary feature <w>-<g/d>-<dtype>, where ‘dtype’ is the type of a dependency relation involving ‘w’ and ‘g/d’ is determined based on whether the relation has ‘w’ as a governor (g) or a dependent (d)
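
A sketch of those lexico-syntactic features with spaCy (assumes the en_core_web_sm model is installed; the feature-string format loosely follows the paper's description):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this small English model is installed

# Stand-in for a sentiment lexicon; in practice use one of the lexicons above.
sentiment_words = {"hate", "love", "great", "awful"}

def lexico_syntactic_features(text: str) -> set:
    features = set()
    for token in nlp(text):
        w = token.text.lower()
        if w not in sentiment_words:
            continue
        # <w>-POS(w): the sentiment word paired with its observed part of speech
        features.add(f"{w}-{token.tag_}")
        # <w>-d-<dtype>: w is the dependent in the relation to its head
        features.add(f"{w}-d-{token.dep_}")
        # <w>-g-<dtype>: w governs each of its children in these relations
        for child in token.children:
            features.add(f"{w}-g-{child.dep_}")
    return features

print(lexico_syntactic_features("I hate waiting but I love the great coffee"))
```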