
Data and code to support "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley)

anlp21

Course materials for "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley).

Syllabus: http://people.ischool.berkeley.edu/~dbamman/info256.html

Notebook Description
1.words/EvaluateTokenizationForSentiment The impact of tokenization choices on sentiment classification.
1.words/ExploreTokenization Different methods for tokenizing texts (whitespace, NLTK, spaCy, regex).
1.words/TokenizePrintedBooks Design a better tokenizer for printed books
1.words/Text_Complexity Implement type-token ratio and Flesch-Kincaid Grade Level scores for text
2.compare/ChiSquare, Mann-Whitney Tests Explore two tests for finding distinctive terms
2.compare/Log-odds ratio with priors Implement the log-odds ratio with an informative (and uninformative) Dirichlet prior
3.dictionaries/DictionaryTimeSeries Plot sentiment over time using human-defined dictionaries
3.dictionaries/Empath Explore using Empath dictionaries to characterize texts
4.embeddings/DistributionalSimilarity Explore the distributional hypothesis to build high-dimensional, sparse representations for words
4.embeddings/WordEmbeddings Explore word embeddings using Gensim
4.embeddings/Semaxis Implement SemAxis for scoring terms along a user-defined axis (e.g., positive-negative, concrete-abstract, hot-cold).
4.embeddings/BERT Explore the basics of token representations in BERT and use it to find token nearest neighbors
4.embeddings/SequenceEmbeddings Use sequence embeddings to find TV episode summaries most similar to a short description
5.eda/WordSenseClustering Infer distinct word senses using KMeans clustering over BERT representations
5.eda/Haiku KMeans Explore text representation in clustering by trying to group haiku and non-haiku poems into two distinct clusters