# anlp21

Course materials for "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley).

Syllabus: http://people.ischool.berkeley.edu/~dbamman/info256.html
| Notebook | Description |
|---|---|
| 1.words/EvaluateTokenizationForSentiment | The impact of tokenization choices on sentiment classification. |
| 1.words/ExploreTokenization | Different methods for tokenizing texts (whitespace, NLTK, spaCy, regex); sketched below the table |
| 1.words/TokenizePrintedBooks | Design a better tokenizer for printed books |
| 1.words/Text_Complexity | Implement type-token ratio and Flesch-Kincaid Grade Level scores for text; sketched below the table |
| 2.compare/ChiSquare, Mann-Whitney Tests | Explore two tests for finding distinctive terms; the chi-square variant is sketched below the table |
| 2.compare/Log-odds ratio with priors | Implement the log-odds ratio with an informative (and uninformative) Dirichlet prior; sketched below the table |
| 3.dictionaries/DictionaryTimeSeries | Plot sentiment over time using human-defined dictionaries |
| 3.dictionaries/Empath | Explore using Empath dictionaries to characterize texts |
| 4.embeddings/DistributionalSimilarity | Explore the distributional hypothesis to build high-dimensional, sparse representations for words |
| 4.embeddings/WordEmbeddings | Explore word embeddings using Gensim; sketched below the table |
| 4.embeddings/Semaxis | Implement SemAxis for scoring terms along a user-defined axis (e.g., positive-negative, concrete-abstract, hot-cold); sketched below the table |
| 4.embeddings/BERT | Explore the basics of token representations in BERT and use it to find token nearest neighbors |
| 4.embeddings/SequenceEmbeddings | Use sequence embeddings to find TV episode summaries most similar to a short description |
| 5.eda/WordSenseClustering | Infer distinct word senses using KMeans clustering over BERT representations |
| 5.eda/Haiku KMeans | Explore text representation in clustering by trying to group haiku and non-haiku poems into two distinct clusters |
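A minimal sketch of the strategies compared in 1.words/ExploreTokenization; the sample sentence and model choices (NLTK's `punkt`, spaCy's `en_core_web_sm`) are illustrative assumptions, not necessarily what the notebook uses:

```python
import re

import nltk
import spacy

text = "Mr. O'Neill doesn't like the U.S. economy; it's too volatile."

# Whitespace tokenization: fast, but punctuation stays glued to words.
whitespace_tokens = text.split()

# NLTK's word tokenizer (needs the "punkt" sentence-boundary models).
nltk.download("punkt", quiet=True)
nltk_tokens = nltk.word_tokenize(text)

# spaCy tokenization (assumes the en_core_web_sm model is installed).
nlp = spacy.load("en_core_web_sm")
spacy_tokens = [token.text for token in nlp(text)]

# Regex tokenization: keep runs of word characters, drop everything else.
regex_tokens = re.findall(r"\w+", text)

for name, tokens in [("whitespace", whitespace_tokens), ("NLTK", nltk_tokens),
                     ("spaCy", spacy_tokens), ("regex", regex_tokens)]:
    print(f"{name:>10}: {tokens}")
```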
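For 1.words/Text_Complexity, a sketch of both measures; the syllable counter is a crude vowel-run heuristic assumed here for illustration:

```python
import re

def type_token_ratio(tokens):
    """Distinct word types divided by total tokens."""
    return len(set(tokens)) / len(tokens)

def count_syllables(word):
    """Approximate syllables as runs of vowels (at least one per word)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(sentences):
    """0.39*(words/sentence) + 11.8*(syllables/word) - 15.59."""
    words = [w for sentence in sentences for w in sentence]
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["it", "was", "remarkably", "comfortable"]]
print(type_token_ratio([w for s in sentences for w in s]))  # 9 types / 10 tokens = 0.9
print(flesch_kincaid_grade(sentences))
```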
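For the chi-square half of "2.compare/ChiSquare, Mann-Whitney Tests", a sketch that scores each term with a 2x2 contingency test via SciPy; the function name and setup are illustrative, not the notebook's:

```python
from collections import Counter

from scipy.stats import chi2_contingency

def chi_square_distinctive(tokens_a, tokens_b, top=10):
    """Rank terms by how unevenly they are distributed across two corpora."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    scores = {}
    for w in set(counts_a) | set(counts_b):
        # 2x2 table: occurrences of w vs. all other tokens, in each corpus.
        table = [[counts_a[w], n_a - counts_a[w]],
                 [counts_b[w], n_b - counts_b[w]]]
        statistic, p_value, _, _ = chi2_contingency(table)
        scores[w] = statistic
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]
```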
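For "2.compare/Log-odds ratio with priors", a sketch of the Monroe et al. (2008) weighted log-odds with an informative Dirichlet prior; the 0.01 floor on prior counts is an assumption made here, not necessarily the notebook's choice:

```python
import math
from collections import Counter

def log_odds_with_prior(tokens_a, tokens_b, prior_tokens):
    """Monroe et al. (2008) weighted log-odds; positive z-scores mark terms
    distinctive of corpus A, negative scores terms distinctive of corpus B."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    prior = Counter(prior_tokens)  # informative prior from a background corpus
    alpha0 = sum(prior.values())
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    z_scores = {}
    for w in set(counts_a) | set(counts_b):
        a_w = prior[w] + 0.01  # tiny floor so unseen words keep a nonzero prior
        y_a, y_b = counts_a[w], counts_b[w]
        delta = (math.log((y_a + a_w) / (n_a + alpha0 - y_a - a_w))
                 - math.log((y_b + a_w) / (n_b + alpha0 - y_b - a_w)))
        variance = 1.0 / (y_a + a_w) + 1.0 / (y_b + a_w)
        z_scores[w] = delta / math.sqrt(variance)
    return z_scores
```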
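For 4.embeddings/WordEmbeddings, a sketch of exploring pretrained vectors with Gensim's downloader; `glove-wiki-gigaword-100` is one of Gensim's stock models, picked here for illustration:

```python
import gensim.downloader

# Downloads on first use (roughly 130 MB for this model).
vectors = gensim.downloader.load("glove-wiki-gigaword-100")

print(vectors.most_similar("berkeley", topn=5))
print(vectors.similarity("cat", "dog"))
# The classic analogy: king - man + woman is close to queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```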
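For 4.embeddings/Semaxis, a sketch of the core SemAxis idea: average antonym seed sets into an axis vector and score words by cosine similarity with it. The seed words are illustrative, and `vectors` is the Gensim model loaded in the previous sketch:

```python
import numpy as np

def semaxis_score(vectors, word, positive_seeds, negative_seeds):
    """Cosine similarity between a word and the positive-minus-negative axis."""
    pos = np.mean([vectors[w] for w in positive_seeds], axis=0)
    neg = np.mean([vectors[w] for w in negative_seeds], axis=0)
    axis = pos - neg
    v = vectors[word]
    return float(np.dot(v, axis) / (np.linalg.norm(v) * np.linalg.norm(axis)))

for word in ["wonderful", "terrible", "table"]:
    print(word, semaxis_score(vectors, word,
                              positive_seeds=["good", "excellent", "positive"],
                              negative_seeds=["bad", "awful", "negative"]))
```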