Stanford CS224n Natural Language Processing with Deep Learning
The course notes about Stanford CS224n Winter 2019 (using PyTorch)
I'll write some general notes in my Deep Learning Practice repository
Course Related Links
- Course Main Page: Winter 2019 (latest)
- Lecture Videos
- Stanford Online Hub - CS224n
Schedule
| Week | Lectures | Assignments |
|---|---|---|
| 2019/7/1~7/7 | Introduction and Word Vectors, Word Vectors 2 and Word Senses | Assignment 1 |
| 2019/7/8~7/14 | Word Window Classification, Neural Networks, and Matrix Calculus | - |
| 2019/7/15~7/21 | Backpropagation and Computation Graphs | Assignment 2 |
| 2019/10/21~10/27 | Linguistic Structure: Dependency Parsing | - |
| 2019/10/28~11/3 | Recurrent Neural Networks and Language Models | Assignment 3 |
| 2019/11/4~11/10 | Vanishing Gradients and Fancy RNNs, Machine Translation, Seq2Seq and Attention | Assignment 4 |
| 2019/11/11~11/17 | Transformers and Self-Attention For Generative Models, Modeling contexts of use: Contextual Representations and Pretraining | - |
| 2019/11/18~11/24 | Practical Tips for Projects, Question Answering, ConvNets for NLP, Subword Models | Assignment 5 |
| 2019/11/25~12/1 | [Project: Question Answering], Natural Language Generation | - |
| 2019/12/2~12/8 | [Project: Question Answering] | - |
| 2019/12/9~12/15 | Reference in Language and Coreference Resolution | - |
| 2020/1/13~1/19 | Multitask Learning: A general model for NLP? | - |
Lecture
- [X] Introduction and Word Vectors
- [X] Word Vectors 2 and Word Senses
- [X] Word Window Classification, Neural Networks, and Matrix Calculus
- [X] Backpropagation and Computation Graphs
- [X] Linguistic Structure: Dependency Parsing
- [X] The probability of a sentence? Recurrent Neural Networks and Language Models
- [X] Vanishing Gradients and Fancy RNNs
- [X] Machine Translation, Seq2Seq and Attention
- [X] Practical Tips for Final Projects - Default Final Project
- [X] Question Answering and the Default Final Project - Default Final Project
- [X] ConvNets for NLP
- [X] Information from parts of words: Subword Models - Assignment 5
- [X] Modeling contexts of use: Contextual Representations and Pretraining - ELMo, BERT
- [X] Transformers and Self-Attention For Generative Models - Self-attention, Transformer
- [X] Natural Language Generation
- [X] Reference in Language and Coreference Resolution
- [X] Multitask Learning: A general model for NLP?
- [ ] Constituency Parsing and Tree Recursive Neural Networks - TODO
- [ ] Safety, Bias, and Fairness
- [ ] Future of NLP + Deep Learning
Assignment
- [X] Exploring Word Vectors
- [X] word2vec
- [X] code
- [X] written
- [X] Dependency Parsing
- [X] code
- [X] written
- [X] Neural Machine Translation
- [X] code
- [X] written
- [ ] Character-based Neural Machine Translation
- [X] code
- [ ] written - TODO
Project
- [ ] Question Answering (Default)
- [ ] Summarization
Paper reading
- [X] word2vec
- [ ] negative sampling
- [ ] GloVe
- [ ] improving distributional similarity
- [ ] embedding evaluation methods
- [X] Transformer
- [X] ELMo
- [X] BERT
- [ ] fastText
Derivation
- [ ] backprop
Lectures
Lecture 1: Introduction and Word Vectors
- slides
- notes
- readings
- [ ] Word2Vec Tutorial - The Skip-Gram Model
- [ ] Efficient Estimation of Word Representations in Vector Space (original word2vec paper)
- [ ] Distributed Representations of Words and Phrases and their Compositionality (negative sampling paper)
- Gensim example
- preparing embedding: download this zip file and unzip the `glove.6B.*d.txt` files into the `embedding/GloVe` directory
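For example, a minimal sketch (mine, not the official notebook) of loading the unzipped vectors with Gensim; the converted file name is just a placeholder:

```python
# A minimal sketch: load the unzipped GloVe vectors with Gensim and run a few queries.
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

glove_path = "embedding/GloVe/glove.6B.100d.txt"    # path assumed from the step above
w2v_path = "embedding/GloVe/glove.6B.100d.w2v.txt"  # converted copy (placeholder name)

glove2word2vec(glove_path, w2v_path)                # GloVe format -> word2vec text format
vectors = KeyedVectors.load_word2vec_format(w2v_path, binary=False)

print(vectors.most_similar("king", topn=5))
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))  # analogy query
```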
Outline
- Introduction to Word2vec
- objective function
- prediction function
- how to train it
- Optimization: Gradient Descent & Chain Rule
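As a reminder to myself, a tiny sketch of the naive-softmax skip-gram objective plus one plain SGD step; the vocabulary size, dimensions and word indices are made up:

```python
# Sketch of the naive-softmax skip-gram objective:
# J = -log P(o | c), with P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c).
import torch

vocab_size, dim = 10, 4
V = torch.randn(vocab_size, dim, requires_grad=True)  # center-word vectors v_c
U = torch.randn(vocab_size, dim, requires_grad=True)  # outside-word vectors u_o

center, outside = 2, 7                                # toy (center, outside) word pair
scores = U @ V[center]                                # u_w . v_c for every word w
loss = -torch.log_softmax(scores, dim=0)[outside]     # -log P(o | c)

loss.backward()                                       # chain rule via autograd
with torch.no_grad():                                 # one plain SGD step
    for param in (V, U):
        param -= 0.1 * param.grad
        param.grad.zero_()
```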
Lecture 2: Word Vectors 2 and Word Senses
- slides
- notes
- readings
- [ ] GloVe: Global Vectors for Word Representation (original GloVe paper)
- [ ] Improving Distributional Similarity with Lessons Learned from Word Embeddings
- [ ] Evaluation methods for unsupervised word embeddings
- additional readings
- [ ] A Latent Variable Model Approach to PMI-based Word Embeddings
- [ ] Linear Algebraic Structure of Word Senses, with Applications to Polysemy
- [ ] On the Dimensionality of Word Embedding
Outline
- More details on Word2vec
- Skip-grams (SG)
- Continuous Bag of Words (CBOW)
- Similarity visualization
- Co-occurrence matrix + SVD (LSA) vs. Embedding
- Evaluation on word vectors
- Intrinsic
- Extrinsic
CS 168 The Modern Algorithmic Toolbox - for SVD
Lecture 3: Word Window Classification, Neural Networks, and Matrix Calculus
- slides
- matrix calculus
- notes
- readings
- [ ] CS231n notes on backprop
- [ ] Review of differential calculus
- additional readings
- [ ] Natural Language Processing (Almost) from Scratch
Outline
- Some basic idea of NLP tasks
- Matrix Calculus
- Jacobian Matrix
- Shape convention
- Loss
- Softmax
- Cross-entropy
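A quick self-check (not from the handout) that hand-written softmax + cross-entropy matches PyTorch's fused `F.cross_entropy`:

```python
# Softmax and cross-entropy written out by hand vs. PyTorch's fused version.
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])     # scores for 3 classes, batch of 1
target = torch.tensor([0])                    # true class index

probs = torch.exp(logits) / torch.exp(logits).sum(dim=1, keepdim=True)  # softmax
manual = -torch.log(probs[0, target])         # cross-entropy for the true class
fused = F.cross_entropy(logits, target)       # PyTorch computes the same thing

print(manual.item(), fused.item())            # both print the same value
```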
Lecture 4: Backpropagation and Computation Graphs
- slides
- notes - same as lecture 3
- readings
- [ ] CS231n notes on network architectures
- [ ] Learning Representations by Backpropagating Errors
- [ ] Derivatives, Backpropagation, and Vectorization
- [ ] Yes you should understand backprop
Outline
- Computational Graph
- Backprop & Forwardprop
- Introducing regularization to prevent overfitting
- Non-linearity: activation functions
- Practical Tips
- Parameter Initialization
- Optimizers
- plain SGD
- more sophisticated adaptive optimizers
- Learning Rates
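A toy sketch (my own, arbitrary sizes) tying these together: a small computation graph, backprop via autograd, L2 regularization via `weight_decay`, Xavier initialization and an adaptive optimizer:

```python
# Tiny computation graph + backprop + the practical tips from the outline.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(5, 8), nn.ReLU(), nn.Linear(8, 3))  # non-linearity in between
nn.init.xavier_uniform_(model[0].weight)                            # parameter initialization

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # adaptive optimizer + L2
x, y = torch.randn(4, 5), torch.randint(0, 3, (4,))                 # fake batch

logits = model(x)                              # forward prop builds the graph
loss = nn.functional.cross_entropy(logits, y)
optimizer.zero_grad()
loss.backward()                                # backprop walks the graph in reverse
optimizer.step()
```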
Lecture 5: Linguistic Structure: Dependency Parsing
- slides
- notes
- readings
- [ ] Incrementality in Deterministic Dependency Parsing
- [ ] A Fast and Accurate Dependency Parser using Neural Networks
- [ ] Dependency Parsing
- [ ] Globally Normalized Transition-Based Neural Networks
- [ ] Universal Stanford Dependencies: A cross-linguistic typology
- [X] Universal Dependencies website
Outline
- Methods of Dependency Parsing
- Dynamic Programming
- complexity O(n³)
- Graph Algorithm
- create a minimum spanning tree for a sentence
- Constraint Satisfaction
- eliminate edges that don't satisfy hard constraints
- Transition-based Parsing / Deterministic Dependency Parsing
- greedy choice of attachments guided by machine learning classifier
- complexity O(n)
- Operations of the Shift-reduce Parser
- Shift
- Left-Arc
- Right-Arc
- Attachment Errors
- Prepositional Phrase Attachment Errors
- Verb Phrase Attachment Errors
- Modifier Attachment Errors
- Coordination Attachment Errors
mentioned CS103, CS228
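A bare-bones sketch of the three transitions, roughly what Assignment 3's `PartialParse` ends up doing; the sentence and transition sequence are made up:

```python
# Minimal arc-standard style shift-reduce transitions (toy example, no labels).
class PartialParse:
    def __init__(self, sentence):
        self.stack = ["ROOT"]          # parsing stack
        self.buffer = list(sentence)   # words not processed yet
        self.arcs = []                 # (head, dependent) pairs

    def step(self, transition):
        if transition == "S":          # Shift: move first buffer word onto the stack
            self.stack.append(self.buffer.pop(0))
        elif transition == "LA":       # Left-Arc: second-from-top becomes dependent of top
            dependent = self.stack.pop(-2)
            self.arcs.append((self.stack[-1], dependent))
        elif transition == "RA":       # Right-Arc: top becomes dependent of second-from-top
            dependent = self.stack.pop()
            self.arcs.append((self.stack[-1], dependent))

parse = PartialParse(["I", "ate", "fish"])
for t in ["S", "S", "LA", "S", "RA", "RA"]:
    parse.step(t)
print(parse.arcs)   # [('ate', 'I'), ('ate', 'fish'), ('ROOT', 'ate')]
```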
Lecture 6: The probability of a sentence? Recurrent Neural Networks and Language Models
- slides
- notes
- readings
- [ ] N-gram Language Models (textbook chapter)
- [ ] The Unreasonable Effectiveness of Recurrent Neural Networks (blog post overview)
- [ ] Sequence Modeling: Recurrent and Recursive Neural Nets (Sections 10.1 and 10.2)
- [ ] On Chomsky and the Two Cultures of Statistical Learning
- N-gram Language Model
- Fixed-window Neural Language Model
- vanilla RNN
- Language Modeling: the task of predicting the next word, given the words so far
- Language Model: a system that produces the probability distribution for the next candidate word
- Conditional Language Modeling: the task of predicting the next word, given the words so far, and also some other input x
- Machine Translation (x=source sentence, y=target sentence)
- Summarization (x=input text, y=summarized text)
- Dialogue (x=dialogue history, y=next utterance)
- ...
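A minimal vanilla RNN language model in PyTorch (toy sizes, not the assignment's), just to make the "predict the next word" definition concrete:

```python
# Vanilla RNN language model: a distribution over the next word at every position.
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        hidden_states, _ = self.rnn(self.embed(tokens))
        return self.proj(hidden_states)           # next-word logits at each position

model = RNNLM()
tokens = torch.randint(0, 1000, (2, 7))           # fake batch of token ids
logits = model(tokens)                            # (2, 7, 1000)
probs = torch.softmax(logits, dim=-1)             # P(next word | words so far)
```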
Lecture 7: Vanishing Gradients and Fancy RNNs
- slides
- notes - same as lecture 6
- readings
- [ ] Sequence Modeling: Recurrent and Recursive Neural Nets - (textbook sections 10.3, 10.5, 10.7-10.12)
- [ ] Learning long-term dependencies with gradient descent is difficult (one of the original vanishing gradient papers)
- [ ] On the difficulty of training Recurrent Neural Networks (proof of vanishing gradient problem)
- [ ] Vanishing Gradients Jupyter Notebook (demo for feedforward networks)
- [X] Understanding LSTM Networks (blog post overview)
Vanishing gradient =>
- LSTM and GRU
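Sketch: the gated units as drop-in PyTorch modules (arbitrary sizes); the extra gates and cell state are what help gradients flow over long distances:

```python
# LSTM vs. GRU as drop-in replacements for a vanilla RNN layer.
import torch
import torch.nn as nn

x = torch.randn(2, 7, 64)                      # (batch, seq_len, input_dim)
lstm = nn.LSTM(64, 128, batch_first=True)
gru = nn.GRU(64, 128, batch_first=True)

out_lstm, (h_n, c_n) = lstm(x)                 # LSTM keeps a separate cell state c_n
out_gru, h_gru = gru(x)                        # GRU has gates but no separate cell state
print(out_lstm.shape, c_n.shape, out_gru.shape)
```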
Lecture 8: Machine Translation, Seq2Seq and Attention
- slides
- notes
- readings
- [ ] Statistical Machine Translation slides, CS224n 2015 (lectures 2/3/4)
- [ ] Statistical Machine Translation (book by Philipp Koehn)
- [ ] BLEU: a Method for Automatic Evaluation of Machine Translation (original paper)
- [ ] Sequence to Sequence Learning with Neural Networks (original seq2seq NMT paper)
- [ ] Sequence Transduction with Recurrent Neural Networks (early seq2seq speech recognition paper)
- [ ] Neural Machine Translation by Jointly Learning to Align and Translate (original seq2seq+attention paper)
- [ ] Attention and Augmented Recurrent Neural Networks (blog post overview)
- [ ] Massive Exploration of Neural Machine Translation Architectures (practical advice for hyperparameter choices)
- Training method: Teacher Forcing
- During training, we feed the gold (aka reference) target sentence into the decoder, regardless of what the decoder predicts.
- During testing (decoding): Beam Search vs. Greedy Decoding
- Decoding Algorithm: an algorithm you use to generate text from your language model
- Greedy Decoding => lack of backtracking
- on each step take the most probable word (i.e. argmax)
- use that as the next word, and feed it as input on the next step
- keep going until you produce `<END>` or reach some max length
- Beam Search: aims to find a high-probability sequence by tracking multiple possible sequences at once
- on each step of the decoder, keep track of the k (beam size) most probable partial sequences (hypotheses)
- after you reach some stopping criterion (collect n complete hypotheses, each of which stops when it reaches max depth or produces `<END>`), choose the sequence with the highest probability (with score normalization)
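A compact sketch of both decoding algorithms against a stand-in `next_log_probs(prefix)` function (a fixed toy distribution here, not a real NMT model):

```python
# Greedy decoding vs. beam search over a toy next-word distribution.
import torch

VOCAB, END, MAX_LEN, K = 5, 0, 6, 3

def next_log_probs(prefix):
    torch.manual_seed(len(prefix))                    # deterministic toy "model"
    return torch.log_softmax(torch.randn(VOCAB), dim=0)

def greedy_decode():
    prefix = []
    while len(prefix) < MAX_LEN:
        word = int(next_log_probs(prefix).argmax())   # take the argmax, no backtracking
        prefix.append(word)
        if word == END:
            break
    return prefix

def beam_search():
    beams = [([], 0.0)]                               # (partial hypothesis, total log-prob)
    finished = []
    while beams and len(finished) < K:
        candidates = []
        for prefix, score in beams:
            log_probs = next_log_probs(prefix)
            for word in range(VOCAB):
                candidates.append((prefix + [word], score + float(log_probs[word])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:K]:          # keep the K best partial hypotheses
            if prefix[-1] == END or len(prefix) == MAX_LEN:
                finished.append((prefix, score / len(prefix)))   # length-normalized score
            else:
                beams.append((prefix, score))
    return max(finished, key=lambda c: c[1])[0]

print(greedy_decode(), beam_search())
```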
Lecture 13: Modeling contexts of use: Contextual Representations and Pretraining
- slides
- readings
- The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
- Contextual Word Representations: A Contextual Introduction
ELMo, BERT
Lecture 14: Transformers and Self-Attention For Generative Models
guest lecture
- slides
- readings
- [ ] Attention is all you need
- [ ] Image Transformer
- [ ] Music Transformer: Generating music with long-term structure
Self-attention, Transformer
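A minimal sketch of the scaled dot-product self-attention at the core of the Transformer; single head, toy sizes, no masking:

```python
# Single-head scaled dot-product self-attention (no mask, no multi-head split).
import math
import torch
import torch.nn as nn

d_model = 16
x = torch.randn(2, 5, d_model)                           # (batch, seq_len, d_model)

W_q, W_k, W_v = (nn.Linear(d_model, d_model) for _ in range(3))
Q, K, V = W_q(x), W_k(x), W_v(x)

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_model)    # (batch, seq_len, seq_len)
weights = torch.softmax(scores, dim=-1)                  # each position attends to all positions
output = weights @ V                                     # (batch, seq_len, d_model)
```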
Lecture 9: Practical Tips for Final Projects
- slides
- notes - Good notes about finding existing research, datasets and tasks
- readings
- [ ] Practical Methodology (Deep Learning book chapter)
Vanishing Gradient, LSTM, GRU (again)
Lecture 10: Question Answering and the Default Final Project
- slides
- notes
Some more attention; mentioned CS 276: Information Retrieval and Web Search
Quick notes about QA:
- QA types
- Factoid QA: the answer is a named entity (some clear semantic-type entity)
- Extractive QA: answer must be a span (a sub-sequence of words) in the passage
- e.g. SQuAD 1.X
- defect: all questions have an answer in the paragraph => it turns into a kind of ranking task
- Extractive QA + NoAnswer: some question might have no answer in the paragraph
- e.g. SQuAD 2.0
- limitation:
- only span-based answers (no yes/no, counting, implicit why)
- ...
- Open-domain QA
Lecture 11: ConvNets for NLP
- slides
- notes
- readings
mentioned CS231n: Convolutional Neural Networks for Visual Recognition
Lots of techniques that are common nowadays
- Model Comparison
- Bag of Vectors: take the word vectors and average them
- good baseline
- better if followed by a few ReLU layers
- Window Model
- good for single word classification (for problems that don't need wide context e.g. POS, NER)
- CNNs
- good for classification
- need zero padding for shorter phrases
- easy to parallelize
- RNNs
- cognitively plausible (reading from left to right)
- not best for classification (if you just use the last state)
- much slower than CNNs
- good for sequence tagging
- great for language models and can be amazing with attention mechanism
- Dropout
- for regularization => prevent overfitting
- gives 2~4% accuracy improvement
- Gated units used vertically: shortcut connections (needed for very deep networks to work)
- Residual block
- Highway block
- BatchNorm
- z-transform (standardization): zero mean and unit variance
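A sketch of the conv-over-word-vectors pattern referred to above (toy sizes, one filter width): embed, convolve with zero padding, max-pool over time, classify:

```python
# 1D convolution over word embeddings + max-over-time pooling for classification.
import torch
import torch.nn as nn

vocab_size, embed_dim, num_filters, num_classes = 1000, 64, 100, 2
embed = nn.Embedding(vocab_size, embed_dim)
conv = nn.Conv1d(embed_dim, num_filters, kernel_size=3, padding=1)   # zero padding for short phrases
classifier = nn.Linear(num_filters, num_classes)

tokens = torch.randint(0, vocab_size, (4, 20))                # (batch, seq_len)
features = torch.relu(conv(embed(tokens).transpose(1, 2)))    # (batch, filters, seq_len)
pooled = features.max(dim=2).values                           # max-over-time pooling
logits = classifier(pooled)                                   # (batch, num_classes)
```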
Lecture 12: Information from parts of words: Subword Models
- slides
- readings
fastText
Lecture 15: Natural Language Generation
- slides
Outline
- Decoding methods
- Greedy decoding
- Beam search
- Sampling-based decoding: good for open-ended/creative generation (poetry, stories)
- Pure sampling: like greedy decoding, but sample instead of argmax
- Top-n sampling: like pure sampling, but truncate the probability distribution to the n most probable words
Softmax temperature: another way to control diversity
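A sketch of those knobs on a toy distribution (no real language model behind it): pure sampling, top-n truncation, and temperature:

```python
# Sampling-based decoding: pure sampling, top-n sampling, softmax temperature.
import torch

logits = torch.tensor([4.0, 3.0, 1.0, 0.5, -1.0])      # scores over a tiny vocabulary

def sample(logits, temperature=1.0, top_n=None):
    scaled = logits / temperature                       # tau < 1: peakier, tau > 1: more diverse
    if top_n is not None:                               # keep only the n most probable words
        cutoff = scaled.topk(top_n).values[-1]
        scaled = scaled.masked_fill(scaled < cutoff, float("-inf"))
    probs = torch.softmax(scaled, dim=0)
    return int(torch.multinomial(probs, num_samples=1))

print(sample(logits))                                   # pure sampling
print(sample(logits, top_n=2))                          # top-n sampling
print(sample(logits, temperature=0.5))                  # lower temperature is closer to greedy
```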
- NLG Tasks
- Machine Translation
- (Abstractive) Summarization
- Evaluation: ROUGE
- Dialogue
- chit-chat
- task-based
- Creative writing
- Storytelling
- Poetry-generation
- Freeform Question Answering
- Image captioning
- ...
- NLG Evaluation Metrics
- Word overlap based metrics
- BLEU
- ROUGE
- METEOR
- F1
- ...
- (Perplexity) doesn't tell you anything about generation
- Word embedding based metrics
- Human evaluation
Lecture 16: Reference in Language and Coreference Resolution
- slides
Outline
- Coreference Resolution: identify all mentions that refer to the same real world entity
- Application
- Full text understanding
- Machine translation
- Dialogue systems
- Step (Pipelined system)
- Detect the mentions => using other NLP system
- Cluster the mentions
- End-to-end system
- Model
- Rule-based (pronominal anaphora resolution)
- can't resolve sentences that have identical syntactic structure
- Mention Pair
- binary classifier: coreferent or not (for every pair of mentions)
- clustering (see the sketch after this outline)
- pick a threshold and add coreference links when above
- take the transitive closure to get the clustering
- Mention Ranking
- assign each mention its highest scoring candidate antecedent
- add a dummy NA mention at the front (to allow declining to link)
- Clustering
- Agglomerative clustering
- start with each mention in its own singleton cluster
- merge a pair of clusters at each step
- Mention: span of text referring to some entity
- pronouns
- captured using a part-of-speech tagger
- named entities
- captured using an NER system
- noun phrases
- captured using a parser (especially a constituency parser)
- Linguistics stuff
- Coreference: two mentions refer to the same entity in the world
- Anaphora: when a term (anaphor) refers to another term (antecedent)
- Pronominal Anaphora (Coreferential one)
- Bridging Anaphora (Not Coreferential)
- Cataphora: when the antecedent comes after the anaphor (usually the antecedent comes before)
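The sketch promised above: threshold the pairwise mention-pair scores, then take the transitive closure to get clusters; the mentions and scores are invented:

```python
# Mention-pair clustering: threshold pairwise scores, then transitive closure.
mentions = ["Obama", "he", "the president", "France"]
scores = {("Obama", "he"): 0.9, ("Obama", "the president"): 0.7,
          ("he", "the president"): 0.4, ("Obama", "France"): 0.1,
          ("he", "France"): 0.05, ("the president", "France"): 0.1}
THRESHOLD = 0.5

parent = {m: m for m in mentions}          # union-find gives the transitive closure

def find(m):
    while parent[m] != m:
        m = parent[m]
    return m

for (a, b), score in scores.items():
    if score > THRESHOLD:                  # add a coreference link
        parent[find(a)] = find(b)

clusters = {}
for m in mentions:
    clusters.setdefault(find(m), []).append(m)
print(list(clusters.values()))             # [['Obama', 'he', 'the president'], ['France']]
```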
Lecture 17: Multitask Learning: A general model for NLP
- slides
Outline
- Natural Language Decathlon (decaNLP)
- => reduce each subtask to a more general task => transfer knowledge from the other tasks => maybe then we can do Zero-shot Learning / Transfer Learning
- salesforce/decaNLP: The Natural Language Decathlon: A Multitask Challenge for NLP
- 3 equivalent supertasks of NLP
- Language Modeling
- predict next word
- embedding...
- Question Answering Formalism (Multitask Learning as QA) => Training single question answering model for multiple NLP tasks (aka. questions)
- Question Answering
- Machine Translation
- Summarization
- Natural Language Inference
- Sentiment Classification
- Semantic Role Labeling
- Relation Extraction
- Dialogue
- Semantic Parsing
- Commonsense Reasoning
- Dialogue
- Framework for tackling
- more general language understanding
- multitask learning
- domain adaptation
- transfer learning
- weight sharing, pre-training, fine-tuning (towards ImageNet-CNN of NLP)
- zero-shot learning
Assignments
Assignment 1: Exploring Word Vectors
- code
- directory
Outline
- co-occurrence matrix + Truncated SVD
- pre-trained word2vec
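A condensed sketch of what the notebook does (my own toy corpus and window size), using scikit-learn's `TruncatedSVD`:

```python
# Build a co-occurrence matrix from a toy corpus and reduce it with truncated SVD.
import numpy as np
from sklearn.decomposition import TruncatedSVD

corpus = [["all", "that", "glitters", "is", "not", "gold"],
          ["all", "is", "well", "that", "ends", "well"]]
words = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(words)}

window = 2
M = np.zeros((len(words), len(words)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                M[idx[w], idx[sent[j]]] += 1       # count co-occurrences within the window

svd = TruncatedSVD(n_components=2)                 # keep 2 dimensions (e.g. for plotting)
reduced = svd.fit_transform(M)                     # (vocab_size, 2) word vectors
print(dict(zip(words, reduced.round(2))))
```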
Assignment 2: word2vec
- handout
- directory
- written
- code
- `python3 word2vec.py`: check the correctness of word2vec
- `python3 sgd.py`: check the correctness of SGD
- `./get_datasets.sh; python3 run.py`: training took 9480 seconds
Outline
- Train word2vec with skip-gram model and negative sampling using stochastic gradient descent
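A non-vectorized sketch of the negative-sampling loss for one (center, outside) pair with K sampled negatives; indices are made up and I don't bother excluding the outside word from the negatives:

```python
# Negative-sampling loss for one training pair:
# J = -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)
import torch

vocab_size, dim, K = 10, 4, 3
V = torch.randn(vocab_size, dim, requires_grad=True)    # center-word vectors
U = torch.randn(vocab_size, dim, requires_grad=True)    # outside-word vectors

center, outside = 2, 7
negatives = torch.randint(0, vocab_size, (K,))          # K sampled negative words

pos = torch.log(torch.sigmoid(U[outside] @ V[center]))
neg = torch.log(torch.sigmoid(-U[negatives] @ V[center])).sum()
loss = -(pos + neg)

loss.backward()                                         # gradients for one SGD update
```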
Related
Others' Answer
Assignment 3: Dependency Parsing
- handout
- directory
- written
- code
- `python3 parser_transitions.py part_c`: check the correctness of the transition mechanics
- `python3 parser_transitions.py part_d`: check the correctness of the minibatch parse
- `python3 run.py`
  - set `debug=True` to test the process (debug_out.log)
  - set `debug=False` to train on the entire dataset (train_out.log)
    - best UAS on the dev set: 88.79 (epoch 9/10)
    - best UAS on the test set: 89.27
Outline
- Adam Optimizer
- Dropout
- Neural Transition-based Dependency Parser (a shift-reduce parser)
Others' Answer
Assignment 4: Neural Machine Translation
- handout
- Azure Guide (Google Drive), Practical Guide to VMs (Google Drive)
- directory
- written - BLEU Verify
- A Gentle Introduction to Calculating the BLEU Score for Text in Python
nltk.translate.bleu_score
- Tilde Interactive BLEU score evaluator - input txt
- code
- `python3 sanity_check.py 1d`: check the correctness of the encode procedure (including `utils.pad_sents`)
- `python3 sanity_check.py 1e`: check the correctness of the decode procedure (including the `step` function)
- Preprocess the training data with `sh run.sh vocab` to get the necessary vocabulary
- Test the functionality on CPU: train with `sh run.sh train_local`; test with `sh run.sh test_local`
  - (speed about 100 words/sec on a MacBook Air 1.8GHz i5 CPU)
- Train and test with GPU: train with `sh run.sh train`; test with `sh run.sh test`
  - (speed about 5000 words/sec on an Nvidia GeForce GTX 1080 GPU)
  - (this will generate the model image `model.bin` and the optimizer's state `model.bin.optim`)
  - early stop on `epoch 13, iter 86000, cum. loss 28.94, cum. ppl 5.13 cum. examples 64000` => Corpus BLEU: 22.36579929869114
- Compare the output with the references: `vim -dO outputs/test_outputs.txt en_es_data/test.en`
- Open three of them at the same time: `vim -o outputs/test_outputs.txt en_es_data/test.en en_es_data/test.es`
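One extra way I verify the reported corpus BLEU, using `nltk.translate.bleu_score` against the reference file; tokenization and smoothing differ from the grading script, so expect a number close to, but not exactly, the reported 22.37:

```python
# Double-check corpus BLEU with nltk against the test references.
from nltk.translate.bleu_score import corpus_bleu

with open("en_es_data/test.en") as f:
    references = [[line.split()] for line in f]          # one reference per sentence
with open("outputs/test_outputs.txt") as f:
    hypotheses = [line.split() for line in f]

print(corpus_bleu(references, hypotheses) * 100)          # roughly the reported corpus BLEU
```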
Others' Answer
Assignment 5: Character-based Neural Machine Translation
build a character-level ConvNet
- handout
- directory
- written
- code
- Create the correct vocab files with `sh run.sh vocab`
  - `vocab_tiny_q1.json`: generated vocabulary, source 132 words, target 132 words
    - source: number of word types: 128, number of word types w/ frequency >= 1: 128
    - target: number of word types: 130, number of word types w/ frequency >= 1: 130
  - `vocab_tiny_q2.json`: generated vocabulary, source 26 words, target 32 words
    - source: number of word types: 128, number of word types w/ frequency >= 2: 22
    - target: number of word types: 130, number of word types w/ frequency >= 2: 30
  - `vocab.json`: generated vocabulary, source 50004 words, target 50002 words
    - source: number of word types: 172418, number of word types w/ frequency >= 2: 80623
    - target: number of word types: 128873, number of word types w/ frequency >= 2: 64215
- Sanity checks with `python3 sanity_check.py [part]`
  - pre-defined parts: 1e, 1f, 1j, 2a, 2b, 2c, 2d
  - customized parts: 1g, 1h, 1i, 1j
- Test the first part of the code locally
  - `sh run.sh train_local_q1`: this will run 100 epochs
    - `epoch 100, iter 500, cum. loss 0.31, cum. ppl 1.02 cum. examples 200`
    - `validation: iter 500, dev. ppl 1.003381`
  - `sh run.sh test_local_q1`: the model should overfit => Corpus BLEU: 99.29792465574434 (> 99)
    - this will generate `outputs/test_outputs_local_q1.txt`
- Test the second part of the code locally
  - `sh run.sh train_local_q2`
    - `epoch 200, iter 1000, cum. loss 0.26, cum. ppl 1.01 cum. examples 200`
    - `validation: iter 1000, dev. ppl 1.003469`
  - `sh run.sh test_local_q2`: the model should overfit => Corpus BLEU: 99.29792465574434
    - this will generate `outputs/test_outputs_local_q2.txt`
- Train the model with `sh run.sh train` and test the performance with `sh run.sh test`
  - `epoch 29, iter 196330, avg. loss 90.37, avg. ppl 147.15 cum. examples 10537, speed 3512.25 words/sec, time elapsed 29845.45 sec`
  - `reached maximum number of epochs!` => Corpus BLEU: 24.20035238301319
TODO:
- [ ] Enrich the sanity check of the Highway
- [ ] Enrich the sanity check of the CNN
- [ ] Compare the output with Assignment 4 (especially the `<unk>` words)
- [ ] Written part
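For reference, a rough sketch of the two modules those sanity checks target, a character-level CNN encoder plus a Highway layer, with toy dimensions rather than the assignment's exact hyperparameters:

```python
# Character-level CNN word encoder + Highway layer (toy dimensions).
import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))                  # how much to transform vs. pass through
        return g * torch.relu(self.proj(x)) + (1 - g) * x

char_embed, word_embed, kernel = 50, 256, 5
embed = nn.Embedding(100, char_embed)                    # 100 characters in the toy alphabet
cnn = nn.Conv1d(char_embed, word_embed, kernel_size=kernel)
highway = Highway(word_embed)

chars = torch.randint(0, 100, (8, 21))                   # (batch of words, max word length)
x = embed(chars).transpose(1, 2)                         # (batch, char_embed, word_len)
conv_out = torch.relu(cnn(x)).max(dim=2).values          # max-pool over character positions
word_vectors = highway(conv_out)                         # (batch, word_embed)
```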
Projects
- Project Proposal
- Milestone Instruction
- Project Report
- Project Poster/Video
Question Answering on SQuAD
SQuAD is NOT a Natural Language Generation task (since the answer is extracted from the text).
Default final project
- handout
- starter code
- directory
Summarization
- Dataset
- Metrics
- Rouge (Recall-Oriented Understudy for Gisting Evaluation)
- with small scale human eval
- Baseline
- Simplest model
- Logistic Regression on unigrams and bigrams
- Averaging word vectors
- Lede-3 baseline
Book
O'Reilly Natural Language Processing with PyTorch
Recommend in Lecture 11
- joosthub/PyTorchNLPBook: Code and data accompanying Natural Language Processing with PyTorch published by O'Reilly Media #NLPROC – Natural Language Processing
- Course contents backup
- Software - The Stanford Natural Language Processing Group
- Others' answer
- Luvata/CS224N-2019 (almost finish all the written part as well)
- ZacBi/CS224n-2019-solutions (didn't finish the written part)
- youngmihuang/cs224n_exercise (only 2019 a1~a4 coding part)
- Observerspy/CS224n (not fully 2019)
- caijie12138/CS224n-2019 (not quite the assignment)
- ZeyadZanaty/cs224n-assignments (just coding part assignment 2, 3)
PyTorch notes
- Element-wise Product: `A * B`, `torch.mul(A, B)`, `A.mul(B)`
- Matrix Multiplication: `A @ B`, `torch.matmul(A, B)`, `torch.mm`, `torch.bmm`, ...
- RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
  - `.view()` => error (only on CPU, because `tensor.cuda()` automatically makes the tensor contiguous)
  - `.contiguous().view()` => okay
  - `.reshape()` => okay
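A tiny demo of these notes: element-wise vs. matrix product, and why `.view()` fails on a non-contiguous tensor while `.reshape()` does not:

```python
# Element-wise vs. matrix product, and the .view() contiguity pitfall.
import torch

A = torch.arange(6.0).reshape(2, 3)
B = torch.ones(2, 3)

print(A * B)                    # element-wise, same as torch.mul(A, B) / A.mul(B)
print(A @ B.t())                # matrix multiplication, same as torch.matmul(A, B.t())

T = A.t()                       # transpose shares storage => non-contiguous
try:
    T.view(-1)                  # raises RuntimeError: view size is not compatible ...
except RuntimeError as e:
    print("view failed:", e)
print(T.contiguous().view(-1))  # okay
print(T.reshape(-1))            # okay
```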