nlp_classification
Implementations of NLP papers relevant to classification, using PyTorch and gluonnlp.
The papers were implemented using Korean corpora.
Preliminary & Usage
- Preliminary
pyenv virtualenv 3.7.7 nlp
pyenv activate nlp
pip install -r requirements.txt
- Usage
python build_dataset.py
python build_vocab.py
python train.py # default training parameters
python evaluate.py # default evaluation parameters
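The usage steps above run in a fixed order, so they can be chained from a small driver script. The following is a minimal sketch, assuming it is executed from inside one model directory; `PIPELINE`, `build_commands`, and `run_pipeline` are hypothetical names used for illustration and are not part of the repository.

```python
import subprocess
import sys

# Script names and their order are taken from the usage steps above.
PIPELINE = ["build_dataset.py", "build_vocab.py", "train.py", "evaluate.py"]


def build_commands(python=sys.executable, scripts=PIPELINE):
    """Return one argv list per step, all using the same interpreter."""
    return [[python, script] for script in scripts]


def run_pipeline(cwd=".", runner=subprocess.run):
    """Run every step in order and stop at the first failure.

    `runner` is injectable so the sketch can be exercised without
    launching the real scripts.
    """
    for cmd in build_commands():
        # check=True makes subprocess.run raise on a non-zero exit code.
        runner(cmd, cwd=cwd, check=True)
```

Calling `run_pipeline(cwd="Convolutional_Neural_Networks_for_Sentence_Classification")` would then run all four scripts with their default parameters.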
Single sentence classification (sentiment classification task)
- Using the Naver sentiment movie corpus v1.0 (a.k.a. `nsmc`)
- Configuration
  - `conf/model/{type}.json` (e.g. type = ["sencnn", "charcnn", ...])
  - `conf/dataset/nsmc.json`
- Structure
# example: Convolutional_Neural_Networks_for_Sentence_Classification
├── build_dataset.py
├── build_vocab.py
├── conf
│ ├── dataset
│ │ └── nsmc.json
│ └── model
│ └── sencnn.json
├── evaluate.py
├── experiments
│ └── sencnn
│ └── epochs_5_batch_size_256_learning_rate_0.001
├── model
│ ├── data.py
│ ├── __init__.py
│ ├── metric.py
│ ├── net.py
│ ├── ops.py
│ ├── split.py
│ └── utils.py
├── nsmc
│ ├── ratings_test.txt
│ ├── ratings_train.txt
│ ├── test.txt
│ ├── train.txt
│ ├── validation.txt
│ └── vocab.pkl
├── train.py
└── utils.py
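The layout above pairs one model config with one dataset config and stores results under a directory named after the hyperparameters. The sketch below shows one way to resolve those paths; the helper names are hypothetical, and the keys inside the JSON files are not assumed here because the tree does not show them.

```python
import json
from pathlib import Path


def config_paths(model_type, dataset="nsmc", root="."):
    """Resolve the model and dataset config files for one model type."""
    conf = Path(root) / "conf"
    return conf / "model" / f"{model_type}.json", conf / "dataset" / f"{dataset}.json"


def load_json(path):
    """Read one JSON config file into a plain dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def experiment_dir(model_type, epochs, batch_size, learning_rate, root="."):
    """Rebuild a directory name shaped like the one in the tree above."""
    name = f"epochs_{epochs}_batch_size_{batch_size}_learning_rate_{learning_rate}"
    return Path(root) / "experiments" / model_type / name


print(experiment_dir("sencnn", 5, 256, 0.001).as_posix())
# experiments/sencnn/epochs_5_batch_size_256_learning_rate_0.001
```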
Model \ Accuracy | Train (120,000) | Validation (30,000) | Test (50,000) | Date |
---|---|---|---|---|
SenCNN | 91.95% | 86.54% | 85.84% | 20/05/30 |
CharCNN | 86.29% | 81.69% | 81.38% | 20/05/30 |
ConvRec | 86.23% | 82.93% | 82.43% | 20/05/30 |
VDCNN | 86.59% | 84.29% | 84.10% | 20/05/30 |
SAN | 90.71% | 86.70% | 86.37% | 20/05/30 |
ETRIBERT | 91.12% | 89.24% | 88.98% | 20/05/30 |
SKTBERT | 92.20% | 89.08% | 88.96% | 20/05/30 |
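The train and validation counts in the table sum to 150,000, which is consistent with an 80/20 hold-out split of `ratings_train.txt`. A minimal sketch of such a split follows, assuming a plain shuffled split with a fixed seed; `holdout_split` is a hypothetical helper, and the repository's own split logic may differ (for example, it could be stratified by label).

```python
import random


def holdout_split(rows, validation_ratio=0.2, seed=42):
    """Shuffle a copy of `rows` and hold out the first part for validation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_validation = round(len(rows) * validation_ratio)
    return rows[n_validation:], rows[:n_validation]


train, validation = holdout_split(range(150_000))
print(len(train), len(validation))  # 120000 30000
```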
- [x] Convolutional Neural Networks for Sentence Classification (as SenCNN)
- https://arxiv.org/abs/1408.5882
- [x] Character-level Convolutional Networks for Text Classification (as CharCNN)
- https://arxiv.org/abs/1509.01626
- [x] Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers (as ConvRec)
- https://arxiv.org/abs/1602.00367
- [x] Very Deep Convolutional Networks for Text Classification (as VDCNN)
- https://arxiv.org/abs/1606.01781
- [x] A Structured Self-attentive Sentence Embedding (as SAN)
- https://arxiv.org/abs/1703.03130
- [x] BERT_single_sentence_classification (as ETRIBERT, SKTBERT)
- https://arxiv.org/abs/1810.04805
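As a rough illustration of the core idea behind the SenCNN paper, the sketch below slides a single filter over a sequence of token vectors and then applies max-over-time pooling. It is written in plain Python for readability and is not the repository's PyTorch code; the function names and toy values are made up for this example.

```python
def conv_feature_map(embeddings, kernel, bias=0.0):
    """Slide one filter over the token axis and apply ReLU.

    embeddings: list of token vectors, shape (n_tokens, dim)
    kernel:     list of weight vectors, shape (window, dim)
    """
    window = len(kernel)
    features = []
    for start in range(len(embeddings) - window + 1):
        total = bias
        for offset in range(window):
            token, weights = embeddings[start + offset], kernel[offset]
            total += sum(t * w for t, w in zip(token, weights))
        features.append(max(0.0, total))  # ReLU non-linearity
    return features


def max_over_time(features):
    """Keep only the strongest response of the filter across positions."""
    return max(features)


tokens = [[1, 0], [0, 1], [1, 1], [0, 0]]  # 4 tokens, 2-dim toy embeddings
kernel = [[1, -1], [1, 1]]                 # one filter spanning 2 tokens
feature_map = conv_feature_map(tokens, kernel)
print(feature_map, max_over_time(feature_map))  # [2.0, 1.0, 0.0] 2.0
```

A full model would use many filters with several window sizes and feed the concatenated pooled values to a classifier layer.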
Pairwise text classification (paraphrase detection task)
- Creating dataset from https://github.com/songys/Question_pair
- Configuration
  - `conf/model/{type}.json` (e.g. type = ["siam", "san", ...])
  - `conf/dataset/qpair.json`
- Structure
# example: Siamese_recurrent_architectures_for_learning_sentence_similarity
├── build_dataset.py
├── build_vocab.py
├── conf
│ ├── dataset
│ │ └── qpair.json
│ └── model
│ └── siam.json
├── evaluate.py
├── experiments
│ └── siam
│ └── epochs_5_batch_size_64_learning_rate_0.001
├── model
│ ├── data.py
│ ├── __init__.py
│ ├── metric.py
│ ├── net.py
│ ├── ops.py
│ ├── split.py
│ └── utils.py
├── qpair
│ ├── kor_pair_test.csv
│ ├── kor_pair_train.csv
│ ├── test.txt
│ ├── train.txt
│ ├── validation.txt
│ └── vocab.pkl
├── train.py
└── utils.py
Model \ Accuracy | Train (6,136) | Validation (682) | Test (758) | Date |
---|---|---|---|---|
Siam | 93.00% | 83.13% | 83.64% | 20/05/30 |
SAN | 89.47% | 82.11% | 81.53% | 20/05/30 |
Stochastic | 89.26% | 82.69% | 80.07% | 20/05/30 |
ETRIBERT | 95.07% | 94.42% | 94.06% | 20/05/30 |
SKTBERT | 95.43% | 92.52% | 93.93% | 20/05/30 |
- [x] A Structured Self-attentive Sentence Embedding (as SAN)
- https://arxiv.org/abs/1703.03130
- [x] Siamese recurrent architectures for learning sentence similarity (as Siam)
- https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/viewPaper/12195
- [x] Stochastic Answer Networks for Natural Language Inference (as Stochastic)
- https://arxiv.org/abs/1804.07888
- [x] BERT_pairwise_text_classification (as ETRIBERT, SKTBERT)
- https://arxiv.org/abs/1810.04805
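The Siamese recurrent architecture paper scores a sentence pair by applying exp(-L1 distance) to the two encoded sentence vectors, which yields a similarity in (0, 1]. The sketch below shows that function on plain Python lists; whether the Siam model in this repository uses exactly this scoring is not stated here, so treat it as an illustration of the paper rather than of the code.

```python
import math


def manhattan_similarity(left, right):
    """exp(-L1 distance): 1.0 for identical vectors, near 0 for distant ones."""
    distance = sum(abs(a - b) for a, b in zip(left, right))
    return math.exp(-distance)


print(manhattan_similarity([0.5, 1.0], [0.5, 1.0]))  # 1.0
```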