
hindi-nli-code

Implementation of the AACL-IJCNLP 2020 paper: Two-Step Classification using Recasted Data for Low Resource Settings.


(Figure: recasted samples)


Requirements

All the code in this repo is built with PyTorch.

  • Python 3.5+
  • PyTorch 1.4.0
  • NumPy
  • pdb (part of the Python standard library)
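
A minimal environment matching these requirements can be set up with pip (the exact package versions below follow the list above but are otherwise an assumption; adjust the PyTorch build to your Python/CUDA setup as needed; pdb ships with Python and needs no install):

pip install torch==1.4.0 numpy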

Data

All the data used for experimentation is available at hindi-nli-data with train, test and development set splits.

After downloading the data, use the train_data, test_data and val_data arguments in the scripts to point to the directory containing the respective .tsv files.
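
For example, the scripts could be pointed at the downloaded splits as follows (the flag syntax and file names are illustrative assumptions; see the argument guide below for the full list of options):

python nli_train.py --train_data <data_dir>/train.tsv --test_data <data_dir>/test.tsv --val_data <data_dir>/dev.tsv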

Training

To independently train the Textual Entailment model (TE) without the joint objective, use

python nli_train.py

To train the Textual Entailment model along with Two-Step Classification (i.e. with the joint objective - TE + JO), use

python nli_train_joint.py

To train with the consistency regularization technique (+CR), set the argument is_cr=True; otherwise, set is_cr=False.
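
For example, joint training with consistency regularization enabled might be launched as follows (the flag syntax is an assumption):

python nli_train_joint.py --is_cr True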

To train the Direct Classification model, use

python clf_train.py

Testing

To evaluate the accuracy of the trained models for both Textual Entailment and Classification, run the script python evaluate.py in their respective folders.

To evaluate the inconsistency results, run the script python inconsistency.py in the Textual Entailment folder.

To evaluate the comparison results between Direct Classification and Two-Step Classification approaches, run the script python comparison.py in the Textual Entailment folder.

For results in the semi-supervised setting (reported in the appendix), train on the desired percentage of the training data while leaving the test and dev sets unmodified.
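
One simple way to build such a reduced training split, while leaving the test and dev files untouched, is sketched below (the file names, the 10% fraction and the seed are illustrative and not part of the repository):

import random

def subsample_tsv(in_path, out_path, fraction, seed=42):
    # Load all training rows from the .tsv file.
    # (If the file has a header row, set it aside before sampling.)
    with open(in_path, encoding="utf-8") as f:
        lines = f.readlines()
    # Fix the seed so the same subset is drawn across runs.
    random.seed(seed)
    # Keep a random subset of the requested size.
    keep = random.sample(lines, int(len(lines) * fraction))
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(keep)

# Keep 10% of the training data; point train_data at the new file.
subsample_tsv("train.tsv", "train_10pct.tsv", 0.10)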

The following is a guide to the command-line arguments that can be used to train with the desired setting (a sample invocation follows the list):

  • train_data - Dataset directory followed by the file containing training data
  • test_data - Dataset directory followed by the file containing test data
  • val_data - Dataset directory followed by the file containing validation data
  • n_classes_clf - Number of classes in the original classification task of the dataset being used
  • max_train_sents - Maximum number of training examples
  • max_test_sents - Maximum number of testing examples
  • max_val_sents - Maximum number of validation examples
  • n_epochs - Number of epochs to run the training for
  • n_classes - Number of classes for the textual entailment task, which is 2 irrespective of the dataset (entailed and not-entailed)
  • n-sentiment - Number of classes for the classification task
  • batch_size - Number of data samples in the batch for each iteration
  • dpout_model - Dropout rate for the encoder network
  • dpout_fc - Dropout rate for the classifier network
  • optimizer - Type of optimizer to use for training (SGD or Adam)
  • lr_shrink - Shrink factor for SGD
  • decay - Decay factor for learning rate
  • minlr - Minimum learning rate
  • is_cr - True for training with consistency regularization, otherwise False
  • embedding_size - Embedding size of the sentence embedding model used
  • max_norm - Maximum norm for the gradients
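
A sample invocation combining several of these arguments might look like the following (the flag syntax and the specific values are illustrative assumptions, not the repository's defaults):

python nli_train_joint.py --train_data <data_dir>/train.tsv --test_data <data_dir>/test.tsv --val_data <data_dir>/dev.tsv --n_classes 2 --n_classes_clf 3 --batch_size 32 --n_epochs 10 --optimizer adam --dpout_model 0.1 --dpout_fc 0.1 --is_cr True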

Bibliography

If you use our dataset or code, please cite:
@inproceedings{uppal-etal-2020-two,
    title = "Two-Step Classification using Recasted Data for Low Resource Settings",
    author = "Uppal, Shagun  and
      Gupta, Vivek  and
      Swaminathan, Avinash  and
      Zhang, Haimin  and
      Mahata, Debanjan  and
      Gosangi, Rakesh  and
      Shah, Rajiv Ratn  and
      Stent, Amanda",
    booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.aacl-main.71",
    pages = "706--719",
    abstract = "An NLP model{'}s ability to reason should be independent of language. Previous works utilize Natural Language Inference (NLI) to understand the reasoning ability of models, mostly focusing on high resource languages like English. To address scarcity of data in low-resource languages such as Hindi, we use data recasting to create NLI datasets for four existing text classification datasets. Through experiments, we show that our recasted dataset is devoid of statistical irregularities and spurious patterns. We further study the consistency in predictions of the textual entailment models and propose a consistency regulariser to remove pairwise-inconsistencies in predictions. We propose a novel two-step classification method which uses textual-entailment predictions for classification task. We further improve the performance by using a joint-objective for classification and textual entailment. We therefore highlight the benefits of data recasting and improvements on classification performance using our approach with supporting experimental results.",