NLP-power-analysis
Power analysis for NLP
This repo accompanies the paper With Little Power Comes Great Responsibility, by Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky, published at EMNLP 2020.
Here you can find the raw annotations described in the paper, code to reproduce most of the figures, and notebooks that can be adapted for running your own power analyses (including online interactive versions).
Online notebooks
To accompany the paper, we have provided some online notebooks on Google Colaboratory which can be used as starter code for running simple power analyses. These can be copied and run interactively.
At the moment, we have notebooks for the following scenarios:
- Power analysis for accuracy comparisons
- Power analysis for BLEU score comparisons
- Fitting a mixed-effects model to one of the public human evaluation datasets used in the paper (see below for additional datasets)
- Power analysis for Likert scale human evaluation data using mixed effects models
Note that some of these Google Colaboratory notebooks will require you to grant access so that they can read from your Google Drive. To do so, run the cell that contains drive.mount('/content/drive'), click the link to get an access key, and then paste it into the cell.
These notebooks are also included locally in notebooks_for_power_calculations -- the first two as IPython notebooks, and the latter two as R scripts. Note that these are suggested as a starting point, and will require some adaptation to apply to other cases.
Local Requirements
All code in this repo is provided as notebooks and scripts in Python and R. You can install the required Python packages using the requirements.txt file. Alternatively, if using conda, you can create an environment with the required packages and then launch a notebook using:
conda create -n nlp-power python=3
conda activate nlp-power
conda install numpy scipy matplotlib ipython notebook tqdm pandas statsmodels seaborn
jupyter notebook
Comparing Models on Accuracy
- Starter code for running power analyses for comparing classifiers in terms of accuracy (using hypothesized values for the models' accuracies and their rate of agreement) is provided in notebooks_for_power_calculations/accuracy.ipynb (also available as an online interactive notebook); a minimal simulation sketch of this approach appears after this list.
- Code to reproduce the related figures (3, 7, and 8) is provided in code_for_figures.
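As a rough illustration of what such a power analysis involves, here is a minimal sketch (not the repo's notebook code): given hypothesized accuracies for a baseline and a new model, plus their rate of agreement, it simulates test sets and counts how often a sign test on the disagreeing items comes out significant. The helper name and all parameter values are invented for this example.

import numpy as np
from scipy.stats import binomtest

def accuracy_power(acc_base, acc_new, agreement, n_items,
                   n_sims=2000, alpha=0.05, seed=0):
    """Hypothetical helper: estimate power to detect acc_new > acc_base
    using a sign test on the items where the two models disagree."""
    rng = np.random.default_rng(seed)
    delta = acc_new - acc_base
    # Per-item outcome probabilities implied by the hypothesized values:
    p_new_only = ((1 - agreement) + delta) / 2   # new right, baseline wrong
    p_base_only = ((1 - agreement) - delta) / 2  # baseline right, new wrong
    p_both_right = acc_base - p_base_only
    p_both_wrong = agreement - p_both_right
    probs = [p_both_right, p_new_only, p_base_only, p_both_wrong]
    assert all(p >= 0 for p in probs), "inconsistent hypothesized values"
    hits = 0
    for _ in range(n_sims):
        both_right, wins, losses, both_wrong = rng.multinomial(n_items, probs)
        if wins + losses and wins > losses \
                and binomtest(wins, wins + losses).pvalue < alpha:
            hits += 1
    return hits / n_sims

# e.g., a 2-point gain over a 90% baseline with 85% agreement, n = 1000
print(accuracy_power(0.90, 0.92, agreement=0.85, n_items=1000))

Scanning n_items upward until the returned power reaches 0.8 gives the required test set size for the hypothesized effect.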
GLUE and SQuAD 2.0 analyses
- Code and data for analyzing the reported gains extracted from papers (and the SQuAD 2.0 leaderboard) have been included. To run this online, first upload the two files in data/GLUE/ to your Google Drive, and make a copy of this online notebook (also included in this repo at analyses/GLUE/Analyze_reported_results.ipynb).
- Code for predicting overlap (used in estimating minimum detectable effect sizes) is also included at analyses/GLUE/Predict_Overlap_From_Glue.ipynb, but cannot be run, as it depends on test set predictions on the GLUE benchmark which cannot be shared; a back-of-the-envelope approximation is sketched after this list.
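For intuition about minimum detectable effect sizes, here is a back-of-the-envelope normal approximation (an assumption-laden sketch, not the repo's method): for a sign test on the items where two models disagree, the smallest detectable accuracy difference shrinks with the square root of the test set size and grows with the disagreement rate.

from math import sqrt
from scipy.stats import norm

def approx_mde(agreement, n_items, power=0.8, alpha=0.05):
    """Hypothetical helper: approximate minimum detectable accuracy
    difference for a sign test on disagreements, via the normal
    approximation delta = (z_{1-alpha/2} + z_power) * sqrt((1 - agreement) / n)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * sqrt((1 - agreement) / n_items)

# e.g., with 85% agreement and 1,000 test items:
print(approx_mde(agreement=0.85, n_items=1000))  # about 0.034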
Additional SQuAD v2.0 analyses
- Pairwise model results for SQuAD 2.0 on validation data are shared here: data/SQuAD2/models.tsv contains the models that had been submitted up to the time of writing (dev and test EM), and data/pairs_dev.tsv contains pairwise validation results (difference in number correct and number of disagreements); a small worked example using these quantities appears after this list.
- Code for the analysis reported in Appendix D is included in analyses/SQuAD2/Explore_squad_data.ipynb, but cannot be run, as it depends on pairwise results on test data, which cannot be shared.
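Given the two quantities in pairs_dev.tsv as described above (the difference in number correct and the number of disagreements for a model pair), a simple significance check is a sign test on the disagreeing items. The values below are made up for illustration, not actual leaderboard numbers:

from scipy.stats import binomtest

# Made-up values for one model pair
n_disagree = 400   # items where the two models disagree
diff = 60          # difference in number correct (model A minus model B)

wins = (n_disagree + diff) // 2  # items model A gets right and B gets wrong
print(binomtest(wins, n_disagree).pvalue)  # two-sided sign test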
Machine Translation
- Code for computing power on BLEU scores (using hypothesized values for the parameters of the Laplace-plus-delta mixture described in the paper) is provided in notebooks_for_power_calculations/BLEU.ipynb (also available as an online interactive notebook); a toy simulation along these lines is sketched after this list.
- Code for estimating the Laplace-delta mixture parameters from data is provided in code_for_figures/Figure 15 (and 16).ipynb.
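As a toy illustration of simulation-based power for BLEU (a sketch only; the notebook's actual procedure may differ): sentence-level BLEU differences are drawn from a delta-at-zero plus Laplace mixture, and each simulated test set is called significant if a percentile bootstrap interval for the mean difference excludes zero. The function name and all parameter values here are invented.

import numpy as np

def bleu_power(p0, loc, scale, n_sent, n_sims=500,
               n_boot=1000, alpha=0.05, seed=0):
    """Hypothetical helper: power to detect a positive mean sentence-level
    BLEU difference, with differences drawn from a mixture of a point
    mass at zero (probability p0) and a Laplace(loc, scale)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        zero = rng.random(n_sent) < p0
        diffs = np.where(zero, 0.0, rng.laplace(loc, scale, n_sent))
        # Percentile bootstrap CI for the mean difference
        idx = rng.integers(0, n_sent, size=(n_boot, n_sent))
        lo, hi = np.quantile(diffs[idx].mean(axis=1),
                             [alpha / 2, 1 - alpha / 2])
        if lo > 0:  # interval excludes zero: significant improvement
            hits += 1
    return hits / n_sims

# e.g., 40% ties, small positive Laplace-distributed differences elsewhere
print(bleu_power(p0=0.4, loc=0.3, scale=2.0, n_sent=2000))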
Human Evaluation
There are several components to the human evaluation materials:
- The notebooks in data_import can be used to download and preprocess the publicly available human evaluation datasets used in this paper. This is a necessary first step before running most of the R scripts below.
- Code for fitting mixed effects models to these datasets is included in notebooks_for_power_calculations/estimate_parameters_from_datasets.R
- One combined example (loading the data and fitting a model) is included as an online notebook
- The code we have used for power simulations is provided as notebooks_for_power_calculations/power_sim.R, and is also provided as an online notebook; a simplified Python sketch of this kind of simulation appears after this list.
- Finally, the meta-analysis of EMNLP 2019 results is provided in analyses/human_eval/meta_analysis_submit.R, with code for Figures 5 and 6 provided in the code_for_figures directory.
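For the human evaluation case, here is a simplified Python sketch of simulation-based power for a two-system comparison on Likert-style scores (power_sim.R is the reference implementation; it fits crossed worker and item random effects with lme4, whereas this sketch uses statsmodels with workers as the single grouping factor, treats scores as continuous, and all variance components are invented):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def likert_power(effect, sd_worker, sd_item, sd_resid, n_workers, n_items,
                 n_sims=100, alpha=0.05, seed=0):
    """Hypothetical helper: simulate scores with worker and item effects,
    fit a mixed model, and count significant system effects.
    (Small n_sims to keep runtime modest.)"""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        w = rng.normal(0, sd_worker, n_workers)
        v = rng.normal(0, sd_item, n_items)
        rows = [(i, j, sys,
                 3.0 + effect * sys + w[i] + v[j] + rng.normal(0, sd_resid))
                for i in range(n_workers)
                for j in range(n_items)
                for sys in (0, 1)]
        df = pd.DataFrame(rows, columns=["worker", "item", "system", "score"])
        fit = smf.mixedlm("score ~ system", df, groups=df["worker"]).fit(reml=False)
        if fit.pvalues["system"] < alpha:
            hits += 1
    return hits / n_sims

# e.g., 10 workers each scoring 30 items from both systems
print(likert_power(effect=0.3, sd_worker=0.5, sd_item=0.5, sd_resid=1.0,
                   n_workers=10, n_items=30))

Because item variance is not modeled here, this will misestimate power relative to a crossed-effects fit; the R scripts in the repo handle that properly.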
Code for Figures
- Code to reproduce figures from the paper which are based on simulations and/or public data is provided as notebooks in code_for_figures.
Reference
@inproceedings{card2020power,
title = {With Little Power Comes Great Responsibility},
author = {Dallas Card and Peter Henderson and Urvashi Khandelwal and Robin Jia and Kyle Mahowald and Dan Jurafsky},
booktitle = {Proceedings of EMNLP},
year = 2020,
url = {https://arxiv.org/abs/2010.06595},
}