M2D2: A Massively Multi-domain Language Modeling Dataset (EMNLP 2022) by Machel Reid, Victor Zhong, Suchin Gururangan, Luke Zettlemoyer
M2D2: A Massively Multi-domain Language Modeling Dataset
Scripts and data links for M2D2: A Massively Multi-domain Language Modeling Dataset (EMNLP 2022) by Machel Reid, Victor Zhong, Suchin Gururangan, and Luke Zettlemoyer.
Data
Update: The data is now hosted on the Hugging Face Hub as machelreid/m2d2.
To load the dataset use the following steps:
pip install --upgrade datasets
import datasets
dataset = datasets.load_dataset("machelreid/m2d2", "cs.CL") # replace cs.CL with the domain of your choice
print(dataset['train'][0]['text'])
~~We're currently exploring ways to host this large amount of data online in an accessible manner, so please stay tuned! If you would like to access sooner, feel free to reach out at {machelreid}-{at}-{google-dot-com}.~~
Evaluation Sets
Feel free to download the test sets for all domains at this Google Drive link, or via gdown:
#!/bin/bash
# install and/or upgrade gdown with pip
pip install --upgrade gdown
# Download M2D2 test sets
gdown "1U5wki_V-IFQy733HC6NO5ZuM2jaOaw8y"
tar -xvzf m2d2_test_sets.tar.gz
# File structure
# m2d2_test_sets/
# ├─ DOMAIN_AA/
# │ ├─ test.txt
# ├─ DOMAIN_AB/
# │ ├─ test.txt
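Once the archive is extracted, the layout above can be walked programmatically. The following is a minimal sketch, assuming the archive unpacks to a directory with one subdirectory per domain, each holding a test.txt file as shown in the file structure; the helper name list_test_files is hypothetical and not part of this repository.

```python
from pathlib import Path


def list_test_files(root):
    """Map each domain directory name under `root` to its test.txt path.

    Hypothetical helper: assumes the layout <root>/<DOMAIN>/test.txt.
    Domain directories without a test.txt file are skipped.
    """
    root = Path(root)
    found = {}
    for domain_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        test_file = domain_dir / "test.txt"
        if test_file.is_file():
            found[domain_dir.name] = test_file
    return found
```

Calling list_test_files("m2d2_test_sets") after extraction should return one entry per domain, which can then be passed to whatever evaluation code you use.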
Reproduction Scripts for Modeling
Find scripts for fine-tuning language models in lm_scripts/adapt.sh. Furthermore, we provide a meta-script, lm_scripts/generate_multiple.sh, for generating scripts for multiple domains given an input file containing a list of directories with domain-specific data (train.txt and valid.txt should exist within each directory). Respective instructions/parameters are included in each file.
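The input file for the meta-script can be assembled by scanning a data directory for domains that have both required files. The sketch below assumes one directory path per line, which is a guess at the expected format; check the instructions inside lm_scripts/generate_multiple.sh before relying on it. The function name write_domain_list and its arguments are hypothetical.

```python
from pathlib import Path


def write_domain_list(data_root, list_path):
    """Write one domain directory per line to `list_path`.

    Only directories directly under `data_root` that contain both
    train.txt and valid.txt are listed. The one-path-per-line format
    is an assumption, not something this repository documents here.
    """
    data_root = Path(data_root)
    domains = [
        d for d in sorted(data_root.iterdir())
        if d.is_dir()
        and (d / "train.txt").is_file()
        and (d / "valid.txt").is_file()
    ]
    Path(list_path).write_text("".join(f"{d}\n" for d in domains))
    return domains
```

Skipping incomplete directories up front avoids generating scripts for domains that would fail at training time.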
For validation on multiple files, we also include lm_scripts/validate_on_multiple_files.py for calculating perplexity measures given a file containing a list of evaluation text files and a model checkpoint.
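For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that arithmetic, including a token-weighted aggregate over several files. It is an illustration of the standard definition, not a description of what lm_scripts/validate_on_multiple_files.py does internally; the function names are hypothetical, and the summed negative log-likelihoods (in nats) are assumed to come from scoring each file with a model checkpoint.

```python
import math


def perplexity(total_nll, num_tokens):
    """Return exp(total_nll / num_tokens), with total_nll in nats."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return math.exp(total_nll / num_tokens)


def corpus_perplexity(per_file_stats):
    """Token-weighted perplexity over several files.

    per_file_stats: iterable of (total_nll, num_tokens) pairs, one per file.
    Summing before dividing weights every token equally, so long files
    count for more than short ones.
    """
    total_nll = 0.0
    total_tokens = 0
    for nll, n in per_file_stats:
        total_nll += nll
        total_tokens += n
    return perplexity(total_nll, total_tokens)
```

Averaging per-file perplexities directly would give a different number; which aggregate is appropriate depends on whether files or tokens should be weighted equally.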
Helper Scripts for Wikipedia Data Collection
For Wikipedia data collection, we include scripts for data dump processing (data_scripts/wiki/get_data), ontology gathering (data_scripts/wiki/ontology), and generating splits (data_scripts/wiki/split_generation).
Helper Scripts for S2ORC Data Collection
To be uploaded with documentation
Scripts to reproduce analyses in the paper
To be uploaded with documentation