# Improving Neural Topic Models using Knowledge Distillation
Repo for our EMNLP 2020 paper. We will clean up the implementation to improve ease of use; for the time being, we provide the code from our original submission.
If you use this code, please use the following citation:

```bibtex
@inproceedings{hoyle-etal-2020-improving,
    title = "Improving Neural Topic Models Using Knowledge Distillation",
    author = "Hoyle, Alexander Miserlis and
      Goel, Pranav and
      Resnik, Philip",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.137",
    pages = "1752--1771",
}
```
## Rough Steps
- As of now, you'll need two conda environments to run both the BERT teacher and the topic-model student (which is a modification of Scholar). The environment files are defined in `teacher/teacher.yml` and `scholar/scholar.yml` for the teacher and topic model, respectively. For example:

  ```
  conda env create -f teacher/teacher.yml
  ```

  (Edit the first line of the `yml` file if you want to change the name of the resulting environment; the default is `transformers28`.)
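  The student environment can presumably be created the same way; we assume the default name in `scholar/scholar.yml` is `scholar`, which matches the `conda activate scholar` calls below:

  ```bash
  # create and activate the student (topic model) environment
  conda env create -f scholar/scholar.yml
  conda activate scholar
  ```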
- We use the data-processing pipeline from Scholar. We'll use the IMDb data to serve as a guide (preprocessing scripts for the Wikitext and 20ng data are also included for replication purposes, but those scripts aren't general-purpose):

  ```
  conda activate scholar
  python data/imdb/download_imdb.py

  # main preprocessing script
  python preprocess_data.py data/imdb/train.jsonlist data/imdb/processed --vocab_size 5000 --test data/imdb/test.jsonlist

  # create a dev split from the train data -- change filenames if using different data
  python create_dev_split.py
  ```

  The later steps read from `./data/imdb/processed-dev`, which we take to be the output of the dev-split step.
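  For reference, a `.jsonlist` file holds one JSON object per line. A minimal sketch of what a line might look like (the fields beyond `text` are illustrative, not confirmed from the repo):

  ```json
  {"id": "train_0001", "text": "One of the best films I have seen in years...", "sentiment": "pos"}
  ```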
- Run the teacher model; below is an example using IMDb:

  ```
  conda activate transformers28
  python teacher/bert_reconstruction.py \
      --input-dir ./data/imdb/processed-dev \
      --output-dir ./data/imdb/processed-dev/logits \
      --do-train \
      --evaluate-during-training \
      --truncate-dev-set-for-eval 120 \
      --logging-steps 200 \
      --save-steps 1000 \
      --num-train-epochs 6 \
      --seed 42 \
      --num-workers 4 \
      --batch-size 20 \
      --gradient-accumulation-steps 8 \
      --document-split-pooling mean-over-logits
  ```
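  Note that with `--batch-size 20` and `--gradient-accumulation-steps 8`, the effective batch size is 160. As a reading aid, here is a minimal sketch (ours, not the repo's code) of what `--document-split-pooling mean-over-logits` implies: a document longer than BERT's input limit is split into segments, and the per-segment vocabulary logits are averaged into one document-level vector:

  ```python
  import torch

  def pool_segment_logits(segment_logits: torch.Tensor) -> torch.Tensor:
      """Sketch: average per-segment teacher logits into document-level logits.

      segment_logits: (num_segments, vocab_size) tensor, one row per
      BERT-sized segment of a long document. Returns a (vocab_size,)
      tensor: the "mean-over-logits" pooled document representation.
      """
      return segment_logits.mean(dim=0)
  ```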
- Collect the logits from the teacher model (the `--checkpoint-folder-pattern` argument accepts glob pattern matching in case you want to create logits for different stages of training; be sure to enclose the pattern in double quotes `"`):
  ```
  conda activate transformers28
  python teacher/bert_reconstruction.py \
      --output-dir ./data/imdb/processed-dev/logits \
      --seed 42 \
      --num-workers 6 \
      --get-reps \
      --checkpoint-folder-pattern "checkpoint-9000" \
      --save-doc-logits \
      --no-dev
  ```
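  For instance, to collect logits from every saved checkpoint rather than a single one, a quoted glob such as `"checkpoint-*"` should work (the pattern itself is illustrative):

  ```bash
  python teacher/bert_reconstruction.py \
      --output-dir ./data/imdb/processed-dev/logits \
      --seed 42 \
      --num-workers 6 \
      --get-reps \
      --checkpoint-folder-pattern "checkpoint-*" \
      --save-doc-logits \
      --no-dev
  ```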
- Run the topic model (there are a number of extraneous experimental arguments in `run_scholar.py`, which we intend to strip out in a future version):
  ```
  conda activate scholar
  python scholar/run_scholar.py \
      ./data/imdb/processed-dev \
      --dev-metric npmi \
      -k 50 \
      --epochs 500 \
      --patience 500 \
      --batch-size 200 \
      --background-embeddings \
      --device 0 \
      --dev-prefix dev \
      -lr 0.002 \
      --alpha 0.5 \
      --eta-bn-anneal-step-const 0.25 \
      --doc-reps-dir ./data/imdb/processed-dev/logits/checkpoint-9000/doc_logits \
      --use-doc-layer \
      --no-bow-reconstruction-loss \
      --doc-reconstruction-weight 0.5 \
      --doc-reconstruction-temp 1.0 \
      --doc-reconstruction-logit-clipping 10.0 \
      -o ./outputs/imdb
  ```
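  To make the distillation flags concrete, here is a minimal sketch, in our own words rather than the repo's exact code, of how the knowledge-distillation term plausibly uses `--doc-reconstruction-temp`, `--doc-reconstruction-logit-clipping`, and `--doc-reconstruction-weight`: teacher logits are clipped and temperature-softened into a target distribution, and the student's document reconstruction is pushed toward it with a weighted cross-entropy (with `--no-bow-reconstruction-loss`, the usual bag-of-words reconstruction term is turned off):

  ```python
  import torch
  import torch.nn.functional as F

  def kd_reconstruction_loss(student_logits, teacher_logits,
                             temp=1.0, clip=10.0, weight=0.5):
      """Sketch of a KD loss: cross-entropy between the student's
      reconstruction distribution and the teacher's clipped,
      temperature-softened logits."""
      target = F.softmax(teacher_logits.clamp(-clip, clip) / temp, dim=-1)
      log_student = F.log_softmax(student_logits, dim=-1)
      return -weight * (target * log_student).sum(dim=-1).mean()
  ```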