PreSumm
This code is for the EMNLP 2019 paper Text Summarization with Pretrained Encoders.
Updates Jan 22 2020: You can now summarize raw text input. Switch to the dev branch, use `-mode test_text`, and pass your text file with `-text_src $RAW_SRC.TXT`. Please keep using the master branch for normal training and evaluation; the dev branch should only be used for `test_text` mode.
- For abstractive summarization use `-task abs`; for extractive summarization use `-task ext`.
- Use `-test_from $PT_FILE$` to select your model checkpoint file.
- Format of the source text file:
  - For abstractive summarization, each line is a document.
  - For extractive summarization, insert `[CLS] [SEP]` as your sentence boundaries.
- There are example input files in the raw_data directory.
- If you also have reference summaries aligned with your source input, use `-text_tgt $RAW_TGT.TXT` to keep the order for evaluation.
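As a rough illustration of the extractive input format described above, the sketch below joins pre-split sentences into one line per document, with `[CLS] [SEP]` between sentences. The helper names are hypothetical, and the exact spacing around the markers is an assumption; compare the result against the example files in raw_data before relying on it.

```python
from pathlib import Path

# Assumption: one space on each side of the boundary markers.
SENT_BOUNDARY = " [CLS] [SEP] "


def format_extractive_line(sentences):
    """Join the sentences of one document into a single input line."""
    # Collapse internal whitespace and drop empty sentences.
    cleaned = [" ".join(s.split()) for s in sentences if s.strip()]
    return SENT_BOUNDARY.join(cleaned)


def write_source_file(documents, path):
    """Write one document per line; `documents` is a list of sentence lists."""
    lines = [format_extractive_line(doc) for doc in documents]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

The resulting file would then be passed via `-text_src`. For abstractive summarization no markers are needed, since each line is simply a document.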
Results on CNN/DailyMail (20/8/2019):
| Models | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| Extractive | | | |
| TransformerExt | 40.90 | 18.02 | 37.17 |
| BertSumExt | 43.23 | 20.24 | 39.63 |
| BertSumExt (large) | 43.85 | 20.34 | 39.90 |
| Abstractive | | | |
| TransformerAbs | 40.21 | 17.76 | 37.09 |
| BertSumAbs | 41.72 | 19.39 | 38.76 |
| BertSumExtAbs | 42.13 | 19.60 | 39.18 |
Python version: This code is written for Python 3.6.
Package Requirements: torch==1.1.0 pytorch_transformers tensorboardX multiprocess pyrouge
Updates: To encode a text longer than 512 tokens (for example, 800 tokens), set max_pos to 800 during both preprocessing and training.
Some code is borrowed from ONMT (https://github.com/OpenNMT/OpenNMT-py).
Trained Models
System Outputs
Data Preparation For XSum
Data Preparation For CNN/Dailymail
Option 1: download the processed data
Unzip the zipfile and put all .pt files into bert_data.
Option 2: process the data yourself
Step 1. Download Stories
Download and unzip the stories directories from here for both CNN and Daily Mail. Put all .story files in one directory (e.g. ../raw_stories)
Step 2. Download Stanford CoreNLP
We will need Stanford CoreNLP to tokenize the data. Download it here and unzip it. Then add the following command to your bash_profile:
export CLASSPATH=/path/to/stanford-corenlp-full-2017-06-09/stanford-corenlp-3.8.0.jar
replacing /path/to/ with the path to where you saved the stanford-corenlp-full-2017-06-09 directory.
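Before running the tokenization step, it can help to confirm that the CLASSPATH setting above is actually visible to your shell. The following is a minimal sketch of such a check; both function names are hypothetical, and matching on the `stanford-corenlp` file-name prefix is an assumption based on the jar name shown above.

```python
import os


def find_corenlp_jar(environ=None):
    """Return the first CLASSPATH entry that looks like a CoreNLP jar, or None."""
    environ = os.environ if environ is None else environ
    for entry in environ.get("CLASSPATH", "").split(os.pathsep):
        name = os.path.basename(entry)
        # Assumption: the jar keeps its default "stanford-corenlp-*.jar" name.
        if name.startswith("stanford-corenlp") and name.endswith(".jar"):
            return entry
    return None


def corenlp_ready(environ=None):
    """True only if a CoreNLP jar is on CLASSPATH and the file exists on disk."""
    jar = find_corenlp_jar(environ)
    return jar is not None and os.path.isfile(jar)


if __name__ == "__main__":
    print("CoreNLP jar on CLASSPATH:", find_corenlp_jar())
    print("Ready for tokenization:", corenlp_ready())
```

If the second line prints `False`, the export is probably missing from the current shell session or the path still contains the `/path/to/` placeholder.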
Step 3. Sentence Splitting and Tokenization
python preprocess.py -mode tokenize -raw_path RAW_PATH -save_path TOKENIZED_PATH
`RAW_PATH` is the directory containing story files (`../raw_stories`), `TOKENIZED_PATH` is the target directory to save the generated tokenized json files (`../merged_stories_tokenized`)
Step 4. Format to Simpler Json Files
python preprocess.py -mode format_to_lines -raw_path RAW_PATH -save_path JSON_PATH -n_cpus 1 -use_bert_basic_tokenizer false -map_path MAP_PATH
`RAW_PATH` is the directory containing tokenized files (`../merged_stories_tokenized`), `JSON_PATH` is the target directory to save the generated json files (`../json_data/cnndm`), `MAP_PATH` is the directory containing the urls files (`../urls`)
Step 5. Format to PyTorch Files
python preprocess.py -mode format_to_bert -raw_path JSON_PATH -save_path BERT_DATA_PATH -lower -n_cpus 1 -log_file ../logs/preprocess.log
`JSON_PATH` is the directory containing json files (`../json_data`), `BERT_DATA_PATH` is the target directory to save the generated binary files (`../bert_data`)
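Steps 3 to 5 form a chain in which each step reads what the previous one wrote. The sketch below shows one possible way to assemble the three commands from a single set of paths. The function and parameter names are hypothetical; the flags are copied from the commands above. Note that the example values differ between Step 4 (`../json_data/cnndm`) and Step 5 (`../json_data`), so the sketch keeps them as two separate parameters.

```python
import subprocess


def build_preprocess_commands(raw_path, tokenized_path, json_prefix, json_dir,
                              bert_data_path, map_path,
                              log_file="../logs/preprocess.log"):
    """Return the Step 3-5 commands as argument lists, in execution order."""
    base = ["python", "preprocess.py"]
    return [
        # Step 3: sentence splitting and tokenization
        base + ["-mode", "tokenize", "-raw_path", raw_path,
                "-save_path", tokenized_path],
        # Step 4: format to simpler json files
        base + ["-mode", "format_to_lines", "-raw_path", tokenized_path,
                "-save_path", json_prefix, "-n_cpus", "1",
                "-use_bert_basic_tokenizer", "false", "-map_path", map_path],
        # Step 5: format to PyTorch files
        base + ["-mode", "format_to_bert", "-raw_path", json_dir,
                "-save_path", bert_data_path, "-lower", "-n_cpus", "1",
                "-log_file", log_file],
    ]


def run_preprocessing(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


if __name__ == "__main__":
    run_preprocessing(build_preprocess_commands(
        "../raw_stories", "../merged_stories_tokenized",
        "../json_data/cnndm", "../json_data", "../bert_data", "../urls"))
```

Keeping `dry_run=True` by default lets you inspect the exact commands before committing to a long preprocessing run.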
Model Training
First run: The first time you train, use a single GPU so that the code can download the BERT model. Use `-visible_gpus -1`; once the download has finished, you can kill the process and rerun the code with multiple GPUs.
Extractive Setting
python train.py -task ext -mode train -bert_data_path BERT_DATA_PATH -ext_dropout 0.1 -model_path MODEL_PATH -lr 2e-3 -visible_gpus 0,1,2 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -train_steps 50000 -accum_count 2 -log_file ../logs/ext_bert_cnndm -use_interval true -warmup_steps 10000 -max_pos 512
Abstractive Setting
TransformerAbs (baseline)
python train.py -mode train -accum_count 5 -batch_size 300 -bert_data_path BERT_DATA_PATH -dec_dropout 0.1 -log_file ../../logs/cnndm_baseline -lr 0.05 -model_path MODEL_PATH -save_checkpoint_steps 2000 -seed 777 -sep_optim false -train_steps 200000 -use_bert_emb true -use_interval true -warmup_steps 8000 -visible_gpus 0,1,2,3 -max_pos 512 -report_every 50 -enc_hidden_size 512 -enc_layers 6 -enc_ff_size 2048 -enc_dropout 0.1 -dec_layers 6 -dec_hidden_size 512 -dec_ff_size 2048 -encoder baseline -task abs
BertAbs
python train.py -task abs -mode train -bert_data_path BERT_DATA_PATH -dec_dropout 0.2 -model_path MODEL_PATH -sep_optim true -lr_bert 0.002 -lr_dec 0.2 -save_checkpoint_steps 2000 -batch_size 140 -train_steps 200000 -report_every 50 -accum_count 5 -use_bert_emb true -use_interval true -warmup_steps_bert 20000 -warmup_steps_dec 10000 -max_pos 512 -visible_gpus 0,1,2,3 -log_file ../logs/abs_bert_cnndm
BertExtAbs
python train.py -task abs -mode train -bert_data_path BERT_DATA_PATH -dec_dropout 0.2 -model_path MODEL_PATH -sep_optim true -lr_bert 0.002 -lr_dec 0.2 -save_checkpoint_steps 2000 -batch_size 140 -train_steps 200000 -report_every 50 -accum_count 5 -use_bert_emb true -use_interval true -warmup_steps_bert 20000 -warmup_steps_dec 10000 -max_pos 512 -visible_gpus 0,1,2,3 -log_file ../logs/abs_bert_cnndm -load_from_extractive EXT_CKPT
`EXT_CKPT` is the saved `.pt` checkpoint of the extractive model.
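The BertAbs and BertExtAbs commands above are identical except for the trailing `-load_from_extractive EXT_CKPT`. A small sketch that makes this relationship explicit is given below; the function name and its keyword arguments are hypothetical, while the flag values are taken from the commands above.

```python
def build_abs_train_command(bert_data_path, model_path, ext_ckpt=None,
                            visible_gpus="0,1,2,3",
                            log_file="../logs/abs_bert_cnndm"):
    """Build the BertAbs training command; pass ext_ckpt for BertExtAbs."""
    cmd = ["python", "train.py", "-task", "abs", "-mode", "train",
           "-bert_data_path", bert_data_path, "-dec_dropout", "0.2",
           "-model_path", model_path, "-sep_optim", "true",
           "-lr_bert", "0.002", "-lr_dec", "0.2",
           "-save_checkpoint_steps", "2000", "-batch_size", "140",
           "-train_steps", "200000", "-report_every", "50",
           "-accum_count", "5", "-use_bert_emb", "true",
           "-use_interval", "true", "-warmup_steps_bert", "20000",
           "-warmup_steps_dec", "10000", "-max_pos", "512",
           "-visible_gpus", visible_gpus, "-log_file", log_file]
    if ext_ckpt is not None:
        # BertExtAbs: initialize from a trained extractive checkpoint.
        cmd += ["-load_from_extractive", ext_ckpt]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_abs_train_command("BERT_DATA_PATH", "MODEL_PATH")))
    print(" ".join(build_abs_train_command("BERT_DATA_PATH", "MODEL_PATH",
                                           ext_ckpt="EXT_CKPT")))
```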
Model Evaluation
CNN/DM
python train.py -task abs -mode validate -batch_size 3000 -test_batch_size 500 -bert_data_path BERT_DATA_PATH -log_file ../logs/val_abs_bert_cnndm -model_path MODEL_PATH -sep_optim true -use_interval true -visible_gpus 1 -max_pos 512 -max_length 200 -alpha 0.95 -min_length 50 -result_path ../logs/abs_bert_cnndm
XSum
python train.py -task abs -mode validate -batch_size 3000 -test_batch_size 500 -bert_data_path BERT_DATA_PATH -log_file ../logs/val_abs_bert_cnndm -model_path MODEL_PATH -sep_optim true -use_interval true -visible_gpus 1 -max_pos 512 -min_length 20 -max_length 100 -alpha 0.9 -result_path ../logs/abs_bert_cnndm
- `-mode` can be {`validate`, `test`}, where `validate` will inspect the model directory and evaluate the model for each newly saved checkpoint, and `test` needs to be used with `-test_from`, indicating the checkpoint you want to use.
- `MODEL_PATH` is the directory of saved checkpoints.
- Use `-mode validate` with `-test_all`, and the system will load all saved checkpoints and select the top ones to generate summaries (this will take a while).
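The two evaluation commands above differ only in their decoding settings (`-min_length`, `-max_length`, `-alpha`). The sketch below collects those settings per dataset and enforces the `-mode` rules just described. All names in it are hypothetical, and passing `-test_all` as a bare flag is an assumption based on how it is written above.

```python
# Decoding settings copied from the CNN/DM and XSum commands above.
DECODING_PRESETS = {
    "cnndm": {"min_length": 50, "max_length": 200, "alpha": 0.95},
    "xsum": {"min_length": 20, "max_length": 100, "alpha": 0.9},
}


def build_eval_command(dataset, bert_data_path, model_path, mode="validate",
                       test_from=None, test_all=False, visible_gpus="1",
                       log_file="../logs/val_abs_bert_cnndm",
                       result_path="../logs/abs_bert_cnndm"):
    """Build an abstractive evaluation command for the given dataset."""
    if mode not in ("validate", "test"):
        raise ValueError("mode must be 'validate' or 'test'")
    if mode == "test" and test_from is None:
        raise ValueError("-mode test requires a checkpoint via test_from")
    if test_all and mode != "validate":
        raise ValueError("-test_all is meant to be used with -mode validate")
    preset = DECODING_PRESETS[dataset]
    cmd = ["python", "train.py", "-task", "abs", "-mode", mode,
           "-batch_size", "3000", "-test_batch_size", "500",
           "-bert_data_path", bert_data_path, "-log_file", log_file,
           "-model_path", model_path, "-sep_optim", "true",
           "-use_interval", "true", "-visible_gpus", visible_gpus,
           "-max_pos", "512",
           "-min_length", str(preset["min_length"]),
           "-max_length", str(preset["max_length"]),
           "-alpha", str(preset["alpha"]),
           "-result_path", result_path]
    if mode == "test":
        cmd += ["-test_from", test_from]
    if test_all:
        cmd.append("-test_all")  # assumption: bare flag, no value
    return cmd


if __name__ == "__main__":
    print(" ".join(build_eval_command("xsum", "BERT_DATA_PATH", "MODEL_PATH")))
```

XSum uses shorter length limits than CNN/DM, so a preset table keeps the two configurations from being mixed up when switching datasets.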