
Unable to reproduce results

alexgaskell10 opened this issue on Jun 26, 2020 · 3 comments

I have been unable to reproduce the results shown in the paper. I have trained the model for 20k steps and the loss fell nicely throughout training. When I generate the output summaries, however (using --mode=decode; I presume this is correct?), they are not good. An illustrative output summary is shown below. If I resume from that checkpoint and train the model further, the loss is reported as NaN and training stops.
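For reference, my decode invocation is roughly the sketch below; test.bin is my assumption for the name of the test split, and the model flags are copied from the training command further down:

python run_summarization.py \
    --mode=decode \
    --data_path=$DATA_DIR/test.bin \
    --vocab_path=$DATA_DIR/vocab \
    --log_root=logroot \
    --exp_name=exp \
    --max_dec_steps=210 --max_enc_steps=2500 \
    --num_sections=5 --max_section_len=500 \
    --batch_size=1 --vocab_size=50000 \
    --hier=True --split_intro=True --fixed_attn=True --legacy_encoder=False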

What am I missing here? The command I use to train the model is:

python run_summarization.py \
    --mode=train \
    --data_path=$DATA_DIR/train.bin \
    --vocab_path=$DATA_DIR/vocab \
    --log_root=logroot \
    --exp_name=exp \
    --max_dec_steps=210 \
    --max_enc_steps=2500 \
    --num_sections=5 \
    --max_section_len=500 \
    --batch_size=1 \
    --vocab_size=50000 \
    --use_do=True \
    --optimizer=adagrad \
    --do_prob=0.25 \
    --hier=True \
    --split_intro=True \
    --fixed_attn=True \
    --legacy_encoder=False \
    --coverage=False \
    --lr=0.05

Illustrative example:

background : of of . . the the under of either of the private public private medicine medicine private private other has successfully investigated . here we by case of this by in first chronic chronic of of the the private of the patients [UNK] 75 symptoms causing the . it history . this is method the first successful chronic mortality.19 without chronic chronic of . , [ , the condition the percentage of adult . without mortality.19 mortality.19 the with without without without without without of without without without without of other private private the other . results results the would suggest and identifying private private private of malignancy improve increases . we also demonstrated the susceptibility and new new elderly this report chronic chronic . chronic of of with with asthma4 without asthma4 the significantly higher . there , it it greater greater than .

alexgaskell10 (Jun 26, 2020)

I think at 20K steps the model is still undertrained. I suggest starting with a smaller section length and fewer sections, and then increasing those for the final steps. Something like --max_section_len=400, --num_sections=4, --max_dec_steps=100, --max_enc_steps=1600; a two-stage sketch is below.
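The step counts and the resume behavior here are my assumptions (this assumes training picks up from the latest checkpoint under --log_root/--exp_name when relaunched), so treat it as a sketch rather than a tested recipe:

# Stage 1: shorter sequences for most of training
python run_summarization.py \
    --mode=train \
    --data_path=$DATA_DIR/train.bin --vocab_path=$DATA_DIR/vocab \
    --log_root=logroot --exp_name=exp \
    --max_section_len=400 --num_sections=4 \
    --max_dec_steps=100 --max_enc_steps=1600 \
    --batch_size=1 --vocab_size=50000 \
    --use_do=True --do_prob=0.25 --optimizer=adagrad --lr=0.05 \
    --hier=True --split_intro=True --fixed_attn=True \
    --legacy_encoder=False --coverage=False

# Stage 2: stop stage 1, then relaunch with the full lengths so training
# continues from the stage-1 checkpoint
python run_summarization.py \
    --mode=train \
    --data_path=$DATA_DIR/train.bin --vocab_path=$DATA_DIR/vocab \
    --log_root=logroot --exp_name=exp \
    --max_section_len=500 --num_sections=5 \
    --max_dec_steps=210 --max_enc_steps=2500 \
    --batch_size=1 --vocab_size=50000 \
    --use_do=True --do_prob=0.25 --optimizer=adagrad --lr=0.05 \
    --hier=True --split_intro=True --fixed_attn=True \
    --legacy_encoder=False --coverage=False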

armancohan (Jun 27, 2020)

Thanks for getting back to me. Two follow-up questions:

  1. Should I start training from scratch with this setup, or resume from my latest checkpoint?
  2. Will this approach help prevent training from being corrupted by the loss becoming NaN?

alexgaskell10 (Jun 29, 2020)

I would start from scratch. I also remember seeing some NaN issues, although this was a while ago (as far as I recall, NaNs were more likely to occur with longer sequences). If it happens again, one option is to fail fast instead of training past the blow-up; see the sketch below.
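This is not the repo's actual training loop, just a minimal sketch of a non-finite-loss guard; run_train_step is a hypothetical helper that performs one optimizer step and returns the scalar loss:

import math

def train(model, batches, max_steps):
    """Training loop with a guard against non-finite losses (sketch)."""
    for step, batch in enumerate(batches, start=1):
        loss = run_train_step(model, batch)  # hypothetical: one optimizer step
        # Abort before a corrupted checkpoint is written and training
        # silently continues from a NaN state.
        if math.isnan(loss) or math.isinf(loss):
            raise RuntimeError(
                f"Non-finite loss {loss} at step {step}; restore an earlier "
                "checkpoint and reduce lr or sequence lengths")
        if step >= max_steps:
            break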

armancohan (Jun 30, 2020)