
Unable to reproduce results

alexgaskell10 opened this issue on Jun 26, 2020 · 3 comments

I have been unable to reproduce the results shown in the paper. I have trained the model for 20k steps and the loss fell nicely throughout training. When I generate the output summaries, however (using --mode=decode; I presume this is correct?), they are not good. An illustrative output summary is shown below. If I resume from that checkpoint and train the model further, the loss is reported as NaN and training stops.
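For reference, my decode invocation is roughly the sketch below; test.bin is my assumption for the name of the test split, and the model flags are copied from the training command further down:

python run_summarization.py \
    --mode=decode \
    --data_path=$DATA_DIR/test.bin \
    --vocab_path=$DATA_DIR/vocab \
    --log_root=logroot \
    --exp_name=exp \
    --max_dec_steps=210 --max_enc_steps=2500 \
    --num_sections=5 --max_section_len=500 \
    --batch_size=1 --vocab_size=50000 \
    --hier=True --split_intro=True --fixed_attn=True --legacy_encoder=False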

What am I missing here? The command I use to train the model is:

python run_summarization.py \
    --mode=train \
    --data_path=$DATA_DIR/train.bin \
    --vocab_path=$DATA_DIR/vocab \
    --log_root=logroot \
    --exp_name=exp \
    --max_dec_steps=210 \
    --max_enc_steps=2500 \
    --num_sections=5 \
    --max_section_len=500 \
    --batch_size=1 \
    --vocab_size=50000 \
    --use_do=True \
    --optimizer=adagrad \
    --do_prob=0.25 \
    --hier=True \
    --split_intro=True \
    --fixed_attn=True \
    --legacy_encoder=False \
    --coverage=False \
    --lr=0.05

Illustrative example:

background : of of . . the the under of either of the private public private medicine medicine private private other has successfully investigated . here we by case of this by in first chronic chronic of of the the private of the patients [UNK] 75 symptoms causing the . it history . this is method the first successful chronic mortality.19 without chronic chronic of . , [ , the condition the percentage of adult . without mortality.19 mortality.19 the with without without without without without of without without without without of other private private the other . results results the would suggest and identifying private private private of malignancy improve increases . we also demonstrated the susceptibility and new new elderly this report chronic chronic . chronic of of with with asthma4 without asthma4 the significantly higher . there , it it greater greater than .

alexgaskell10 (Jun 26, 2020)

I think at 20K steps the model is still undertrained. I suggest starting with a smaller section length and fewer sections, and then increasing those for the final steps. Something like --max_section_len=400, --num_sections=4, --max_dec_steps=100, --max_enc_steps=1600; a two-stage sketch is below.
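The step counts and the resume behavior here are my assumptions (this assumes training picks up from the latest checkpoint under --log_root/--exp_name when relaunched), so treat it as a sketch rather than a tested recipe:

# Stage 1: shorter sequences for most of training
python run_summarization.py \
    --mode=train \
    --data_path=$DATA_DIR/train.bin --vocab_path=$DATA_DIR/vocab \
    --log_root=logroot --exp_name=exp \
    --max_section_len=400 --num_sections=4 \
    --max_dec_steps=100 --max_enc_steps=1600 \
    --batch_size=1 --vocab_size=50000 \
    --use_do=True --do_prob=0.25 --optimizer=adagrad --lr=0.05 \
    --hier=True --split_intro=True --fixed_attn=True \
    --legacy_encoder=False --coverage=False

# Stage 2: stop stage 1, then relaunch with the full lengths so training
# continues from the stage-1 checkpoint
python run_summarization.py \
    --mode=train \
    --data_path=$DATA_DIR/train.bin --vocab_path=$DATA_DIR/vocab \
    --log_root=logroot --exp_name=exp \
    --max_section_len=500 --num_sections=5 \
    --max_dec_steps=210 --max_enc_steps=2500 \
    --batch_size=1 --vocab_size=50000 \
    --use_do=True --do_prob=0.25 --optimizer=adagrad --lr=0.05 \
    --hier=True --split_intro=True --fixed_attn=True \
    --legacy_encoder=False --coverage=False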

armancohan (Jun 27, 2020)

Thanks for getting back to me. Two follow-up questions:

  1. Should I start training from scratch with this setup, or resume from my latest checkpoint?
  2. Will this approach help prevent training from being corrupted by the loss becoming NaN?

alexgaskell10 (Jun 29, 2020)

I would start from scratch. I also remember seeing some NaN issues, although this was a while ago (as far as I recall, NaNs were more likely to occur with longer sequences). If it happens again, one option is to fail fast instead of training past the blow-up; see the sketch below.
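This is not the repo's actual training loop, just a minimal sketch of a non-finite-loss guard; run_train_step is a hypothetical helper that performs one optimizer step and returns the scalar loss:

import math

def train(model, batches, max_steps):
    """Training loop with a guard against non-finite losses (sketch)."""
    for step, batch in enumerate(batches, start=1):
        loss = run_train_step(model, batch)  # hypothetical: one optimizer step
        # Abort before a corrupted checkpoint is written and training
        # silently continues from a NaN state.
        if math.isnan(loss) or math.isinf(loss):
            raise RuntimeError(
                f"Non-finite loss {loss} at step {step}; restore an earlier "
                "checkpoint and reduce lr or sequence lengths")
        if step >= max_steps:
            break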

armancohan (Jun 30, 2020)