
CNN/DailyMail data: predicted summaries are lists of single words and lead to a ROUGE score of zero

Open shandou opened this issue 5 years ago • 11 comments

Thank you very much for providing the latest updates to the repo. I am still having trouble training the model on a small subset of the CNN/DailyMail data. At inference time, the model keeps producing predictions that are lists of single words. More details below:

  1. How I ran the code:
train_and_eval.py --infer_source_file /home/shan/datasets/NLP/dev_CNNDM_sequenceGGNN/jsonl/test/inputs.jsonl.gz --infer_predictions_file /home/shan/datasets/NLP/dev_CNNDM_sequenceGGNN/jsonl/test/predictions.jsonl
  2. The spurious single-word predictions:
Validation predictions...
[['at'], ['at'], ['at'], ['at'], ['5.3million'], ['5.3million'], ['5.3million'], ['at'], ['at'], ['at'], ['at'], ['at'], ['5.3million'], ['at'], ['at'], ['at'], ['at'], ['at'], ['at'], ['5.3million'], ['at'], ['at'], ['5.3million'], ['at'], ['at'], ['at'], ['at'], ['at'], ['at'], ['at'], ['at'], ['at'], ['at'], ['5.3million'], ['at'], ['5.3million'], ['at'], ['at'], ['at'], ['5.3million'], ['at'], ['5.3million'], ['5.3million'], ['at'], ['rehahn'], ['at'], ['at'], ['at'], ['at'], ['at'], ['at'], ['at'], ['rehahn'], ['at'],
(rest of stdout omitted)

The target summaries, by contrast, are parsed properly. For example:

Targets...
[['Lord', 'Mervyn', 'Davies,', '62,', 'was', 'at', 'a', 'Royal', 'Academy', 'of', 'Arts', 'party', 'last', 'night.', 'Singer', 'Usher', 'had', 'been', 'speaking', 'to', 'group', 'of', 'young', 'people', 'at', 'charity', 'event.', 'Labour', 'peer', 'showed', 'off', 'his', 'fancy', 'footwork', 'on', 'the', 'dance', 'floor.', 'Usher', 'will', 'finish', 'his', 'tour', 'with', 'a', 'concert', 'at', 'the', 'O2', 'tonight', '.'], ['Craig', 'MacLean,', '22,', 'was', 'on', 'flight', 'to', 'Abu', 'Dhabi', 'when', 'staff', 'called', 'for', 'doctor.', 'The', 'medical', 'student', 'stepped', 'in', 'to', 'help', 'when', 'man', 'suffered', 'a', 'cardiac', 'arrest.', 'Dundee', 'University', 'student', 'started', 'trying', 'to', 'revive', 'the', 'passenger', 'at', '36,000ft.', 'KLM', 'flight', 'from', 'Scotland', 'diverted', 'to', 'Turkey', 'and', 'man', 'received', 'medical', 'care.'],
(and so on)
  3. The error messages: the ROUGE score ends up being zero and training quickly reports an error:
eval loss: 8.41, eval rouge: 0.00
early stopping triggered...
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/workspace/GGNN_text_summarizer/train_and_eval.py in <module>
    625 
    626 if __name__ == "__main__":
--> 627     main()

~/workspace/GGNN_text_summarizer/train_and_eval.py in main()
    212 
    213     if args.infer_source_file is not None:
--> 214         infer(model, args)
    215 
    216 

~/workspace/GGNN_text_summarizer/train_and_eval.py in infer(model, args)
    487         # saver = tf.train.Saver(max_to_keep=100)
    488         saver = tf.train.Saver(max_to_keep=1)
--> 489         saver.restore(session, os.path.join(args.checkpoint_dir, "best.ckpt"))
    490 
    491         # build eval graph, loss and prediction ops

~/software/anaconda3/envs/tensorflow/lib/python3.7/site-packages/tensorflow/python/training/saver.py in restore(self, sess, save_path)
   1266     if not checkpoint_management.checkpoint_exists(compat.as_text(save_path)):
   1267       raise ValueError("The passed save_path is not a valid checkpoint: "
-> 1268                        + compat.as_text(save_path))
   1269 
   1270     logging.info("Restoring parameters from %s", compat.as_text(save_path))

ValueError: The passed save_path is not a valid checkpoint: cnndailymail_summarizer/best.ckpt

Would you mind providing some insights on what might have caused this issue? Thanks!
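
In case it helps with debugging, the ValueError at the end is just the restore failing because no best.ckpt was ever written to checkpoint_dir. Below is a minimal sketch of how the restore call could be guarded so the failure is clearer (assuming TF 1.x and the same checkpoint_dir argument; this is only an illustration, not the repo's code):

```python
import os
import tensorflow as tf

def restore_best_checkpoint(session, saver, checkpoint_dir):
    """Restore best.ckpt if present, else fall back to the latest checkpoint."""
    best_prefix = os.path.join(checkpoint_dir, "best.ckpt")
    if tf.train.checkpoint_exists(best_prefix):  # TF 1.x helper
        saver.restore(session, best_prefix)
        return best_prefix
    # Fall back to whatever checkpoint training actually wrote, if any.
    latest = tf.train.latest_checkpoint(checkpoint_dir)
    if latest is None:
        raise ValueError(
            "No checkpoint found in %s -- was a checkpoint ever saved?" % checkpoint_dir)
    saver.restore(session, latest)
    return latest
```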

shandou avatar May 02 '19 14:05 shandou

Hmm, that's weird. Did you run into similar problems, @ioana-blue?

CoderPat avatar May 03 '19 20:05 CoderPat

No, right now I get coherent messages that are not aligned with the code, so I get about 0.14 ROUGE-2. However, I'm trying this on a tiny dataset (about 10k training samples).
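
(Side note on the exact zero earlier in the thread: ROUGE-2 counts bigram overlap, and a one-token prediction such as ['at'] contains no bigrams, so it can never score above zero. A rough sketch of the metric for illustration, not the evaluation code this repo actually calls:)

```python
from collections import Counter

def rouge_2_f1(prediction_tokens, reference_tokens):
    """Rough ROUGE-2 F1: bigram overlap between a prediction and a reference."""
    def bigrams(tokens):
        return Counter(zip(tokens, tokens[1:]))

    pred_bigrams = bigrams(prediction_tokens)
    ref_bigrams = bigrams(reference_tokens)
    overlap = sum((pred_bigrams & ref_bigrams).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_bigrams.values())
    recall = overlap / sum(ref_bigrams.values())
    return 2 * precision * recall / (precision + recall)

# A single-word prediction has no bigrams, hence a ROUGE-2 of exactly 0:
print(rouge_2_f1(["at"], ["Singer", "Usher", "had", "been", "speaking", "at", "a", "charity", "event"]))  # 0.0
```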

ioana-blue avatar May 03 '19 23:05 ioana-blue

Thank you very much for getting back to me 😃 It could be that both my training set and the number of training iterations are too small (I kept them small to make sure I am using the entire pipeline properly). @ioana-blue @CoderPat, in your opinion:

  1. If you have to provide a ballpark estimate: What is the minimum viable training data size for the NLP task?
  2. How many training steps would you need to start to get sensible predictions?
  3. Are the default hyperparameters in train_and_eval a good starting point for the NLP task?

I also find the CoreNLP annotation step computationally intensive. Could you share some insight into why this is the case, and whether structural annotations in general are expected to be this expensive?

Thanks a lot!!

shandou avatar May 04 '19 00:05 shandou

Even though a good amount of data is necessary for good results, you shouldn't be seeing the same words over and over. The default hyperparameters were tuned for the full dataset, and one thing I've noticed about graph models is that they are much more sensitive to hyperparameter choices. I think I'll have some free time in the upcoming weeks, so I'll try to retrain the model on the full cnn_dailymail dataset to see if I catch any more bugs, and I will try to upload a checkpoint.

CoderPat avatar May 04 '19 00:05 CoderPat

Great!! I'll also tinker more in parallel and check with you again later. Thanks a lot! :)

shandou avatar May 04 '19 05:05 shandou

@shandou I think I found the problem: it's some weird issue with TensorFlow not checkpointing some variables (might be caused by a newer TensorFlow version). I assume @ioana-blue doesn't hit it since she does inference in the same run as training. I'll try to investigate and fix it soon.
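
If anyone wants to check this on their own run, one way to see which variables actually made it into the checkpoint versus what the inference graph expects is to diff the two name sets (a sketch, assuming TF 1.x; the checkpoint path below is just the one from the traceback above):

```python
import tensorflow as tf

def diff_graph_vs_checkpoint(checkpoint_prefix):
    """Print variables in the graph but missing from the checkpoint, and vice versa."""
    ckpt_names = {name for name, _shape in tf.train.list_variables(checkpoint_prefix)}
    graph_names = {v.op.name for v in tf.global_variables()}
    print("In graph but not in checkpoint:", sorted(graph_names - ckpt_names))
    print("In checkpoint but not in graph:", sorted(ckpt_names - graph_names))

# Call this after building the inference graph, e.g.:
# diff_graph_vs_checkpoint("cnndailymail_summarizer/best.ckpt")
```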

CoderPat avatar May 09 '19 05:05 CoderPat

Many thanks for looking into this! I haven't been able to spend much time on the code this week 😞 but plan to come back to it this weekend. Please keep me posted!

shandou avatar May 09 '19 05:05 shandou

That's right, so far I've been doing inference in the same run. But at some point it would be nice to do inference after loading a checkpoint. In fact, I did run it like this but only for debugging (not looking at overall accuracy).

ioana-blue avatar May 09 '19 11:05 ioana-blue

Hello, I want to discuss some issues with you. Can I talk to you privately? Do you have an email or WeChat? Thank you!

shellycsy avatar May 11 '19 07:05 shellycsy

Sure, it is in my GitHub profile

CoderPat avatar May 12 '19 12:05 CoderPat

I have the same problem as you:
eval loss: 7.43, eval rouge: 0.00
early stopping triggered...

Were you able to solve this problem?

shellycsy avatar May 25 '19 08:05 shellycsy