Oracle entity in Table 2 VS. Oracle keywords in Table 7

Open lifelongeek opened this issue 3 years ago • 5 comments

I am trying to reproduce the ROUGE scores on CNNDM for the "oracle keywords" setting in Table 7. The "oracle entity" setting in Table 2 sounds similar to "oracle keywords" in Table 7, yet the ROUGE scores are very different. Could you explain how these two settings differ?

lifelongeek, Jun 21 '21 02:06

Hi,

"Oracle entity" in Table 2 uses only the entity words in the ground-truth target, while "oracle keywords" contains non-entity words as well, as described in the paper.

jxhe, Jun 21 '21 03:06

Thanks for the clarification. I have some follow-up questions.

  1. Does example_dataset/test.oraclewordns correspond to the "oracle keywords"?
  2. Are the "longest sub-sequences" used for training the automatic keyword extractor the same as the "oracle keywords"?

[screenshot]

lifelongeek, Jun 21 '21 13:06

  1. Yes, example_dataset/test.oraclewordns corresponds to the "oracle keywords".
  2. The keywords used for training the automatic keyword extractor are the "oracle keywords". Strictly speaking, though, "oracle keywords" are not exactly the "longest sub-sequences": as described in your screenshot, "we remove duplicate words and stop words and keep the remaining tokens as keywords".

jxhe, Jul 28 '21 08:07
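
For readers following along, the post-processing described in the reply above can be sketched roughly as below. This is a minimal illustration and not the repository's code. It assumes the "longest sub-sequences" are longest common subsequences between each source sentence and the reference summary, it skips whatever sentence selection the paper performs beforehand, and `STOP_WORDS`, `longest_common_subsequence`, and `oracle_keywords` are hypothetical names with a deliberately tiny stop-word list.

```python
from typing import List, Sequence

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = frozenset(
    {"a", "an", "the", "of", "in", "on", "to", "and", "is", "was", "for", "with"}
)


def longest_common_subsequence(source: Sequence[str], target: Sequence[str]) -> List[str]:
    """Return one longest common subsequence of two token sequences."""
    n, m = len(source), len(target)
    # dp[i][j] holds the LCS length of source[i:] and target[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if source[i] == target[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to recover the matched tokens.
    i = j = 0
    matched: List[str] = []
    while i < n and j < m:
        if source[i] == target[j]:
            matched.append(source[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched


def oracle_keywords(
    source_sentences: Sequence[Sequence[str]], summary_tokens: Sequence[str]
) -> List[str]:
    """Collect matched tokens, then drop stop words and duplicates."""
    keywords: List[str] = []
    seen = set()
    for sentence in source_sentences:
        for token in longest_common_subsequence(sentence, summary_tokens):
            key = token.lower()
            if key in STOP_WORDS or key in seen:
                continue
            seen.add(key)
            keywords.append(token)
    return keywords


source = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["a", "dog", "chased", "the", "cat"],
]
summary = ["the", "cat", "chased", "a", "dog", "on", "the", "mat"]
keywords = oracle_keywords(source, summary)
```

The sketch makes the distinction in the reply concrete: the matched subsequence for the first sentence still contains "the" and "on", and only the filtering step turns it into keywords.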

Hi,

I have a quick follow-up question on this point. For "oracle entities", which NER tool did you use to extract the oracle entities from the reference summary?

Thanks a lot!!

Wendy-Xiao, Jul 04 '22 23:07

Hi, we use stanza for NER. You can find some examples here: https://github.com/salesforce/ctrl-sum/blob/6468beaaceebf463b492992fffef0e4f693a3281/scripts/preprocess.py#L890

jxhe, Jul 05 '22 13:07
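
As a rough illustration of what extracting entities from a reference summary with stanza might look like, here is a minimal sketch. It is not taken from the linked `preprocess.py`, which remains the authoritative reference. The helper names `unique_entity_texts` and `oracle_entities` are hypothetical, and the case-insensitive de-duplication is an assumption rather than something stated in this thread. The stanza import is deferred so the helper can be read and tested without the English models installed (`pip install stanza`, then `stanza.download("en")`).

```python
from typing import Callable, Iterable, List, Optional, Tuple


def unique_entity_texts(mentions: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep the first occurrence of each entity string, ignoring case.

    De-duplication is an assumption made for this sketch.
    """
    seen = set()
    result: List[str] = []
    for text, _label in mentions:
        key = text.lower()
        if key not in seen:
            seen.add(key)
            result.append(text)
    return result


def oracle_entities(summary: str, nlp: Optional[Callable] = None) -> List[str]:
    """Run NER over a reference summary and return de-duplicated entity strings."""
    if nlp is None:
        # Imported lazily: building the pipeline needs the downloaded English models.
        import stanza

        nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")
    doc = nlp(summary)
    # Each stanza entity span exposes its surface text and its entity type.
    return unique_entity_texts((ent.text, ent.type) for ent in doc.ents)
```

When processing a whole dataset, building the pipeline once and passing it in as `nlp` avoids reloading the models for every summary.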