WikiHow-Dataset

Summaries that are not actually summaries

Open astariul opened this issue 5 years ago • 2 comments

Because of how WikiHow presents its articles:

Summary

Text

Some articles are written in such a way that what's supposed to be the summary is actually the main content, and what's supposed to be the text is just a clarifying detail.

For example:

Be well-mannered, patient and sensitive to other people's feelings and opinions/beliefs: Saying, "You have a right to your own opinion."

will often lower the harshness of the argument of the two sides.


This is a problem with the data itself. My question is:

Is there an efficient way to detect such samples, so that they can be removed?

For example, one could compute a ROUGE score between the summary and the text. But this might not be a good solution, because some summaries are too abstractive to get a good ROUGE score...
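To make the idea concrete, here is a minimal sketch of a ROUGE-1-style unigram overlap between a summary and its text. It is a simplified stand-in, not the official ROUGE implementation (no stemming, no stopword handling), and the names `tokenize` and `rouge1` are hypothetical. By assumption, recall is measured against the summary and precision against the text.

```python
import re
from collections import Counter


def tokenize(s):
    # Lowercase and keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", s.lower())


def rouge1(summary, text):
    """Return (precision, recall, f1) of the unigram overlap.

    Assumed convention: recall is the share of summary tokens found in
    the text, precision is the share of text tokens found in the summary.
    """
    summary_counts = Counter(tokenize(summary))
    text_counts = Counter(tokenize(text))
    # Clipped overlap: each token counts at most as often as it
    # appears on both sides.
    overlap = sum((summary_counts & text_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(text_counts.values())
    recall = overlap / sum(summary_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# The sample quoted above.
example_summary = (
    "Be well-mannered, patient and sensitive to other people's feelings "
    'and opinions/beliefs: Saying, "You have a right to your own opinion."'
)
example_text = "will often lower the harshness of the argument of the two sides."
print(rouge1(example_summary, example_text))
```

On the sample above the two sides share no tokens, so all three scores are 0.0. A low score alone cannot separate such samples from legitimately abstractive summaries, so any cut-off would have to be chosen by inspecting the data.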

Any ideas?

astariul, Mar 21 '19 06:03

According to WikiHow's guidelines for writing an article, each line needs to be followed by at least a few sentences that further describe it (hopefully, many writers follow the guidelines when writing their articles). However, there is no guarantee that all the articles in the dataset follow these guidelines, so such cases do exist in the data. Using a threshold to remove short summaries (as stated in issue #12) can help remove these cases, as they are usually not followed by long sentences. But I don’t believe there is an easy way to measure how well the summary sentences summarize the articles, other than manually evaluating all pairs.
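A length-based filter along these lines could be sketched as follows. The exact criterion from issue #12 is not reproduced here, so the function name `keep_pair`, the parameters `min_text_words` and `max_summary_to_text_ratio`, and their default values are all assumptions for illustration only.

```python
def word_count(s):
    # Whitespace-separated word count; punctuation stays attached to words.
    return len(s.split())


def keep_pair(summary, text, min_text_words=10, max_summary_to_text_ratio=1.0):
    """Return True if a (summary, text) pair passes the length checks.

    Both thresholds are hypothetical and should be tuned on the data:
    - the text must have at least `min_text_words` words;
    - the summary must not be longer than the text by more than the
      given ratio.
    """
    n_summary = word_count(summary)
    n_text = word_count(text)
    if n_text < min_text_words:
        return False
    return n_summary / n_text <= max_summary_to_text_ratio


# The sample quoted in the issue: the "text" is shorter than its "summary".
issue_summary = (
    "Be well-mannered, patient and sensitive to other people's feelings "
    'and opinions/beliefs: Saying, "You have a right to your own opinion."'
)
issue_text = "will often lower the harshness of the argument of the two sides."
print(keep_pair(issue_summary, issue_text))
```

With these default values the sample from the issue is rejected, because its text is shorter than its summary. This only catches cases that differ in length; it says nothing about how well a summary actually summarizes its text.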

mahnazkoupaee, Mar 21 '19 17:03

Thanks for your insight.

Let's keep it open in case someone comes up with an idea :)

astariul, Mar 21 '19 23:03