
Extract paragraphs from news articles

Open Jh00ni opened this issue 2 years ago • 1 comments

Hi,

thank you for the great R-package. I really like it. However, I have run into some issues when trying to extract paragraphs from news articles.

I need the news articles split into paragraphs, so I ran the following code:

# Transform into paragraphs
paragraphs_df <- lnt_convert(LNToutput, to = "data.frame", what = "Paragraphs")

When I look at the results, some of the paragraphs are formed with sentences cut in half. For example, the paragraph "The core consumer price index the CPI minus volatile food and energy prices rose just 2.2% in 1997. That's down from a 2.6% rise in 1996 and the smallest gain since 1965 when it rose 1.5 %." was split into three different lines in paragraphs_df.

"The core consumer price index the CPI minus volatile food and"
"energy prices rose just 2.2% in 1997. That's down from a 2.6%"
"rise in 1996 and the smallest gain since 1965 when it rose 1.5 %."

Is there a way to fix this? How can I get the paragraphs from articles?

Thanks.

Jh00ni avatar Aug 21 '23 13:08 Jh00ni

Could it be that the code needs an additional if-statement to check whether the paragraph ends with a period?
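
A minimal sketch of that idea in base R, applied to a plain character vector rather than to the package internals. The helper name `merge_fragments` is hypothetical and not part of LexisNexisTools; it simply glues consecutive fragments together until one of them ends with sentence-final punctuation.

```r
# Hypothetical helper (NOT part of LexisNexisTools): join consecutive text
# fragments until a fragment ends with ".", "!" or "?".
merge_fragments <- function(fragments) {
  if (length(fragments) == 0) return(character(0))
  # TRUE where a fragment ends a sentence (optionally followed by a closing
  # quote or parenthesis and trailing whitespace)
  ends_sentence <- grepl("[.!?][\"')]*[[:space:]]*$", fragments)
  # Start a new group after every fragment that closes a sentence
  group <- cumsum(c(TRUE, head(ends_sentence, -1)))
  # Paste the fragments of each group back into a single string
  unname(vapply(split(fragments, group), paste, character(1), collapse = " "))
}

fragments <- c(
  "The core consumer price index the CPI minus volatile food and",
  "energy prices rose just 2.2% in 1997. That's down from a 2.6%",
  "rise in 1996 and the smallest gain since 1965 when it rose 1.5 %.",
  "A second, made-up paragraph."
)
merged <- merge_fragments(fragments)
```

This is only a heuristic: a wrapped line that happens to end on a period would still be treated as a paragraph end, and text without final punctuation (headlines, for instance) would be merged into whatever follows.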

Jh00ni avatar Aug 21 '23 14:08 Jh00ni

I can't really check without the specific file. But I changed the behaviour of lnt_read a little to include proper line breaks in the output.

If your files are formatted like this:

The core consumer price index the CPI minus volatile food and
energy prices rose just 2.2% in 1997. That's down from a 2.6%
rise in 1996 and the smallest gain since 1965 when it rose 1.5 %.

This will now show up as

"The core consumer price index the CPI minus volatile food and\nenergy prices rose just 2.2% in 1997. That's down from a 2.6%\nrise in 1996 and the smallest gain since 1965 when it rose 1.5 %."

Which should make it easier to debug.
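
Once the line breaks are visible, one possible way to undo the hard wrapping is sketched below in base R. It assumes that real paragraph breaks appear as two or more consecutive line breaks and that a single `\n` is only a wrapped line; this is an assumption about the input files, not documented package behaviour. The `article` string, including its second paragraph, is made up for illustration.

```r
# Example text: hard-wrapped lines inside a paragraph, blank line between
# paragraphs (assumed layout, adjust to what your files actually contain).
article <- paste0(
  "The core consumer price index the CPI minus volatile food and\n",
  "energy prices rose just 2.2% in 1997. That's down from a 2.6%\n",
  "rise in 1996 and the smallest gain since 1965 when it rose 1.5 %.",
  "\n\n",
  "A second, made-up paragraph."
)

# Replace a lone line break (not next to another line break) with a space
unwrapped <- gsub("(?<!\n)\n(?!\n)", " ", article, perl = TRUE)

# Split on the remaining runs of two or more line breaks
paragraphs <- strsplit(unwrapped, "\n{2,}", perl = TRUE)[[1]]
```

If the files separate paragraphs with a single line break as well, this approach cannot tell the two cases apart, and a punctuation-based check would be needed instead.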

JBGruber avatar Apr 14 '24 19:04 JBGruber