LexisNexisTools
Extract paragraphs from news articles
Hi,
thank you for the great R-package. I really like it. However, I have run into some issues when trying to extract paragraphs from news articles.
I need the news articles split into paragraphs, so I ran the following code:
# Transform into paragraphs
paragraphs_df <- lnt_convert(LNToutput, to = "data.frame", what = "Paragraphs")
When I look at the results, some of the paragraphs are split in the middle of a sentence. For example, the paragraph "The core consumer price index the CPI minus volatile food and energy prices rose just 2.2% in 1997. That's down from a 2.6% rise in 1996 and the smallest gain since 1965 when it rose 1.5 %." was split into three separate rows in paragraphs_df:
"The core consumer price index the CPI minus volatile food and" "energy prices rose just 2.2% in 1997. That's down from a 2.6%" "rise in 1996 and the smallest gain since 1965 when it rose 1.5 %."
Is there a way to fix this? How can I get the paragraphs from articles?
Thanks.
Could it be that the code needs an additional if-statement to check whether the paragraph ends with a period?
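A minimal sketch of that idea in base R, applied after conversion rather than inside the package: fragments are joined until one ends in sentence-final punctuation. The helper name merge_wrapped is hypothetical and not part of LexisNexisTools, and the rule is only a heuristic. A wrapped line that happens to end with a period in the middle of a paragraph would still be treated as a paragraph end.

```r
# Hypothetical helper (not part of LexisNexisTools): join consecutive
# fragments until one ends in sentence-final punctuation.
merge_wrapped <- function(fragments) {
  out <- character(0)
  buffer <- ""
  for (frag in fragments) {
    buffer <- if (nzchar(buffer)) paste(buffer, frag) else frag
    # Treat ., ! or ? (optionally followed by a closing quote or bracket)
    # at the end of the buffer as the end of a paragraph.
    if (grepl("[.!?][\"')]?\\s*$", buffer)) {
      out <- c(out, buffer)
      buffer <- ""
    }
  }
  # Keep any trailing fragment that never reached a sentence end.
  if (nzchar(buffer)) out <- c(out, buffer)
  out
}

fragments <- c(
  "The core consumer price index the CPI minus volatile food and",
  "energy prices rose just 2.2% in 1997. That's down from a 2.6%",
  "rise in 1996 and the smallest gain since 1965 when it rose 1.5 %."
)
merged <- merge_wrapped(fragments)
```

Here merged holds a single string containing the whole paragraph from the example above.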
I can't really check without the specific file. But I changed the behaviour of lnt_read a little so that it includes proper line breaks in the output.
If your files are formatted like this:
The core consumer price index the CPI minus volatile food and
energy prices rose just 2.2% in 1997. That's down from a 2.6%
rise in 1996 and the smallest gain since 1965 when it rose 1.5 %.
This will now show up as
"The core consumer price index the CPI minus volatile food and\nenergy prices rose just 2.2% in 1997. That's down from a 2.6%\nrise in 1996 and the smallest gain since 1965 when it rose 1.5 %."
Which should make it easier to debug.
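Once the line breaks are preserved, one way to rebuild paragraphs is to split on blank lines and then collapse the remaining single line breaks into spaces. This is only a sketch in base R. It assumes that paragraphs in the source file are separated by a blank line, which may not hold for every LexisNexis export, and the second paragraph in the sample text is made up for illustration.

```r
# Assumption: paragraphs are separated by a blank line, while hard-wrapped
# lines inside a paragraph are separated by a single "\n".
article <- paste0(
  "The core consumer price index the CPI minus volatile food and\n",
  "energy prices rose just 2.2% in 1997. That's down from a 2.6%\n",
  "rise in 1996 and the smallest gain since 1965 when it rose 1.5 %.",
  "\n\n",
  "A second, made-up paragraph used only for illustration."
)

# Split on blank lines first, so paragraph boundaries survive ...
paragraphs <- strsplit(article, "\\n\\s*\\n")[[1]]
# ... then turn the remaining single line breaks into spaces.
paragraphs <- trimws(gsub("\\s*\\n\\s*", " ", paragraphs))
```

With this input, paragraphs is a character vector of length two, and the first element is the full paragraph on one line.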