Parser Returns Wrong Article-body on Multi-Article Pages
Issue by xoffey
Thu Jul 26 00:07:22 2018
Originally opened as https://github.com/codelucas/newspaper/issues/601
In a dataset that included 986 articles from LA Times, 443 (44.9%) contained parse errors that caused the wrong body text to be found. We hand-analyzed the first 20 articles, which contained 12 Wrong-Body errors. Of those 12, 10 occurred on article pages that contain other articles following the lead article (Multi-Article Pages). We therefore estimate that about 83% of the parse errors (roughly 370 of the 443 affected articles, out of 986 in total) occurred on Multi-Article Pages.
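The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch of the extrapolation; the variable names are ours, and because the hand-checked sample contains just 12 errors, the result should be read as a rough figure.

```python
# Counts reported in this issue.
total_articles = 986      # LA Times articles in the dataset
wrong_body_errors = 443   # articles where the wrong body text was found
sampled_errors = 12       # Wrong-Body errors among the first 20 hand-checked articles
sampled_multi = 10        # of those, errors that occurred on Multi-Article Pages

# Share of sampled errors that fell on Multi-Article Pages.
multi_share = sampled_multi / sampled_errors

# Extrapolate that share to all Wrong-Body errors in the dataset.
estimated_multi = round(wrong_body_errors * multi_share)

print(f"error rate: {wrong_body_errors / total_articles:.1%}")   # 44.9%
print(f"multi-article share of errors: {multi_share:.1%}")       # 83.3%
print(f"estimated multi-article errors: {estimated_multi}")      # 369
```

The result, 369, is the "about 370" figure quoted above.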
Here is a detailed example of a page with this issue:
url = http://www.latimes.com/entertainment/la-et-entertainment-news-updates-a-star-is-born-maggie-gyllenhaal-turns-1510554633-htmlstory.html
title = A Star Is Born: Maggie Gyllenhaal turns 40 today
body = Julia Louis-Dreyfus in "Veep." (Lacey Terrell / HBO) Production on the seventh and final season of HBO's "Veep" has been postponed while its star, Julia Louis-Dreyfus, undergoes treatment for breast cancer. "We're obviously postponing production of the show. We were supposed to have started now, while she's in treatment," journalist Frank Rich, who is an executive producer on the Emmy-winning HBO series, said on SiriusXM's "Press Pool" on Wednesday. "But the expectation is that we will shoot again. We have one more season we're doing, which we're incredibly excited about." In September, the iconic "Seinfeld" alum found out she had breast cancer the day after winning her sixth consecutive Emmy Award for playing career politician Selina Meyer. She went public with her diagnosis days later. HBO, which had already announced plans to end the series in 2018, said Louis-Dreyfus' diagnosis had no bearing on the decision. The premium cable network also said that it would adjust production as needed.
It should have found:
body = I am not very trusting of directors. I go in with my fists up -- or at least my cards really close to my chest, because I have been burned before. I find that directors have a hard time believing that a young actress is going to have an artistic opinion that is worth something. MAGGIE GYLLENHAAL, 2003
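A minimal sketch of how this first example could be checked, assuming Newspaper (newspaper3k) is installed and the page is still reachable. `normalize` and `starts_with_expected` are hypothetical helpers written for this report, not part of Newspaper; only `Article`, `download()`, `parse()`, `title` and `text` come from the library.

```python
import re

URL = ("http://www.latimes.com/entertainment/"
       "la-et-entertainment-news-updates-a-star-is-born-"
       "maggie-gyllenhaal-turns-1510554633-htmlstory.html")
EXPECTED_START = "I am not very trusting of directors."


def normalize(text):
    """Collapse runs of whitespace so line breaks do not affect the comparison."""
    return re.sub(r"\s+", " ", text).strip()


def starts_with_expected(body, expected_start):
    """Hypothetical helper: True if the extracted body opens with the expected text."""
    return normalize(body).startswith(normalize(expected_start))


def extract(url):
    """Download and parse one page with Newspaper (needs network access)."""
    from newspaper import Article  # imported lazily; requires newspaper3k
    article = Article(url)
    article.download()
    article.parse()
    return article.title, article.text
```

With the behavior described in this issue, `title, body = extract(URL)` followed by `starts_with_expected(body, EXPECTED_START)` would be expected to return False, since the extracted body opens with the "Veep" story instead of the Gyllenhaal quote.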
Here is a second detailed example:
url = http://www.latimes.com/politics/la-na-pol-essential-washington-updates-president-trump-suggests-taking-the-guns-1519871936-htmlstory.html
title = 'Take the guns first,' Trump says in contentious meeting with lawmakers
body = Nancy Pelosi (D-San Francisco) wields the speaker's gavel after being elected as the first female speaker of the House in 2007. (Chip Somodevilla / Getty Images) House Minority Leader Nancy Pelosi’s mark on history will soon become part of the Smithsonian, with a donation of three items related to her swearing-in as the first woman to serve as speaker of the House. Pelosi will donate a lacquered maple gavel, the burgundy pantsuit she wore and a copy of the speech she gave on the morning of Jan. 4, 2007, to the Smithsonian’s National Museum of American History. She gave up the job four years later after Republicans won a majority and took control of the House. Democrats are itching to regain the 24 seats they need to retake the House and potentially put the San Francisco Democrat back in the speaker’s chair during her 17th term in the House.
It should have found:
body = Take the guns first, go through due process second. President Trump In a contentious White House meeting about guns Wednesday, President Trump suggested law enforcement officials should be able to confiscate people’s firearms without a court order to prevent potential tragedies. Trump also told lawmakers to send him “one terrific bill” on gun safety, including stronger background checks and restrictions based on one’s age and mental health.
Here are 4 more examples of pages with the same symptom:
url = http://www.latimes.com/entertainment/la-et-entertainment-news-updates-2018-a-star-is-born-lizzy-caplan-turns-36-1530013803-htmlstory.html
title = A Star Is Born: Lizzy Caplan turns 36 today
body = [ story about Drake / Pusha-T ]
url = http://www.latimes.com/entertainment/la-et-entertainment-news-updates-2018-a-star-is-born-margot-robbie-turns-28-1530307411-htmlstory.html
title = A Star Is Born: Margot Robbie turns 28 today
body = [ story about Drake / Pusha-T ]
url = https://www.latimes.com/politics/la-lb-767-45592-la-na-pol-democrat-supreme-court-20180708-htmlstory.html
title = Democrats' long-shot plan to stop Trump's Supreme Court pick
body = [ story about Secretary of State Michael R. Pompeo visit to North Korea ]
url = http://www.latimes.com/sports/lakers/la-sp-lebron-james-lakers-updates-meet-the-greatest-player-to-wear-a-1530569433-htmlstory.html
title = Meet the greatest player to wear a Lakers jersey: LeBron James
body = [ story about Lakers offer to Julius Randle ]
In all of these examples, the body extractor appears to skip the body of the first article and return the body of the second, or sometimes the third, article in the sequence, so the wrong body is found.
Related information: This problem appears mainly on LA Times, which seems to have a lot of these multi-article pages (44.9% of LA Times pages had this problem). For The Atlantic, 9.9% of pages had this problem. For NBC News, 7.4% of pages had this problem. For each of the other 10 sources in our crawl, fewer than 2% of pages had this problem.
Thank you for your consideration of this problem, and for creating and supporting Newspaper. It is extremely useful and we are very glad to be able to use it. Please let us know if you need any more information to confirm this issue, or to investigate it.