Article includes irrelevant subscription and notification information

Open AndyTheFactory opened this issue 2 years ago • 1 comment

Issue by leesamu Tue Jul 24 21:57:52 2018 Originally opened as https://github.com/codelucas/newspaper/issues/600

Newspaper has been working better for me than any other substitute I can find. I'm using it in a project to scrape stories from a large variety of news domains. I've found that for some domains the extracted text includes unrelated page content, in one particular case a block about email notifications. I think this is a more general problem: extra text is sometimes collected, and important text is sometimes missed, when parsing into Article.text. Below are the link to an article from "The Monitor" and the cleaned text that newspaper returns for it:

https://www.themonitor.com/news/business/article_2dc1eb58-70fe-11e8-ba6b-7fdacdd2acb6.html

Close Get email notifications on ********* daily! Your notification has been saved. There was a problem saving your notification. Whenever ********** posts new content, you'll get an email delivered to your inbox with a link. Email notifications are only sent once a day, and only if there are new matching items.

There is an actual story on the page, but newspaper doesn't capture any of it. Is there a good way to get only the article text from this domain, and from other domains with similar problems? I'd especially like to keep the email notification text out of Article.text if possible.

Thank you if you're able to help. Newspaper has been an incredibly useful tool so far.

Edit:

I've also found the same thing happening with the NYTimes, which is a much bigger problem. The article is downloaded and parsed correctly for the most part, but article.text includes a big paragraph about signing up for the NYTimes newsletter, which I've posted below:

Newsletter Sign Up Continue reading the main story Please verify you're not a robot by clicking the box. Invalid email address. Please re-enter. You must select a newsletter to subscribe to. Sign Up You will receive emails containing news content , updates and promotions from The New York Times. You may opt-out at any time. You agree to receive occasional updates and special offers for The New York Times's products and services. Thank you for subscribing. An error has occurred. Please try again later. View all New York Times newsletters.
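One possible workaround, sketched here rather than taken from newspaper itself, is to post-process the extracted text and drop any paragraph that contains a known boilerplate phrase. The function name `strip_boilerplate` and the marker list are hypothetical; the phrases are copied from the two outputs quoted above, and the sketch assumes paragraphs in the extracted text are separated by blank lines.

```python
import re

# Hypothetical marker phrases, copied from the two outputs quoted in this issue.
# A real project would probably maintain such a list per domain.
BOILERPLATE_MARKERS = (
    "get email notifications",
    "your notification has been saved",
    "newsletter sign up",
    "please verify you're not a robot",
    "thank you for subscribing",
)


def strip_boilerplate(text, markers=BOILERPLATE_MARKERS):
    """Drop every paragraph that contains one of the marker phrases."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text)
    kept = []
    for paragraph in paragraphs:
        lowered = paragraph.lower()
        if any(marker in lowered for marker in markers):
            continue  # looks like sign-up or notification boilerplate
        if paragraph.strip():
            kept.append(paragraph.strip())
    return "\n\n".join(kept)
```

After `article.parse()`, this would be called as `strip_boilerplate(article.text)`. It only removes text that was wrongly included. It cannot recover a story that was missed entirely, as in the Monitor example.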

Oct 24 '23 12:10 AndyTheFactory

Comment by ekingery Mon Oct 15 21:36:56 2018

I am in a similar situation, in that I've found various minor issues with the way article content is parsed and extracted. Even though a bunch of relatively minor tweaks would be helpful (for example, using the <article> tag for this article, which currently fails to parse properly), it's not clear to me that this library is actively maintained and would adopt them.

The approach I've taken is to use both newspaper and goose3. My code then compares the results in an attempt to choose the result that looks most correct. We default to newspaper, and I've seen plenty of cases where newspaper is correct and goose3 is not. YMMV. Hope that helps!
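A minimal sketch of that kind of comparison is below. It is not the actual code from my project: the scoring heuristic, the `margin` parameter, and the function names are assumptions chosen for illustration. The library imports sit inside `extract_both` so that the comparison logic can be tried without either package installed.

```python
def score(text):
    """Crude quality score: total words in lines long enough to look like prose."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    # Assumption: lines shorter than 8 words are likely captions, buttons, or menus.
    return sum(len(line.split()) for line in lines if len(line.split()) >= 8)


def choose_text(newspaper_text, goose_text, margin=1.5):
    """Default to the newspaper result unless goose3 scores clearly higher."""
    primary = score(newspaper_text)
    fallback = score(goose_text)
    if fallback > primary * margin:
        return goose_text
    return newspaper_text


def extract_both(url):
    """Extract one URL with both libraries (needs newspaper3k, goose3, and network access)."""
    from newspaper import Article
    from goose3 import Goose

    article = Article(url)
    article.download()
    article.parse()

    goose = Goose()
    try:
        goose_text = goose.extract(url=url).cleaned_text
    finally:
        goose.close()

    return article.text, goose_text
```

Typical use would be `text = choose_text(*extract_both(url))`. A word-count score like this favors the longer extraction, so it helps when one library misses the story but not when one library adds boilerplate.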
