newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:

Results 152 newspaper issues

Is this project still maintained? I see a lot of Pull Requests and the last commit to the code was Sep 2, 2020

The publication date of [this article](https://bmcwomenshealth.biomedcentral.com/articles/10.1186/s12905-022-02136-8) is reported as 2023-12-10. This is impossible, as the article was downloaded on 2023-03-10. The article lists its own publication date as 2023-01-02. Further...
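The inconsistency described above can be caught with a simple sanity check: a publication date later than the download date cannot be right. The sketch below is only an illustration; `is_plausible_publish_date` is a hypothetical helper, not part of newspaper3k, and the dates are the ones quoted in the issue.

```python
from datetime import date


def is_plausible_publish_date(publish_date: date, downloaded_on: date) -> bool:
    """Return False when the extracted date lies after the download date.

    Hypothetical helper for illustration; not a newspaper3k function.
    """
    return publish_date <= downloaded_on


downloaded_on = date(2023, 3, 10)

# Date reported by the extractor: lies in the future relative to the download.
print(is_plausible_publish_date(date(2023, 12, 10), downloaded_on))  # False

# Date the article itself states.
print(is_plausible_publish_date(date(2023, 1, 2), downloaded_on))  # True
```

A check like this does not say which date is correct, only that the extracted one should be rejected or re-parsed.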

Sorry for the unrelated commits, but I only found out how to create a PR, not how to create one for a specific commit or set of commits :(.

Hello, seven years ago this was posted: https://github.com/codelucas/newspaper/issues/245 I have a problem that requires me to scrape a large corpus of titles from 2013-2019 from various news sources. Ideally I...

I'd like to bring to your attention that we are [discussing](https://bugs.launchpad.net/lxml/+bug/1958539) the possibility of removing the clean_html functionality from the lxml library. Over the past years, there have been several concerning...

I have extracted some meta tags. You can try to identify the title, text, description and date by substituting the provided tags into: meta[property='{}'] meta[name='{}'] meta[itemprop='{}'] Meta tags for publication and...
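As a rough illustration of the three selector templates above, the sketch below collects `<meta>` tags keyed by their `property`, `name`, or `itemprop` attribute, using only the standard library. The class `MetaCollector`, the function `first_meta`, the sample HTML, and the tag names (`og:title`, `description`, `datePublished`, `article:published_time`) are assumptions chosen for the example; the tag list from the issue itself is truncated and is not reproduced here.

```python
from html.parser import HTMLParser

# Attribute names used by the three selector templates quoted above.
SELECTOR_ATTRS = ("property", "name", "itemprop")


class MetaCollector(HTMLParser):
    """Collect the content of <meta> tags, keyed by property/name/itemprop."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        content = attr_map.get("content")
        if content is None:
            return
        for key_attr in SELECTOR_ATTRS:
            key = attr_map.get(key_attr)
            if key:
                # Keep the first occurrence of each key.
                self.found.setdefault(key, content)


def first_meta(html, candidates):
    """Return the content of the first candidate tag present, else None."""
    collector = MetaCollector()
    collector.feed(html)
    for candidate in candidates:
        if candidate in collector.found:
            return collector.found[candidate]
    return None


# Made-up sample page for the example.
SAMPLE = """<html><head>
<meta property="og:title" content="Example headline">
<meta name="description" content="Short summary.">
<meta itemprop="datePublished" content="2023-01-02">
</head></html>"""

print(first_meta(SAMPLE, ["article:published_time", "datePublished"]))
```

Candidates are tried in order, so more specific tags can be listed first and generic ones used as fallbacks.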

For https://github.com/codelucas/newspaper/issues/731 Also added a test case based on a mock already included in the test data fixtures. Adapting the test code from the issue I created: ``` import newspaper...

https://gnews.org/articles/1068907 I used `article.text` for this page and got no text. `newspaper.build` for gnews does not work either. ```python import newspaper gnews = newspaper.build('https://gnews.org/', language='zh') article = gnews.articles[0] article.download() article.parse()...

Hi all. I was using newspaper3k and it was working fine, but today it stopped working and returns empty text. Does anyone have any ideas?

The following neither times out nor returns anything. ``` url = "http://http-live.sr.se/srextra01-mp3-192" article = newspaper.Article(url, request_timeout=5) article.download() ``` Same with: ``` from newspaper.network import get_html_2XX_only article.config.__dict__ {'MIN_WORD_COUNT': 300, 'MIN_SENT_COUNT': 7, 'MAX_TITLE':...
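One possible explanation, offered as an assumption rather than a diagnosis: the URL looks like a live MP3 stream, and a per-read socket timeout never fires while data keeps arriving, so a download of an endless body never finishes. The sketch below shows a generic guard that stops reading once a size or time budget is exceeded. `read_bounded` is a hypothetical helper, not part of newspaper3k, and the endless stream is simulated with `itertools.repeat` so the example runs without network access.

```python
import itertools
import time


def read_bounded(chunks, max_bytes, max_seconds, clock=time.monotonic):
    """Collect chunks until exhausted, or until a size or time budget is hit.

    Returns (data, completed). The time budget is only checked when a chunk
    arrives, so a per-read timeout on the underlying socket is still needed.
    Hypothetical helper for illustration only.
    """
    deadline = clock() + max_seconds
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > max_bytes or clock() > deadline:
            return bytes(buf[:max_bytes]), False
    return bytes(buf), True


# Simulated endless stream: 1 KiB chunks that never stop.
endless = itertools.repeat(b"\x00" * 1024)
data, completed = read_bounded(endless, max_bytes=10_000, max_seconds=5)
print(len(data), completed)  # 10000 False
```

With the `requests` library, the same helper could be fed from `response.iter_content(chunk_size=8192)` on a response opened with `stream=True`; whether newspaper3k exposes a hook for this is not established here.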