newspaper
Fix xpath selector for extracting feeds
For https://github.com/codelucas/newspaper/issues/731
Also added a test case based on a mock already included in the test data fixtures.
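For readers who have not looked at the selector, here is a minimal sketch of picking RSS feed links out of a page head with an XPath expression. It is not the code changed in this PR. newspaper parses with lxml, while this sketch uses the standard library's ElementTree (which supports only a subset of XPath) so it stays dependency-free. The sample markup, the `extract_feed_urls` helper, and the `example.com` URLs are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed sample markup. Real pages are rarely valid XML
# and need a lenient HTML parser instead of ElementTree.
SAMPLE_HEAD = """
<html>
  <head>
    <link rel="stylesheet" type="text/css" href="/main.css" />
    <link rel="alternate" type="application/rss+xml" href="/feed/" />
    <link rel="alternate" type="application/rss+xml" href="https://example.com/comments/feed/" />
  </head>
</html>
"""


def extract_feed_urls(markup, base_url):
    """Return absolute URLs of <link> elements whose type is RSS."""
    root = ET.fromstring(markup.strip())
    # Attribute predicate: match only links declared as RSS feeds.
    links = root.findall(".//link[@type='application/rss+xml']")
    # Resolve relative hrefs against the page URL; skip links without href.
    return [urljoin(base_url, link.get("href")) for link in links if link.get("href")]


if __name__ == "__main__":
    print(extract_feed_urls(SAMPLE_HEAD, "https://example.com"))
```

The predicate matches on the `type` attribute alone, so the stylesheet link is ignored and relative feed paths are resolved against the page URL.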
Adapting the test code from the issue I created:
import newspaper


def debug_source(url):
    source = newspaper.build(
        url,
        download_images=False,
        fetch_images=False,
    )
    print(url)
    print('feeds:', [f.url for f in source.feeds])


if __name__ == '__main__':
    debug_source('https://www.npr.org/')
    debug_source('https://techcrunch.com')
    debug_source('https://vox.com')
With this fix, the output is:
$ python test_np.py
https://www.npr.org/
feeds: ['https://www.npr.org/rss/rss.php?id=718730324', 'https://www.npr.org/rss/rss.php?id=1002', 'https://www.npr.org/rss/rss.php?id=688409791', 'https://www.npr.org/rss/rss.php?id=690263240', 'https://www.npr.org/rss/rss.php?id=1032', 'https://www.npr.org/rss/rss.php?id=1001', 'https://www.npr.org/rss/rss.php?id=35', 'https://www.npr.org/rss/rss.php?id=750001', 'https://www.npr.org/rss/rss.php?id=1008', 'https://www.npr.org/rss/rss.php?id=3', 'https://www.npr.org/rss/rss.php?id=376751684', 'https://www.npr.org/rss/rss.php?id=1039', 'https://www.npr.org/rss/rss.php?id=2', 'https://www.npr.org/rss/rss.php?id=750005', 'https://www.npr.org/rss/rss.php?id=750002']
https://techcrunch.com
feeds: ['https://techcrunch.com/comments/feed/', 'https://techcrunch.com/feed/']
https://vox.com
feeds: ['https://www.vox.com/rss/videos/index.xml', 'https://www.vox.com/rss/technology/index.xml', 'https://vox.com/rss/index.xml', 'https://www.vox.com/rss/recode/index.xml', 'https://www.vox.com/rss/identities/index.xml', 'https://www.vox.com/rss/culture/index.xml', 'https://www.vox.com/rss/first-person/index.xml', 'https://www.vox.com/rss/front-page/index.xml', 'https://www.vox.com/rss/world/index.xml', 'https://www.vox.com/rss/explainers/index.xml']
I'm trying to get through https://theverge.com. Is there any update on this PR?