Fix xpath selector for extracting feeds

Open sirpengi opened this issue 4 years ago • 1 comment

For https://github.com/codelucas/newspaper/issues/731

Also added a test case based on a mock already included in the test data fixtures.
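
For readers unfamiliar with feed discovery, here is a minimal sketch of the kind of XPath lookup involved. It is not the selector changed in this PR and not newspaper's own code. It assumes feeds are advertised through `<link type="application/rss+xml">` elements, and it uses the standard-library ElementTree parser on a well-formed sample page. `SAMPLE_PAGE` and `find_feed_urls` are hypothetical names made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption for this sketch).
# Real-world HTML is often not valid XML and would need a lenient HTML parser.
SAMPLE_PAGE = """<html>
  <head>
    <link rel="stylesheet" type="text/css" href="/style.css" />
    <link rel="alternate" type="application/rss+xml" href="https://example.com/feed/" />
    <link rel="alternate" type="application/rss+xml" href="https://example.com/comments/feed/" />
  </head>
  <body><p>Hello</p></body>
</html>"""


def find_feed_urls(markup):
    """Return the href of every <link> element declared as an RSS feed."""
    root = ET.fromstring(markup)
    # ElementTree supports a limited XPath subset, including attribute predicates.
    links = root.findall(".//link[@type='application/rss+xml']")
    # Skip any matching element that has no href attribute.
    return [link.get("href") for link in links if link.get("href")]


if __name__ == "__main__":
    print(find_feed_urls(SAMPLE_PAGE))
```

The stylesheet `<link>` is skipped because the predicate matches on the `type` attribute only. A selector that is too loose or too strict at this step is the sort of thing that makes feeds go missing or show up incorrectly.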

Adapting the test code from the issue I created:

import newspaper


def debug_source(url):
    # Build the source without fetching images, then list the feeds it found.
    source = newspaper.build(
        url,
        download_images=False,
        fetch_images=False,
    )
    print(url)
    print('feeds:', [f.url for f in source.feeds])


if __name__ == '__main__':
    debug_source('https://www.npr.org/')
    debug_source('https://techcrunch.com')
    debug_source('https://vox.com')

The output is now:

$ python test_np.py 
https://www.npr.org/
feeds: ['https://www.npr.org/rss/rss.php?id=718730324', 'https://www.npr.org/rss/rss.php?id=1002', 'https://www.npr.org/rss/rss.php?id=688409791', 'https://www.npr.org/rss/rss.php?id=690263240', 'https://www.npr.org/rss/rss.php?id=1032', 'https://www.npr.org/rss/rss.php?id=1001', 'https://www.npr.org/rss/rss.php?id=35', 'https://www.npr.org/rss/rss.php?id=750001', 'https://www.npr.org/rss/rss.php?id=1008', 'https://www.npr.org/rss/rss.php?id=3', 'https://www.npr.org/rss/rss.php?id=376751684', 'https://www.npr.org/rss/rss.php?id=1039', 'https://www.npr.org/rss/rss.php?id=2', 'https://www.npr.org/rss/rss.php?id=750005', 'https://www.npr.org/rss/rss.php?id=750002']
https://techcrunch.com
feeds: ['https://techcrunch.com/comments/feed/', 'https://techcrunch.com/feed/']
https://vox.com
feeds: ['https://www.vox.com/rss/videos/index.xml', 'https://www.vox.com/rss/technology/index.xml', 'https://vox.com/rss/index.xml', 'https://www.vox.com/rss/recode/index.xml', 'https://www.vox.com/rss/identities/index.xml', 'https://www.vox.com/rss/culture/index.xml', 'https://www.vox.com/rss/first-person/index.xml', 'https://www.vox.com/rss/front-page/index.xml', 'https://www.vox.com/rss/world/index.xml', 'https://www.vox.com/rss/explainers/index.xml']

sirpengi, Aug 23 '19 06:08