Blogger / Blogspot issue

Open ontopicprojects opened this issue 3 years ago • 2 comments

Some Blogspot / Blogger sites don't seem to parse. Here is an example:

`from newspaper import Article

url = 'http://www.righto.com/2011/07/cells-are-very-fast-and-crowded-places.html'

article = Article(url)
article.download()
article.parse()
print(article.text)`

This prints an empty string (`""`).

ontopicprojects avatar Oct 05 '20 15:10 ontopicprojects

The primary reason you cannot extract text from this site with Newspaper is that the tags this module commonly queries do not exist on http://www.righto.com. You should use the Python libraries requests and BeautifulSoup to extract the items that you want.
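A minimal sketch of the requests and BeautifulSoup route, for illustration only. It assumes the post text sits in a `div` whose class includes `post-body` (a common Blogger template convention, not verified here for every blog). The helper names `extract_post_text` and `fetch_post_text` are hypothetical.

```python
import requests
from bs4 import BeautifulSoup


def extract_post_text(html: str, css_class: str = "post-body") -> str:
    """Return the readable text of the first <div> carrying `css_class`, or ""."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any element whose class list includes css_class.
    node = soup.find("div", class_=css_class)
    if node is None:
        return ""
    # Remove script/style elements so only readable text remains.
    for tag in node.find_all(["script", "style"]):
        tag.decompose()
    return node.get_text(separator="\n", strip=True)


def fetch_post_text(url: str, timeout: float = 10.0) -> str:
    """Download `url` and extract the post body (assumed Blogger layout)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return extract_post_text(response.text)
```

With network access, something like `print(fetch_post_text('http://www.righto.com/2011/07/cells-are-very-fast-and-crowded-places.html')[:500])` should show the start of the post, provided the page really uses that class. Keeping the parsing separate from the download makes the selector easy to adjust and to check against saved HTML.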

johnbumgarner avatar Oct 08 '20 03:10 johnbumgarner

You can achieve this with some minor changes.

  1. The cleaners class must be fixed: it removes elements whose itemprop contains articleBody unless the value is exactly "articleBody". See https://github.com/codelucas/newspaper/pull/953
  2. Extend the extractor class and replace it in your article :+1:
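from newspaper.extractors import ContentExtractor

class myExtractor(ContentExtractor):
    def nodes_to_check(self, doc):
        # Blogger pages identify themselves through a "generator" meta tag.
        generator = self.parser.getElementsByTag(doc, tag='meta', attr={'name':'generator'})
        for t in generator:
            if t.attrib.get('content') == 'blogger':
                # On Blogger, use the post-body div as the candidate node.
                nodes = self.parser.getElementsByTag(doc, tag='div', attr={'class':'post-body'})
                return nodes
        # Not a Blogger page: fall back to the default candidate nodes.
        return super().nodes_to_check(doc)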

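and then attach it to the article before downloading:

from newspaper import Article

a = Article('http://www.righto.com/2011/07/cells-are-very-fast-and-crowded-places.html')
a.extractor = myExtractor(a.config)  # swap in the custom extractor
a.download()
a.parse()
print(a.text)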

AndyTheFactory avatar Oct 04 '22 16:10 AndyTheFactory