[extractors/title.py] None value for `site_name` in line 40

Open kbandla opened this issue 7 years ago • 0 comments

Trigger

>>> from goose import Goose
>>> url = 'https://www.alienvault.com/blogs/security-essentials/11-simple-yet-important-tips-to-secure-aws'
>>> g = Goose()
>>> article = g.extract(url=url)

Traceback

Traceback (most recent call last):
    article = g.extract(url=url)
  File "/scripts/venv/lib/python2.7/site-packages/goose/__init__.py", line 56, in extract
    return self.crawl(cc)
  File "/scripts/venv/lib/python2.7/site-packages/goose/__init__.py", line 66, in crawl
    article = crawler.crawl(crawl_candiate)
  File "/scripts/venv/lib/python2.7/site-packages/goose/crawler.py", line 154, in crawl
    self.article.title = self.title_extractor.extract()
  File "/scripts/venv/lib/python2.7/site-packages/goose/extractors/title.py", line 99, in extract
    return self.get_title()
  File "/scripts/venv/lib/python2.7/site-packages/goose/extractors/title.py", line 78, in get_title
    return self.clean_title(title)
  File "/scripts/venv/lib/python2.7/site-packages/goose/extractors/title.py", line 42, in clean_title
    title = title.replace(site_name, '').strip()
TypeError: expected a string or other character buffer object
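
The failing call can be reproduced without fetching the page. This is a minimal sketch that assumes the page's Open Graph data contained a `site_name` key whose value was None, which is what the traceback suggests; the dictionary and title below are made-up stand-ins, not values taken from the article.

```python
# Stand-in for self.article.opengraph; the None value is an assumption
# inferred from the traceback, not captured from the real page.
opengraph = {"site_name": None}

# Placeholder title, not the real article title.
title = "Example Post | Example Site"

try:
    # Same call as in clean_title: str.replace rejects None as the search string.
    title.replace(opengraph["site_name"], "").strip()
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__, error)
```

The exception message differs between Python 2.7 and Python 3, but in both cases passing None to `str.replace` raises a TypeError.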

Fix

Check the value of site_name after it is read from the Open Graph data. If it is None (or empty), don't try to strip it from the title.

        if "site_name" in self.article.opengraph.keys():
            site_name = self.article.opengraph['site_name']
            # remove the site name from title
            if site_name:
                title = title.replace(site_name, '').strip()
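
The guard can be exercised outside goose with a small standalone helper. This is only a sketch: `strip_site_name` is a hypothetical function written for illustration, not part of goose, and it takes a plain dict in place of `self.article.opengraph`.

```python
def strip_site_name(title, opengraph):
    """Hypothetical standalone version of the guarded branch in clean_title."""
    # .get() returns None both when the key is missing and when its value is None.
    site_name = opengraph.get("site_name")
    if site_name:
        # Only touch the title when there is a non-empty site name to remove.
        title = title.replace(site_name, "").strip()
    return title


print(strip_site_name("Some Post Example Site", {"site_name": "Example Site"}))
print(strip_site_name("Some Post", {"site_name": None}))
print(strip_site_name("Some Post", {}))
```

Using a truthiness check rather than `is not None` also skips an empty string, where calling `replace` would be a no-op anyway.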

kbandla, Mar 11 '17 18:03