python-goose
[extractors/title.py] None value for `site_name` in line 40
Trigger
>>> from goose import Goose
>>> url = ' https://www.alienvault.com/blogs/security-essentials/11-simple-yet-important-tips-to-secure-aws'
>>> g = Goose()
>>> article = g.extract(url=url)
Traceback
Traceback (most recent call last):
article = g.extract(url=url)
File "/scripts/venv/lib/python2.7/site-packages/goose/__init__.py", line 56, in extract
return self.crawl(cc)
File "/scripts/venv/lib/python2.7/site-packages/goose/__init__.py", line 66, in crawl
article = crawler.crawl(crawl_candiate)
File "/scripts/venv/lib/python2.7/site-packages/goose/crawler.py", line 154, in crawl
self.article.title = self.title_extractor.extract()
File "/scripts/venv/lib/python2.7/site-packages/goose/extractors/title.py", line 99, in extract
return self.get_title()
File "/scripts/venv/lib/python2.7/site-packages/goose/extractors/title.py", line 78, in get_title
return self.clean_title(title)
File "/scripts/venv/lib/python2.7/site-packages/goose/extractors/title.py", line 42, in clean_title
title = title.replace(site_name, '').strip()
TypeError: expected a string or other character buffer object
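The same failure can be reproduced without goose, because `str.replace` rejects `None` as its first argument. The snippet below is a minimal sketch: the `opengraph` dict is a hypothetical stand-in for `self.article.opengraph`, and the title string is illustrative only. Under Python 3 the error message is worded differently from the Python 2.7 one above, but the exception type is the same.

```python
# Hypothetical stand-in for self.article.opengraph when site_name is None.
opengraph = {"site_name": None}
title = "11 Simple Yet Important Tips to Secure AWS"  # illustrative title

error = None
try:
    # Same call pattern as in clean_title(): the substring to remove is None.
    title.replace(opengraph["site_name"], "").strip()
except TypeError as exc:
    error = exc
    print("TypeError:", exc)
```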
Fix
Make sure to check the value of site_name after it is assigned. If it is None, don't modify the title.
    if "site_name" in self.article.opengraph.keys():
        site_name = self.article.opengraph['site_name']
        # remove the site name from title
        if site_name:
            title = title.replace(site_name, '').strip()
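To check the guard in isolation, the same logic can be written as a standalone function. This is only a sketch, not goose's actual code: `strip_site_name` and its plain-dict `opengraph` argument are hypothetical names introduced here, and the sample titles are made up.

```python
def strip_site_name(title, opengraph):
    """Remove the Open Graph site name from a title when one is present.

    Hypothetical helper: `opengraph` is a plain dict standing in for
    self.article.opengraph.
    """
    if "site_name" in opengraph:
        site_name = opengraph["site_name"]
        # Only touch the title when site_name is a non-empty value;
        # None (and '') leave the title unchanged.
        if site_name:
            title = title.replace(site_name, "").strip()
    return title


print(strip_site_name("Tips to Secure AWS Example Site", {"site_name": "Example Site"}))
print(strip_site_name("Tips to Secure AWS", {"site_name": None}))
print(strip_site_name("Tips to Secure AWS", {}))
```

The truthiness test `if site_name:` also skips an empty string, which would otherwise make `replace` a no-op anyway.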