python-goose

Problems Parsing Titles

Open grantdelozier opened this issue 8 years ago • 1 comments

Title extraction raises an IndexError on certain websites, even though they have titles.

  File "/usr/local/lib/python2.7/site-packages/ContentAnalysis-0.1.1-py2.7.egg/ContentAnalysis/document.py", line 53, in parse
    ginfo = g.extract(url=self.link)
  File "/usr/local/lib/python2.7/site-packages/goose/__init__.py", line 56, in extract
    return self.crawl(cc)
  File "/usr/local/lib/python2.7/site-packages/goose/__init__.py", line 66, in crawl
    article = crawler.crawl(crawl_candiate)
  File "/usr/local/lib/python2.7/site-packages/goose/crawler.py", line 154, in crawl
    self.article.title = self.title_extractor.extract()
  File "/usr/local/lib/python2.7/site-packages/goose/extractors/title.py", line 99, in extract
    return self.get_title()
  File "/usr/local/lib/python2.7/site-packages/goose/extractors/title.py", line 78, in get_title
    return self.clean_title(title)
  File "/usr/local/lib/python2.7/site-packages/goose/extractors/title.py", line 56, in clean_title
    if title_words[0] in TITLE_SPLITTERS:
IndexError: list index out of range

You can reproduce this by running goose's extract on a site like http://daydreamingfoodie.com/

grantdelozier, Oct 03 '16 19:10

On this site and plenty of others, the issue occurs when the page title is identical to the OpenGraph site name.
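To illustrate the failure mode, here is a minimal standalone sketch. It is not the code from goose or from the fork's commit. It assumes the cleaner strips the OpenGraph site name from the title before splitting it into words, which would leave an empty word list when the two are equal and make `title_words[0]` raise the IndexError in the traceback. The function name `clean_title_sketch`, its parameters, and the contents of `TITLE_SPLITTERS` are hypothetical.

```python
# Hypothetical stand-in for the splitter list named in the traceback;
# the real contents in goose may differ.
TITLE_SPLITTERS = ["|", "-", ":"]


def clean_title_sketch(title, site_name=None):
    """Strip the site name and stray splitters from a page title.

    Illustrative only: assumes the site name is removed before the
    title is split into words.
    """
    if site_name:
        # If title == site_name, this leaves an empty string.
        title = title.replace(site_name, "").strip()

    title_words = title.split()

    # Guard: without this check, title_words[0] raises IndexError
    # whenever the title was reduced to nothing.
    if not title_words:
        return ""

    # Drop a leading splitter left behind by the site-name removal.
    if title_words[0] in TITLE_SPLITTERS:
        title_words.pop(0)

    # Drop a trailing splitter, re-checking that words remain.
    if title_words and title_words[-1] in TITLE_SPLITTERS:
        title_words.pop()

    return " ".join(title_words)


print(repr(clean_title_sketch("Daydreaming Foodie", "Daydreaming Foodie")))
print(repr(clean_title_sketch("Recipe | My Site", "My Site")))
```

Returning an empty string is one possible choice; falling back to the original, uncleaned title when cleaning removes everything would be another.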

Fixed the issue in this commit of my fork

grantdelozier, Oct 03 '16 20:10