python-goose

HTML Content / Article Extractor, web scraping library in Python

Results 100 python-goose issues

Some servers force gzip compression on their content, which HtmlFetcher does not handle gracefully because urllib2 assumes non-compressed content. The cheapest/easiest solution would be to check the encoding header on...
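The header check suggested in this issue can be sketched roughly as follows. This is a minimal illustration using only the Python 3 standard library, not goose's actual HtmlFetcher code; the helper name `decode_body` and its parameters are hypothetical.

```python
import gzip
import io


def decode_body(raw, content_encoding):
    """Return the response body, gunzipping it when the header says so.

    Hypothetical helper: `raw` is the bytes read from the response and
    `content_encoding` is the Content-Encoding header value (or None).
    """
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        # Wrap the bytes in a file-like object so GzipFile can read them.
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    # No (or unknown) encoding declared: pass the body through unchanged.
    return raw


# Simulate a server that compresses its response regardless of the request.
compressed = gzip.compress(b"<html><body>hello</body></html>")
html = decode_body(compressed, "gzip")
```

In a real fetcher the header value would come from the response object; here it is passed in explicitly so the sketch runs without network access.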

The order of the content tags is breaking NYTimes extraction. I added a file for unittests but didn't actually create any, too busy ;(

og:image is parsed correctly at first, but if there are more og:image attributes, e.g. og:image:width, the later attribute replaces the image attribute.
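One way to avoid this kind of overwrite is to key each Open Graph value by its exact property name, so that og:image:width can never land in the og:image slot. The sketch below uses the standard library's `html.parser` and is only an illustration of that idea, not goose's extractor; `OpenGraphParser` and `parse_open_graph` are hypothetical names.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> values keyed by the exact property."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property")
        content = attrs.get("content")
        if prop and prop.startswith("og:") and content is not None:
            # Exact-name key, first value wins: "og:image:width" is stored
            # under its own key and cannot replace "og:image".
            self.properties.setdefault(prop, content)


def parse_open_graph(html):
    """Hypothetical helper: return a dict of Open Graph properties."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


sample = (
    '<head>'
    '<meta property="og:image" content="http://example.com/a.png">'
    '<meta property="og:image:width" content="300">'
    '</head>'
)
og = parse_open_graph(sample)
```

With this layout, `og["og:image"]` stays the image URL and the width is available separately under `og["og:image:width"]`.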

When I extracted articles from any page, I noticed that it doesn't return any heading tag values (h1, h2 ... h6) in cleaned_text. Is that normal for everyone, or have I missed...

I have noticed that goose does not play well with URLs without a proper scheme, so I created this fix. I'm not sure if this check/fix is necessary at the library level,...
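The fix itself is not shown in this preview. As a rough illustration of what such a check might look like, the sketch below prepends a default scheme when one is missing. It assumes the input is meant to be a web URL, and `ensure_scheme` is a hypothetical name, not a goose function.

```python
from urllib.parse import urlsplit


def ensure_scheme(url, default_scheme="http"):
    """Hypothetical helper: return `url` with a scheme, adding one if absent."""
    url = url.strip()
    parts = urlsplit(url)
    if parts.scheme and parts.netloc:
        # Already of the form scheme://host/..., leave it untouched.
        return url
    if url.startswith("//"):
        # Protocol-relative URL: only the scheme itself is missing.
        return default_scheme + ":" + url
    # Bare "host/path" or "host:port/path": assume the default scheme.
    return default_scheme + "://" + url


fixed = ensure_scheme("example.com/article")
```

Whether this belongs in the library or in the calling code is the open question the issue raises; the sketch only shows that the check is cheap.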

I am trying to extract content from http://feedproxy.google.com/~r/KISSmetrics/~3/cmb43Q4Mzak/ which gets redirected to https://blog.kissmetrics.com/optimize-your-social-media-ad-spend-with-advanced-targeting-options/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+KISSmetrics+%28KISSmetrics+Marketing+Blog%29 and I am getting the error below.

    File "D:\env\lib\site-packages\goose\__init__.py", line 56, in extract
      return self.crawl(cc)
    File "D:\env\lib\site-packages\goose\__init__.py", line...

Hi, I replaced the previous stopwords with Lucene's Indonesian stopwords, because the previous list contained too many words that aren't stop words (e.g. words that are not [function words](http://en.wikipedia.org/wiki/Function_word)). It was more an "Indonesian word...

I'm running goose on a few URLs and it returns an empty string on this one in particular: http://shop.nordstrom.com/s/lancer-skincare-sheer-fluid-sun-shield-spf-30/3565107 I guess the thing that confuses me here is that it...