python-goose issues

HtmlFetcher does not handle gzip compression

2

Some servers force gzip compression on their content, which HtmlFetcher does not handle gracefully because urllib2 assumes non-compressed content. The cheapest/easiest solution would be to check the encoding header on...

kqr
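A minimal sketch of the kind of check the report hints at, assuming the "encoding header" means the HTTP `Content-Encoding` response header. The helper name `decode_body` is hypothetical and is not part of goose or of HtmlFetcher; it only shows how a fetched body could be decompressed before parsing.

```python
import gzip
import io


def decode_body(raw_body, content_encoding):
    """Return the decompressed body if the server sent gzip, else the body unchanged.

    Hypothetical helper for illustration; not goose's actual API.
    """
    # Header values are case-insensitive and may be missing entirely.
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        # Wrap the raw bytes in a file-like object so GzipFile can read them.
        return gzip.GzipFile(fileobj=io.BytesIO(raw_body)).read()
    return raw_body
```

With a fetcher, the second argument would come from the response headers (for example the value of `Content-Encoding`), and the returned bytes would be handed to the HTML parser in place of the raw body.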

fixing new york times content extraction failure

1

The order of the content tags is breaking NYTimes extraction. I added a file for unit tests but didn't actually create any; too busy ;(

robmcdan

Title of project says "scrapping" but it's "scraping"

doda-zz

og:image is not parsed correctly if e.g. og:image:width exists on page

1

og:image is parsed correctly at first, but if there are more og:image attributes on the page, e.g. og:image:width, the later one replaces the image attribute.

vonholst
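To illustrate the behaviour the report asks for, here is a small sketch that reads `og:image` by exact property name, so that structured properties such as `og:image:width` cannot overwrite it. It uses only the standard library; `OgImageParser` and `find_og_image` are hypothetical names, and the actual cause of the bug inside goose is not confirmed here.

```python
from html.parser import HTMLParser


class OgImageParser(HTMLParser):
    """Collect the first <meta property="og:image"> value (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # Exact comparison: "og:image:width" and "og:image:height" do not match.
        if attr_map.get("property") == "og:image" and self.og_image is None:
            self.og_image = attr_map.get("content")


def find_og_image(html):
    """Return the og:image URL found in the HTML string, or None."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image
```

Keeping only the first exact match is a design choice made for this sketch; a page with several `og:image` tags might call for a list instead.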

h1,h2...h6 not returned

1

When I extract articles from any page, I notice that it doesn't return any heading tags (h1, h2 ... h6) in cleaned_text. Is that normal for everyone or have I missed...

tamimibrahim

Fallback to 'http' as default url schema if needed

I have noticed that goose does not play well with URLs that lack a proper scheme, so I created this fix. I'm not sure if this check/fix is necessary at the library level,...

rastasheep
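The pull request's own diff is not shown here, so the following is only a sketch of what such a fallback could look like, using `urllib.parse` from the standard library. The function name `with_default_scheme` and its `default` parameter are assumptions made for illustration.

```python
from urllib.parse import urlparse


def with_default_scheme(url, default="http"):
    """Prefix the URL with a default scheme when it has none (illustrative only).

    Note: scheme-less URLs with an explicit port (host:port/path) are not
    handled here, because the part before the colon can be read as a scheme.
    """
    url = url.strip()
    # Protocol-relative URLs such as //example.com/a only need the scheme name.
    if url.startswith("//"):
        return default + ":" + url
    if not urlparse(url).scheme:
        return default + "://" + url
    return url
```

A caller would normalise the input once, before handing it to the extractor, so that the rest of the pipeline can assume an absolute URL.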

Added Serbian stopwords

rastasheep

Goose is not working on extracting data from the Kissmetrics blog, which has some meta tags present.

1

I am trying to extract content from http://feedproxy.google.com/~r/KISSmetrics/~3/cmb43Q4Mzak/ which gets redirected to https://blog.kissmetrics.com/optimize-your-social-media-ad-spend-with-advanced-targeting-options/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+KISSmetrics+%28KISSmetrics+Marketing+Blog%29 and I am getting the error below.
  File "D:\env\lib\site-packages\goose\__init__.py", line 56, in extract
    return self.crawl(cc)
  File "D:\env\lib\site-packages\goose\__init__.py", line...

jijoy

fix(stopwords-id.txt): changed to Lucene stopwords

Hi, I replaced the previous stopwords with Lucene's Indonesian stopwords, because the previous list contained too many words that aren't stop words (e.g. not [function words](http://en.wikipedia.org/wiki/Function_word)). It was more an "Indonesian word...

luthfianto

Non-obvious failure grabbing top_image

2

I'm running goose on a few URLs and it returns an empty string on this one in particular: http://shop.nordstrom.com/s/lancer-skincare-sheer-fluid-sun-shield-spf-30/3565107 I guess the thing that confuses me here is that it...

Slater-Victoroff
