python-goose issues

Do we need to download images to file?

24

Aren't we only looking for width, height, and number of bytes? A HEAD request gives you the Content-Length, and something like https://gist.github.com/atlithorn/6155288 for width and height could remove the need for...

atlithorn
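
The idea in this issue, reading dimensions from the first bytes of an image instead of saving the whole file, can be illustrated roughly as follows. This is a minimal sketch and not the code from the linked gist: the function name `image_dimensions` is hypothetical, only PNG and GIF headers are handled, and fetching the bytes over the network is left out.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def image_dimensions(header):
    """Return (width, height) parsed from the leading bytes of a PNG or GIF, else None.

    Hypothetical helper for illustration; it is not part of python-goose.
    """
    # PNG: 8-byte signature, 4-byte chunk length, b"IHDR",
    # then big-endian 32-bit width and height.
    if len(header) >= 24 and header[:8] == PNG_SIGNATURE and header[12:16] == b"IHDR":
        return struct.unpack(">II", header[16:24])
    # GIF: 6-byte signature, then little-endian 16-bit width and height.
    if len(header) >= 10 and header[:6] in (b"GIF87a", b"GIF89a"):
        return struct.unpack("<HH", header[6:10])
    return None


# Build a fake PNG header in memory to show the call; no download involved.
fake_png = PNG_SIGNATURE + struct.pack(">I4sII", 13, b"IHDR", 640, 480)
print(image_dimensions(fake_png))
```

In practice the header bytes might come from a partial download, and the byte count from the Content-Length of a HEAD response when the server sends one. JPEG is omitted here because its dimensions sit behind a variable number of marker segments that have to be scanned.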

Clarification on how raw_html gets extracted

Hi! I just wanted to get some clarification on what to expect from `Goose().extract(raw_html=...)`? I tried it with `u"Hello World"` and `cleaned_text` was empty. So I crafted the example below:...

konradkonrad

Indonesian stopwords file contains too many other words than stopwords

I'm aware that there is no single universal list of stopwords. But the current [stopwords-id.txt](https://github.com/grangier/python-goose/blob/develop/goose%2Fresources%2Ftext%2Fstopwords-id.txt) (currently 1309 sloc) contains too many words that aren't stop words (e.g., words that are not [function words](http://en.wikipedia.org/wiki/Function_word))....

luthfianto

Not getting any extracted text

1

Tried the following, but only got the title and no text:

>>> from goose import Goose
>>> url = 'http://householdproducts.nlm.nih.gov/cgi-bin/household/list?tbl=TblBrands&alpha=0'
>>> g = Goose()...

peterswang

Maximum recursion depth exceeded

The recursive call at https://github.com/grangier/python-goose/blob/develop/goose/__init__.py#L69 raises the exception `RuntimeError: maximum recursion depth exceeded in __instancecheck__`. Python's default recursion limit is 1000, and this call can exceed it.

slitayem
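
To illustrate the failure mode, here is a minimal sketch that assumes the recursion is a retry-on-error pattern; the issue text does not say so, and the names `crawl_recursive`, `crawl_bounded`, and `always_fails` are hypothetical rather than goose's actual code. It contrasts a retry that calls itself with a bounded loop that keeps the stack depth constant.

```python
def crawl_recursive(fetch):
    """Retry by calling itself; every failure adds another stack frame."""
    try:
        return fetch()
    except ValueError:
        return crawl_recursive(fetch)


def crawl_bounded(fetch, max_attempts=3):
    """Retry in a loop; stack depth stays constant and attempts are capped."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return fetch()
        except ValueError as exc:
            last_error = exc
    # All attempts failed: surface the last error instead of recursing.
    raise last_error


def always_fails():
    # Stand-in for a parse step that never succeeds (hypothetical).
    raise ValueError("could not parse")


try:
    crawl_recursive(always_fails)
except RuntimeError as exc:
    # On Python 3 this is a RecursionError, a subclass of RuntimeError.
    print(type(exc).__name__)
```

With an input that always fails, the recursive version only stops when the interpreter's recursion limit is hit, while the bounded version gives up after `max_attempts` tries and re-raises the original error.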

Spelling Error in documentation

On line 35, in the documentation found at https://github.com/grangier/python-goose/blob/develop/goose/extractors/title.py, it's supposed to be "get rid of site name" not "get ride of site name"

ghost

provide a facility to get all text in a webpage

Hi, very good tool you have here. Since every user will need the whole text or part of it, it would be better to have a function to get the whole...

aqp

Bad cleaned text extraction

1

The cleaned text of the page http://www.galeria-kaufhof.de/store/p/Tefal-Herzwaffeleisen-WM-310-D/1004259860 is ` u'Mit Ihrer Hilfe k\xf6nnen wir besser werden.\n\nSie haben uns damit sehr geholfen. Bitte beachten Sie, dass wir Ihnen Ihr Feedback nicht...

slitayem

non-complete body extraction

http://edition.cnn.com/2015/01/07/football/steven-gerrard-la-galaxy/index.html The body should be extracted as "(CNN) From Liverpool to Los Angeles -- Steven Gerrard's move to California has been confirmed. The former England captain will play Major League...

canhduong28

Improve title cleaning

Fix for issue #196. If the title still contains TITLE_SPLITTER after cleaning, fall back to the old title-cleaning algorithm (from the previous version of goose).

vetal4444
