python-goose

HTML Content / Article Extractor, web scraping library in Python

Results 100 python-goose issues

Aren't we only looking for width, height, and number of bytes? A HEAD request gives you the Content-Length, and something like https://gist.github.com/atlithorn/6155288 for width and height could remove the need for...
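
The idea in the issue above can be sketched roughly as follows. This is a minimal illustration, not the linked gist's code and not anything in python-goose: the helper names `content_length` and `png_size` are hypothetical, the sketch is written for Python 3, and it handles PNG only (other formats need their own header parsing).

```python
import struct
import urllib.request

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def content_length(url):
    """Hypothetical helper: ask for headers only and read Content-Length.

    Returns None when the server does not send the header.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None


def png_size(header):
    """Hypothetical helper: read (width, height) from the start of a PNG.

    A PNG begins with an 8-byte signature, then the IHDR chunk:
    4-byte length, 4-byte type, 4-byte width, 4-byte height (big-endian).
    Only the first 24 bytes of the file are needed.
    """
    if len(header) < 24:
        return None
    if header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", header[16:24])
    return width, height
```

With this approach only the first few bytes of an image would have to be fetched (for example with an HTTP Range request, where the server supports it) instead of the whole file.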

Hi! I just wanted to get some clarification on what to expect from `Goose().extract(raw_html=...)`? I tried it with `u"Hello World"` and `cleaned_text` was empty. So I crafted the example below:...

I'm aware that there is no single universal list for stopwords. But the current [stopwords-id.txt](https://github.com/grangier/python-goose/blob/develop/goose%2Fresources%2Ftext%2Fstopwords-id.txt) (currently 1309 sloc) contains too many words that aren't stop words (eg: not [function words](http://en.wikipedia.org/wiki/Function_word))....

Tried the following, but only got the title, and no text: >>> from goose import Goose >>> url = 'http://householdproducts.nlm.nih.gov/cgi-bin/household/list?tbl=TblBrands&alpha=0' >>> g = Goose()...

This recursive call https://github.com/grangier/python-goose/blob/develop/goose/__init__.py#L69 is raising the exception `RuntimeError: maximum recursion depth exceeded in __instancecheck__`. The default recursion limit is 1000 and it can be exceeded.
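
The excerpt does not show what the call recurses over, so the following is only a generic sketch of the failure mode and one common way around it: rewriting the recursion as a loop. The nested-dict chain and the function names are hypothetical and unrelated to the python-goose code; the example targets Python 3, where the error is `RecursionError`, a subclass of `RuntimeError`.

```python
import sys


def make_chain(length):
    """Build a hypothetical chain of nested dicts: {"child": {"child": ...}}."""
    node = None
    for _ in range(length):
        node = {"child": node}
    return node


def depth_recursive(node):
    """Count chain length with one stack frame per level."""
    if node is None:
        return 0
    return 1 + depth_recursive(node.get("child"))


def depth_iterative(node):
    """Count chain length with a loop, so stack depth stays constant."""
    depth = 0
    while node is not None:
        depth += 1
        node = node.get("child")
    return depth
```

Raising the limit with `sys.setrecursionlimit` is another option, but it only moves the threshold; the iterative form removes the dependency on stack depth.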

On line 35, in the documentation found at https://github.com/grangier/python-goose/blob/develop/goose/extractors/title.py, it's supposed to be "get rid of site name" not "get ride of site name"

Hi, very good tool you have here, but since every user will need the whole text or part of it, it is better to have a function to get the whole...

The cleaned text of the page http://www.galeria-kaufhof.de/store/p/Tefal-Herzwaffeleisen-WM-310-D/1004259860 is ` u'Mit Ihrer Hilfe k\xf6nnen wir besser werden.\n\nSie haben uns damit sehr geholfen. Bitte beachten Sie, dass wir Ihnen Ihr Feedback nicht...

http://edition.cnn.com/2015/01/07/football/steven-gerrard-la-galaxy/index.html The body should be extracted as "(CNN) From Liverpool to Los Angeles -- Steven Gerrard's move to California has been confirmed. The former England captain will play Major League...

Fix for issue #196. If TITLE_SPLITTER is still in the title after title cleaning, use the old title-cleaning algorithm (from the previous version of goose).