python-goose
HTML Content / Article Extractor, web scraping lib in Python
```
>>> from goose import Goose
>>> url = 'http://www.nytimes.com/2015/05/02/nyregion/christie-ally-expected-to-plead-guilty-in-george-washington-bridge-lane-closing-case.html?hp&action=click&pgtype=Homepage&module=span-ab-lede-package-region&region=top-news&WT.nav=top-news'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u''
>>> article.meta_description
u''
>>> article.cleaned_text
u''
```
It seems that article.cleaned_text from #135 does not work. article_text is always empty.
python-goose doesn't work on Turkish web pages without a Turkish stopwords resource file.
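A possible explanation, assuming Goose ranks candidate text blocks by how many stopwords of the target language they contain: with no stopword list for a language, every block scores zero and nothing is selected. The sketch below illustrates that idea only. The `STOPWORDS` table and the `stopword_count` helper are hypothetical names made up for this example, and the word lists are tiny samples, not Goose's resource files or its actual scoring code.

```python
import re

# Hypothetical, tiny stopword samples for illustration only.
# These are NOT the resource files shipped with Goose.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "a", "is"},
    "tr": {"ve", "bir", "bu", "için", "ile", "da", "de"},
}


def stopword_count(text, language):
    """Count the tokens in `text` that are stopwords for `language`.

    A language with no entry in STOPWORDS yields 0 for any text,
    which mimics a missing stopwords resource file.
    """
    words = re.findall(r"\w+", text.lower())
    stopwords = STOPWORDS.get(language, set())
    return sum(1 for word in words if word in stopwords)


paragraph = "Bu bir deneme ve örnek için yazılmış bir cümle"
print(stopword_count(paragraph, "tr"))  # scored against the Turkish sample list
print(stopword_count(paragraph, "xx"))  # unknown language: nothing can match
```

Under this assumption, a Turkish paragraph scored against a missing or wrong-language list never accumulates a score, which would be consistent with the empty results reported here.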
Fix IndexError if title is the same as site_name and add test for this case. Fix for #194.
Hi, I tried extracting the content for articles from http://www.clarin.com, but goose was unable to extract any content from any article under the clarin.com domain (like http://www.clarin.com/politica/Luego-Cristina-Lorenzetti-apertura-judicial_0_1313868802.html). Goose always returns...
I am interested in extracting article datelines. Goose removes them when in a tag different from the one where the main text of the article is located. Is there to...
Hi, I use goose to extract images from a Chinese news site. Some news articles don't have images, but goose gives me one from the sidebar of the page. For...
#203
```
File "goose/__init__.py", line 37, in __init__
    self.initialize()
File "goose/__init__.py", line 81, in initialize
    os.remove(path)
```
Separately, is there a way to use Goose without the need for tempfiles?
Hi, I am working on my undergrad research thesis and using the goose extractor. Goose is really a commendable tool. However, I have a mid-term presentation regarding my thesis and I...