python-goose

HTML Content / Article Extractor, web scraping lib in Python

Results 100 python-goose issues

```
>>> from goose import Goose
>>> url = 'http://www.nytimes.com/2015/05/02/nyregion/christie-ally-expected-to-plead-guilty-in-george-washington-bridge-lane-closing-case.html?hp&action=click&pgtype=Homepage&module=span-ab-lede-package-region&region=top-news&WT.nav=top-news'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u''
>>> article.meta_description
u''
>>> article.cleaned_text
u''
```

It seems that article.cleaned_text, from #135, does not work; article_text is always empty.

python-goose doesn't work on Turkish web pages without a Turkish stopwords resource file.
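For context on why a missing stopwords file matters: stopword-based extractors typically score each candidate text block by how many common function words it contains, so with an empty stopword list every block scores zero and nothing is selected. The following is a minimal sketch of that idea, not Goose's actual code; `count_stopwords`, `pick_content_blocks`, the tiny Turkish word list, and the threshold are all illustrative assumptions.

```python
import re

# Illustrative sample only; a real resource file would list many more words.
TURKISH_STOPWORDS = {"ve", "bir", "bu", "için", "ile", "da", "de"}


def count_stopwords(text, stopwords):
    """Count how many tokens in `text` appear in `stopwords`."""
    tokens = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    return sum(1 for token in tokens if token in stopwords)


def pick_content_blocks(blocks, stopwords, min_stopwords=2):
    """Keep only blocks containing more than `min_stopwords` stopwords.

    The threshold is a hypothetical value chosen for this sketch.
    """
    return [b for b in blocks if count_stopwords(b, stopwords) > min_stopwords]


blocks = [
    "Ana Sayfa | Spor | Ekonomi",  # navigation-like text, no stopwords
    "Bu haber bir deneme için yazıldı ve site ile birlikte yayınlandı.",
]

# With a stopword list, the sentence-like block is kept.
with_list = pick_content_blocks(blocks, TURKISH_STOPWORDS)

# With no stopword list, every block scores zero and nothing survives.
without_list = pick_content_blocks(blocks, set())
```

Under these assumptions, the empty result in the second call mirrors the symptom reported for languages that have no stopwords resource.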

Fix IndexError if title is the same as site_name and add test for this case. Fix for #194.
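The failure mode described here can be illustrated with a small sketch. It assumes, hypothetically, that title cleaning strips the site name from the title and then indexes into the remaining words; `clean_title` and `TITLE_SPLITTERS` below are illustrative names, not a copy of the project's code or of the actual patch.

```python
# Hypothetical separator tokens that may be left over around a title.
TITLE_SPLITTERS = ("|", "-", "»", ":")


def clean_title(title, site_name=None):
    """Strip the site name and stray separators from a page title."""
    if site_name:
        title = title.replace(site_name, "").strip()

    words = title.split()
    # Guard: if the title was identical to the site name, nothing is left,
    # and indexing words[0] below would raise IndexError.
    if not words:
        return ""

    # Drop a leading or trailing separator left behind by the removal.
    if words[0] in TITLE_SPLITTERS:
        words.pop(0)
    if words and words[-1] in TITLE_SPLITTERS:
        words.pop()
    return " ".join(words).strip()
```

For example, `clean_title("Example News", site_name="Example News")` returns an empty string instead of raising, while a title such as `"Big story | Example News"` is reduced to `"Big story"`.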

Hi, I tried extracting the content for articles from http://www.clarin.com, but goose was unable to extract any content from any article under the clarin.com domain (like http://www.clarin.com/politica/Luego-Cristina-Lorenzetti-apertura-judicial_0_1313868802.html). Goose always returns...

I am interested in extracting article datelines. Goose removes them when they are in a different tag from the one that holds the main text of the article. Is there to...

Hi, I use goose to extract images from a Chinese news site. Some news articles don't have images, but goose gives me one from the sidebar of the page. For...

```
  File "goose/__init__.py", line 37, in __init__
    self.initialize()
  File "goose/__init__.py", line 81, in initialize
    os.remove(path)
```

Separately, is there a way to use Goose without the need for tempfiles?

Hi, I am working on my undergrad research thesis and using the goose extractor. Goose is really a commendable tool. However, I have a midterm presentation regarding my thesis and I...