python-goose issues

Goose fails on nytimes articles

2

```
>>> from goose import Goose
>>> url = 'http://www.nytimes.com/2015/05/02/nyregion/christie-ally-expected-to-plead-guilty-in-george-washington-bridge-lane-closing-case.html?hp&action=click&pgtype=Homepage&module=span-ab-lede-package-region&region=top-news&WT.nav=top-news'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u''
>>> article.meta_description
u''
>>> article.cleaned_text
u''
```

lsemel

Russian articles are not extracted

It seems that article.cleaned_text, as described in #135, does not work; article_text is always empty.

szhem

Turkish stopwords added

Python goose doesn't work on Turkish web pages without a Turkish stopwords resource file.

ufukk

Fix title extraction if title is same as site_name

1

Fix IndexError if title is the same as site_name and add test for this case. Fix for #194.

vetal4444
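
The failure mode named in this fix (a title that is identical to the site name) can be guarded against roughly as follows. This is a minimal sketch and not the code from the pull request: the function name `clean_title`, its parameters, and the separator list are all hypothetical, and Goose's real title handling may differ.

```python
def clean_title(title, site_name=None, separators=(" | ", " - ", " : ")):
    """Drop the site name from a page title without ever returning an empty title.

    Hypothetical helper for illustration; it is not part of Goose's API.
    """
    title = title.strip()
    site = site_name.strip() if site_name else ""

    # If there is no site name, or the title IS the site name, keep the title
    # as-is. Stripping it here would leave nothing to index into.
    if not site or title.lower() == site.lower():
        return title

    for sep in separators:
        parts = [p.strip() for p in title.split(sep)]
        kept = [p for p in parts if p and p.lower() != site.lower()]
        # Only accept the split if it removed the site name and left something.
        if kept and len(kept) < len(parts):
            return sep.join(kept)

    return title
```

Called with `clean_title("Example News", "Example News")`, the sketch returns the title unchanged instead of trying to pick a segment from an empty list.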

No Text Extracted for articles from domain http://www.clarin.com

1

Hi, I tried extracting the content for articles from http://www.clarin.com, but goose was unable to extract any content from any article under the clarin.com domain (like http://www.clarin.com/politica/Luego-Cristina-Lorenzetti-apertura-judicial_0_1313868802.html). Goose always returns...

sathappanspm

Dateline in articles

I am interested in extracting article datelines. Goose removes them when in a tag different from the one where the main text of the article is located. Is there to...

cvelascorivera

Bad case for image extraction

1

Hi, I use goose to extract images from a Chinese news site. Some news articles don't have images, but goose gives me one from the sidebar of the page. For...

stephenLee

Og site_name issue

#203

grangier

Getting a No Such File or Directory error

1

```
  File "goose/__init__.py", line 37, in __init__
    self.initialize()
  File "goose/__init__.py", line 81, in initialize
    os.remove(path)
```

Separately, is there a way to use Goose without the need for tempfiles?

lsemel
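
For the "No Such File or Directory" report above, the traceback ends at an `os.remove(path)` call. One common way to make such a call tolerate a file that is already gone is sketched below. It assumes the error is an ordinary missing-file `OSError`; the helper name `remove_if_exists` is hypothetical and this is not Goose's actual code.

```python
import errno
import os
import tempfile


def remove_if_exists(path):
    """Remove path, returning True if it was removed and False if it was absent.

    Hypothetical helper for illustration; any error other than
    "no such file or directory" is re-raised.
    """
    try:
        os.remove(path)
    except OSError as exc:
        if exc.errno != errno.ENOENT:
            raise
        return False
    return True


if __name__ == "__main__":
    # Create a throwaway file, then remove it twice: the second call
    # finds nothing to delete and reports that instead of raising.
    fd, tmp_path = tempfile.mkstemp()
    os.close(fd)
    print(remove_if_exists(tmp_path))
    print(remove_if_exists(tmp_path))
```

Catching only `ENOENT` keeps permission errors and other failures visible rather than silently swallowing them.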

Algorithm used in goose?

2

Hi, I am working on my undergrad research thesis and using the goose extractor. Goose is really a commendable tool. However, I have a midterm presentation regarding my thesis and I...

IndianShifu
