python-goose issues

join sentences with a whitespace

2

Join the replacetext with a whitespace; otherwise two different sentences can run together. For example: `url = 'http://timesofindia.indiatimes.com/tech/tech-news/Facebooks-Mark-Zuckerberg-in-India-today/articleshow/44740431.cms'`; `from goose import Goose`; `g = Goose()`; `a = g.extract(url)`; `a.cleaned_text` returns u'NEW...

ankushshah89
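
To illustrate the problem described in this issue, here is a minimal sketch in plain Python. It does not use goose internals; the helper `join_fragments` and the sample fragments are hypothetical and only show why joining with a space keeps sentences apart.

```python
def join_fragments(fragments):
    """Join text fragments with a single space, skipping empty pieces."""
    # Strip each fragment first so the result never contains doubled spaces.
    cleaned = [f.strip() for f in fragments if f and f.strip()]
    return " ".join(cleaned)


# Hypothetical fragments, e.g. text pulled from two adjacent inline tags.
fragments = ["First sentence.", "Second sentence."]

naive = "".join(fragments)           # sentences run together
joined = join_fragments(fragments)   # sentences stay separated
```

With an empty-string join the two sentences fuse at the period, while the space-joined version keeps them readable.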

Enhancement : Overrides / Seeding

Because extraction relies on content and link density analysis, the process is open to errors. An ability to override or seed specific elements would allow more accurate...

ajmcgarry

Enhancement: Additional Article properties

HTML: It would be beneficial to return the inner HTML of the article candidate tag as well as the cleaned text version. Images Array: It would be beneficial to return...

ajmcgarry

Unchecked input language field can cause IOError

2

There is a repeatable error with some malformed HTML language meta tags that causes an IOError within goose. This is due to trusting the meta tag input in the OutputFormatter.get_language...

uncommoncode
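
One way to avoid trusting the meta tag value is to validate it before it is used anywhere else. The sketch below is an assumption about how such a check could look, not the project's actual fix: `safe_language`, `SUPPORTED_LANGUAGES`, and `DEFAULT_LANGUAGE` are hypothetical names.

```python
import re

# Hypothetical set of language codes the application can handle.
SUPPORTED_LANGUAGES = {"en", "es", "ru"}
DEFAULT_LANGUAGE = "en"


def safe_language(meta_lang, default=DEFAULT_LANGUAGE):
    """Return a two-letter language code, or the default if the input is unusable."""
    if not meta_lang:
        return default
    # Keep only the first two characters, e.g. "en-US" becomes "en".
    candidate = meta_lang.strip().lower()[:2]
    # Accept the value only if it is two ASCII letters and a known language.
    if re.fullmatch(r"[a-z]{2}", candidate) and candidate in SUPPORTED_LANGUAGES:
        return candidate
    return default
```

Falling back to a default keeps malformed input from propagating further; whether a silent fallback or an explicit error is preferable depends on the caller.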

Unicode encoding problems while checking stop words

3

I tried to extract a Russian article, but goose produced an empty result. While debugging, I found that the extracted content (text from the p tag) cannot be found...

vladimir-shmidt
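
The excerpt above is truncated, so the root cause is not shown here. As an illustration only, one common source of this kind of mismatch is comparing encoded bytes against unicode stop words. The sketch below uses Python 3 and hypothetical names (`STOP_WORDS`, `to_text`, `count_stop_words`); it is not goose's stop word code.

```python
# Tiny hypothetical sample of Russian stop words.
STOP_WORDS = {"и", "в", "не", "на"}


def to_text(value, encoding="utf-8"):
    """Decode bytes to str; pass str through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def count_stop_words(content):
    """Count stop words after normalizing the content to unicode text."""
    words = to_text(content).lower().split()
    return sum(1 for word in words if word in STOP_WORDS)


raw = "Он не в городе и не на даче".encode("utf-8")

# Comparing undecoded byte tokens against unicode stop words never matches.
undecoded_hits = sum(1 for word in raw.split() if word in STOP_WORDS)
decoded_hits = count_stop_words(raw)
```

Normalizing to one text type at the boundary means every later comparison operates on the same representation.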

Separating paragraphs with a new line

1

The cleaned text looks clumsy because paragraphs are not separated; add a new line after each paragraph.

gauravaror
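
The request can be sketched in plain Python as follows. The helper `join_paragraphs` is hypothetical and assumes the paragraphs are already available as a list of strings; it is not how goose assembles its output.

```python
def join_paragraphs(paragraphs, separator="\n"):
    """Join paragraphs with a separator, collapsing whitespace inside each one."""
    # Collapse internal runs of whitespace (including stray newlines) per paragraph.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    # Drop paragraphs that are empty after cleaning.
    return separator.join(p for p in cleaned if p)


text = join_paragraphs(["First  paragraph.", "", "Second\nparagraph."])
```

Passing `separator="\n\n"` instead would leave a blank line between paragraphs, which some readers find easier to scan.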

Switching to beautifulsoup4 for Python 3 support?

4

Are there any plans to change the `beautifulsoup` dependency to `beautifulsoup4` for Python 3 support? Or are there other factors as well before this will be py3 compatible?

frnsys

Change selection of top image

A few tweaks in this PR: - leave local image comparison as a last resort, as it's the most costly - added a timeout to image fetching, as some badly behaved...

mwjackson

Add support for semantic markup

3

It would be fantastic to have the option to extract article data using Schema.org with a fallback to OpenGraph. Example: http://www.wired.com/2014/05/star-wars-storyboards-video/ Wired makes effective use of schema.org as seen...

jeffnappi
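
The fallback order requested here (Schema.org first, then OpenGraph) can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: it only reads `<meta>` tags that carry a `content` attribute, and `MetaCollector` and `extract_title` are hypothetical names, not part of goose.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect Schema.org itemprop and OpenGraph values from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.schema = {}
        self.opengraph = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        content = attr.get("content")
        if not content:
            return
        itemprop = attr.get("itemprop")
        if itemprop:
            # Keep the first value seen for each property.
            self.schema.setdefault(itemprop, content)
        prop = attr.get("property") or ""
        if prop.startswith("og:"):
            self.opengraph.setdefault(prop[3:], content)


def extract_title(html):
    """Prefer the Schema.org headline, then fall back to og:title."""
    collector = MetaCollector()
    collector.feed(html)
    return collector.schema.get("headline") or collector.opengraph.get("title")
```

Real pages often put `itemprop` on visible elements rather than on `<meta>` tags, so a fuller version would also need to read element text.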

http://www.thekitchn.com/ not parsed due to class regex

http://www.thekitchn.com/recipe-morel-mushroom-amp-leek-quesadillas-recipes-from-the-kitchn-203684 The page has a body div with class="post-body branded-links". Because the class contains "links", the div is identified as a bad tag by the clean_bad_tags() method in cleaners.py, so extraction breaks. Let me know if...

kambanthemaker
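
The failure mode can be reproduced without goose. In the sketch below, the pattern `NAIVE_BAD` and the token set `BAD_TOKENS` are hypothetical stand-ins, not the actual expression used in cleaners.py; they only contrast substring matching with matching whole class tokens.

```python
import re

# Hypothetical substring-style pattern: matches "links" anywhere in the attribute.
NAIVE_BAD = re.compile(r"links|sidebar|comment", re.IGNORECASE)

# Hypothetical alternative: treat each whitespace-separated class as one token.
BAD_TOKENS = {"links", "sidebar", "comment"}


def naive_is_bad(class_attr):
    """Flag the element if any bad word appears anywhere in the class attribute."""
    return bool(NAIVE_BAD.search(class_attr))


def token_is_bad(class_attr):
    """Flag the element only if a whole class name is a bad word."""
    return any(token.lower() in BAD_TOKENS for token in class_attr.split())


article_class = "post-body branded-links"
```

Token matching leaves "branded-links" alone but also misses names such as "footer-links", so it trades recall for fewer false positives.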
