python-goose

HTML Content / Article Extractor, web scraping lib in Python

Results 100 python-goose issues

Join the replacetext with whitespace; otherwise two different sentences can get mixed together. For example:
url = 'http://timesofindia.indiatimes.com/tech/tech-news/Facebooks-Mark-Zuckerberg-in-India-today/articleshow/44740431.cms'
from goose import Goose
g = Goose()
a = g.extract(url)
a.cleaned_text
u'NEW...
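
The whitespace-joining idea can be sketched outside goose with the standard library alone. This is only an illustration: `TextCollector` and `text_with_spaces` are hypothetical names, not part of goose, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def text_with_spaces(html):
    """Return the text of `html` with one space between text nodes."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    # Joining with "" would glue the last word of one node to the first
    # word of the next; a single space keeps sentences apart.
    return " ".join(collector.chunks)
```

Joining with a space instead of an empty string is the whole point: text from adjacent tags no longer runs together.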

Because extraction relies on content and link density analysis, the process is open to errors. The ability to override or seed specific elements would allow more accurate...

HTML: It would be beneficial to return the inner HTML of the article candidate tag as well as the cleaned text version. Images Array: It would be beneficial to return...

There is a repeatable error with some malformed HTML language meta tags that cause an IOError within goose. This is due to trusting the meta tag input in the OutputFormatter.get_language...
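
One way to avoid trusting the meta tag blindly is to validate its value before using it. The sketch below is an assumption about how such a guard could look, not goose's actual code: `normalize_language`, `KNOWN_LANGUAGES`, and the default of `"en"` are all hypothetical.

```python
import re

# Illustrative set only; a real list would mirror the languages
# the extractor actually ships resources for.
KNOWN_LANGUAGES = {"en", "es", "fr", "de", "ru"}

# Two-letter code, optionally followed by subtags such as "-US" or "_BR".
_LANG_RE = re.compile(r"^\s*([A-Za-z]{2})(?:[-_][A-Za-z0-9]+)*\s*$")


def normalize_language(meta_value, default="en"):
    """Return a known two-letter language code, or `default` if the
    meta tag value is missing, malformed, or unsupported."""
    if not meta_value:
        return default
    match = _LANG_RE.match(meta_value)
    if not match:
        return default
    code = match.group(1).lower()
    return code if code in KNOWN_LANGUAGES else default
```

Falling back to a default keeps a malformed tag from ever reaching code that opens language-specific files.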

I tried to extract a Russian article, but goose produced an empty result. While debugging, I found that the extracted content (text from the p tag) cannot be found...

Cleaned text looks clumsy because paragraphs are not separated; add a new line for each paragraph.
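
A minimal sketch of the requested behaviour, written against the standard library rather than goose itself: text is gathered per `<p>` element and the paragraphs are joined with a blank line. `ParagraphCollector` and `paragraphs_to_text` are hypothetical names used for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Gather the text of each <p> element as a separate paragraph."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None  # None while outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # tolerate a <p> that was never closed
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def flush(self):
        """Store the paragraph in progress, collapsing runs of whitespace."""
        if self._current is not None:
            text = " ".join("".join(self._current).split())
            if text:
                self.paragraphs.append(text)
        self._current = None


def paragraphs_to_text(html):
    """Return paragraph texts separated by a blank line."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    return "\n\n".join(collector.paragraphs)
```

Whether one newline or a blank line is the better separator is a presentation choice; the sketch uses a blank line so paragraphs stay visually distinct.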

Are there any plans to change the `beautifulsoup` dependency to `beautifulsoup4` for Python 3 support? Or are there other factors as well before this will be py3 compatible?

A few tweaks in this PR:
- leave local image comparison as a last resort, as it's the most costly
- added timeout to image fetching as some badly behaved...

It would be fantastic to have the option to extract article data using Schema.org with a fallback to OpenGraph. Example - http://www.wired.com/2014/05/star-wars-storyboards-video/ Wired makes effective use of schema.org as seen...
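
The fallback order described here could be sketched as follows. This is a simplified illustration, not a goose feature: it only looks at `<meta>` tags (Schema.org microdata can also live on other elements), and `MetaCollector` and `extract_field` are hypothetical names.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect Schema.org (itemprop) and OpenGraph (og:*) <meta> values."""

    def __init__(self):
        super().__init__()
        self.schema = {}
        self.opengraph = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if content is None:
            return
        itemprop = attributes.get("itemprop")
        if itemprop:
            # Keep the first value seen for each property.
            self.schema.setdefault(itemprop, content)
        prop = attributes.get("property") or ""
        if prop.startswith("og:"):
            self.opengraph.setdefault(prop[3:], content)


def extract_field(html, schema_name, og_name):
    """Prefer the Schema.org value; fall back to OpenGraph; else None."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.schema.get(schema_name) or collector.opengraph.get(og_name)
```

For a title, for instance, one might call `extract_field(html, "headline", "title")`, so a page with both markups yields the Schema.org headline and a page with only OpenGraph yields `og:title`.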

http://www.thekitchn.com/recipe-morel-mushroom-amp-leek-quesadillas-recipes-from-the-kitchn-203684 The page has a body div with class="post-body branded-links". Because the class contains "links", the div is identified as a bad tag by the clean_bad_tags() method in cleaners.py, so extraction breaks. Let me know if...
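
The difference between substring matching and whole-token matching on a class attribute can be shown in a few lines. This is not the pattern goose uses in cleaners.py; `BAD_CLASS_TOKENS` and both functions are hypothetical and exist only to illustrate why "branded-links" trips a substring check.

```python
# Illustrative blacklist; the real list in any cleaner would differ.
BAD_CLASS_TOKENS = {"links", "sidebar", "comment"}


def substring_match(class_attr):
    """Flag the element if any bad token appears anywhere in the attribute."""
    return any(token in class_attr for token in BAD_CLASS_TOKENS)


def token_match(class_attr):
    """Flag the element only if a whole class name equals a bad token."""
    return any(name in BAD_CLASS_TOKENS for name in class_attr.split())
```

With `class="post-body branded-links"`, the substring check flags the article body while the token check leaves it alone. Token matching is stricter, so it may also miss names such as "footer-links" that a cleaner does want to drop.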