Clarification on how raw_html gets extracted

Open konradkonrad opened this issue 9 years ago • 0 comments

Hi! I'd like some clarification on what to expect from Goose().extract(raw_html=...). I tried it with u"<body>Hello World</body>" and cleaned_text came back empty, so I crafted the example below:

In [83]: Goose().extract(url=None, raw_html=u"<html><head><title>Hello</title></head>"
  u"<body><h1>Hello World.</h1> If the text gets longer it may work... or maybe not? "
  u"Come on! This is the longest text part in the whole body. How can you ignore me? "
  u"It really is! I could go on here for ever and you don't see me??? "
  u"<div>HTML without divs is probably old-fashioned?</div>" 
  u"<div class='content'>You are not looking for keyword classes?</div>"
  u"<div class='unimportant'>But what now? Is here even more content?</div>"
  u"And everything outside is lost?</body></html>").cleaned_text
Out[83]: u'HTML without divs is probably old-fashioned?\n\nYou are not looking for keyword classes?\n\nBut what now? Is here even more content?'
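
To make the pattern in that output easier to see, here is a minimal sketch that lists which text nodes sit directly under <body> and which are wrapped in a <div>. It assumes Python 3 and uses only the standard library's html.parser, not Goose, so it says nothing about how Goose actually scores content. The names TextByParent and text_by_parent are hypothetical, and SAMPLE is a shortened version of the markup above.

```python
from html.parser import HTMLParser


class TextByParent(HTMLParser):
    """Record each non-empty text node with the tag that directly encloses it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.nodes = []   # (parent_tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            parent = self.stack[-1] if self.stack else None
            self.nodes.append((parent, text))


def text_by_parent(html):
    """Return (parent_tag, text) pairs for every non-empty text node."""
    parser = TextByParent()
    parser.feed(html)
    parser.close()
    return parser.nodes


# Shortened version of the markup from the session above.
SAMPLE = (
    "<html><head><title>Hello</title></head>"
    "<body><h1>Hello World.</h1> If the text gets longer it may work... "
    "<div>HTML without divs is probably old-fashioned?</div>"
    "<div class='content'>You are not looking for keyword classes?</div>"
    "And everything outside is lost?</body></html>"
)

nodes = text_by_parent(SAMPLE)
bare = [text for parent, text in nodes if parent == "body"]    # text directly under <body>
wrapped = [text for parent, text in nodes if parent == "div"]  # text inside a <div>

if __name__ == "__main__":
    print("bare:", bare)
    print("wrapped:", wrapped)
```

In the session above, only the text that this sketch would classify as wrapped shows up in cleaned_text; the heading and the bare text under <body> are missing.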

To paraphrase my question: is there some minimum markup required for body text to get extracted?

konradkonrad, Feb 27 '15 12:02