python-goose
Clarification on how raw_html gets extracted
Hi! I just wanted to get some clarification on what to expect from Goose().extract(raw_html=...). I tried it with u"<body>Hello World</body>" and cleaned_text was empty, so I crafted the example below:
In [83]: Goose().extract(url=None, raw_html=u"<html><head><title>Hello</title></head>"
u"<body><h1>Hello World.</h1> If the text gets longer it may work... or maybe not? "
u"Come on! This is the longest text part in the whole body. How can you ignore me? "
u"It really is! I could go on here for ever and you don't see me??? "
u"<div>HTML without divs is probably old-fashioned?</div>"
u"<div class='content'>You are not looking for keyword classes?</div>"
u"<div class='unimportant'>But what now? Is here even more content?</div>"
u"And everything outside is lost?</body></html>").cleaned_text
Out[83]: u'HTML without divs is probably old-fashioned?\n\nYou are not looking for keyword classes?\n\nBut what now? Is here even more content?'
To paraphrase my question: is there some minimum markup needed for body text to get extracted?
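To make the pattern in the output above easier to see, here is a minimal sketch that lists each text node together with the tag that directly encloses it. It uses only the Python 3 standard library (html.parser) and does not call Goose or reproduce its algorithm; the names TextByParent, text_by_parent and the sample string are made up for this illustration. It only shows which pieces of text sit bare under body and which are wrapped in an element like div.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextByParent(HTMLParser):
    """Record each non-blank text node together with its enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, innermost last
        self.nodes = []  # (enclosing tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            parent = self.stack[-1] if self.stack else None
            self.nodes.append((parent, text))


def text_by_parent(html):
    """Return (enclosing tag, text) pairs for every non-blank text node."""
    parser = TextByParent()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.nodes


# Hypothetical sample, shaped like the example in the question.
sample = (
    "<html><head><title>Hello</title></head>"
    "<body><h1>Hello World.</h1> Bare text directly under body. "
    "<div>Text inside a div.</div>"
    "Trailing bare text.</body></html>"
)

for parent, text in text_by_parent(sample):
    print(parent, "->", text)
```

In the session above, only the text wrapped in div elements came back in cleaned_text, while the text whose direct parent is body did not. Running this helper on an input shows which category each piece of text falls into before handing it to Goose.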