html-text
guess_layout does not work on XHTML elements
After the failure of extract_text in #24, I tried etree_to_text.
That ran without raising an exception, but guess_layout has no effect: no newlines are added after the tags listed in NEWLINE_TAGS and DOUBLE_NEWLINE_TAGS.
I think it's because element.tag includes the tag's XML namespace, so it doesn't match the namespaceless NEWLINE_TAGS and DOUBLE_NEWLINE_TAGS.
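A quick way to check this suspicion, sketched with the standard library's xml.etree.ElementTree (which reports namespaced tags in the same {uri}local form as lxml). PLAIN_TAGS is a hypothetical stand-in for the library's sets, not their real contents:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the namespaceless tag sets.
PLAIN_TAGS = {"p", "div"}

doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>text</p></body></html>'
)
p = doc[0][0]  # html -> body -> p

# The default namespace is folded into the tag name...
tag = p.tag  # '{http://www.w3.org/1999/xhtml}p'

# ...so a plain membership test against bare names fails.
matches = tag in PLAIN_TAGS  # False
```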
Test:
from lxml import etree
from html_text import extract_text, etree_to_text

def test_guess_layout():
    # Same document as the existing layout test, but with the XHTML namespace.
    xhtml = (u'<html xmlns="http://www.w3.org/1999/xhtml">'
             '<head><title> title </title></head>'
             '<body><div>text_1.<p>text_2 text_3</p>'
             '<p id="demo"></p><ul><li>text_4</li><li>text_5</li></ul>'
             '<p>text_6<em>text_7</em>text_8</p>text_9</div>'
             '<script>document.getElementById("demo").innerHTML = '
             '"This should be skipped";</script> <p>...text_10</p>'
             '</body></html>')
    text = ('title\n\ntext_1.\n\ntext_2 text_3\n\ntext_4\ntext_5'
            '\n\ntext_6 text_7 text_8\n\ntext_9\n\n...text_10')
    assert extract_text(xhtml, guess_punct_space=False,
                        guess_layout=True) == text
    # Parsing with XMLParser keeps the namespace in every element.tag.
    root = etree.fromstring(xhtml, parser=etree.XMLParser())
    assert etree_to_text(root, guess_layout=True) == text
This could be handled either by altering traverse_text_fragments to compare the tag's local name (using etree.QName), or by adding to the NEWLINE_TAGS and DOUBLE_NEWLINE_TAGS sets a duplicate of each tag with {http://www.w3.org/1999/xhtml} prepended.
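A rough sketch of both options, not the library's actual code. It uses the standard library's xml.etree.ElementTree so it runs without lxml; with lxml, etree.QName(element).localname should give the same result as the hypothetical local_name helper. NEWLINE_TAGS here is a small stand-in for the real set, and with_xhtml_namespace is likewise an illustrative name:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Stand-in for the library's set; the real NEWLINE_TAGS has more entries.
NEWLINE_TAGS = frozenset(["li", "div", "title"])


def local_name(tag):
    """Option 1: strip a leading '{namespace}' prefix from a tag name."""
    if not isinstance(tag, str):
        # Comments and processing instructions have non-string tags.
        return tag
    return tag.rpartition("}")[2] if tag.startswith("{") else tag


def with_xhtml_namespace(tags):
    """Option 2: add a '{xhtml-ns}tag' duplicate for every tag in the set."""
    return frozenset(tags) | frozenset(
        "{%s}%s" % (XHTML_NS, t) for t in tags
    )


root = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><div>text</div></body></html>'
)
div = root[0][0]  # html -> body -> div

# Option 1: compare the local name against the existing set.
option_1_matches = local_name(div.tag) in NEWLINE_TAGS

# Option 2: compare the raw tag against an extended set.
option_2_matches = div.tag in with_xhtml_namespace(NEWLINE_TAGS)
```

Option 1 handles any namespace, and does so on each element visited; option 2 keeps the traversal untouched but only covers the XHTML namespace.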