selectolax
selectolax copied to clipboard
Node.text() does not respect changes from Node.unwrap_tags
It seems that node.text() does not respect the mutated node after calling node.unwrap_tags. I'd expect the following to pass
from selectolax.parser import HTMLParser
tree = HTMLParser("<div><p><strong>J</strong>ohn</p><p>Doe</p></div>")
tree.unwrap_tags(["strong"])
node = tree.css_first("div", strict=True)
assert node.html == "<div><p>John</p><p>Doe</p></div>"
text = tree.text(deep=True, separator=" ", strip=True)
assert text == "John Doe", f"{text} != John Doe"
but instead I get
AssertionError: J ohn Doe != John Doe
I have quite the ugly workaround currently (parsing the html twice)
pre_tree = tree = HTMLParser("<div><p><strong>J</strong>ohn</p><p>Doe</p></div>")
pre_tree.unwrap_tags(["strong"])
tree = HTMLParser(pre_tree.html)
node = tree.css_first("div", strict=True)
assert node.html == "<div><p>John</p><p>Doe</p></div>"
text = tree.text(deep=True, separator=" ", strip=True)
assert text == "John Doe", f"{text} != John Doe"
That happens because you get two text nodes close to each other when removing the strong tag.
for node in tree.root.traverse(include_text=True):
print(node.text(deep=False), node.tag)
html
head
body
div
John p
J -text
ohn -text
Doe p
Doe -text
I'm not sure what to do about it yet. Technically, I think this behaviour is correct, but for the majority of users it would be unexpected. To make it work, we need to merge two text nodes.
I get the same behavior in Chrome:


Ahh I see. I guess I'd need to iterate through the node and merge all -text siblings before. Not sure if this should be part of node.text?
I think adding a separate merge_text_nodes method will be sufficient.
Added a PoC: https://github.com/rushter/selectolax/blob/abc9a3c2115c4d86108be1df182cdf4d38460da1/tests/test_nodes.py#L537
It needs more tests.
I've made a new release that supports merge_text_nodes.
@rushter you're a legend !
Shouldn't it be added to parser.pyi as well?
Yes, I've updated it. Thank you for checking it.
I'm afraid there is a memory issue with this thing:
python(94018,0x101508580) malloc: Incorrect checksum for freed object 0x129e29870: probably modified after being freed. Corrupt value: 0xb0000000129e3f00 python(94018,0x101508580) malloc: *** set a breakpoint in malloc_error_break to debug
This is what I apply to some html
def html_to_text(html: str):
tree = HTMLParser(html)
tree.unwrap_tags(DEFAULT_UNWRAP_TAGS)
tree.strip_tags(DEFAULT_REMOVE_TAGS)
body = tree.body
if body is None:
raise Exception("No body")
body.merge_text_nodes()
return body.text(separator="\n", strip=True)
I'm running on Macbook Pro M1 - not sure if that is relevant
I'm afraid there is a memory issue with this thing:
python(94018,0x101508580) malloc: Incorrect checksum for freed object 0x129e29870: probably modified after being freed. Corrupt value: 0xb0000000129e3f00 python(94018,0x101508580) malloc: *** set a breakpoint in malloc_error_break to debug
This is what I apply to some html
def html_to_text(html: str): tree = HTMLParser(html) tree.unwrap_tags(DEFAULT_UNWRAP_TAGS) tree.strip_tags(DEFAULT_REMOVE_TAGS) body = tree.body if body is None: raise Exception("No body") body.merge_text_nodes() return body.text(separator="\n", strip=True)I'm running on Macbook Pro M1 - not sure if that is relevant
Is it possible to provide the HTML that crashes it?
Is it possible to provide the HTML that crashes it?
Yes but it is inconstent. I can send you a JSON file with 300'ish HTML. When I loop through then it crashes at random times.
Ok, please send HTML to me.
I've just sent you a JSON with a array with only one object that has html key. If I run the code on that html multiple times then it breaks. It's random how many times. In the script I sent it is running over the html 300 times. Script here for reference
import json
from selectolax.parser import HTMLParser
DEFAULT_UNWRAP_TAGS = [
"a",
"abbr",
"acronym",
"b",
"bdo",
"big",
"br",
"button",
"cite",
"code",
"dfn",
"em",
"i",
"img",
"input",
"kbd",
"label",
"map",
"object",
"output",
"q",
"samp",
"script",
"select",
"small",
"span",
"strong",
"textarea",
"time",
"tt",
"var",
]
DEFAULT_REMOVE_TAGS = ["sub", "sup", "table"]
with open("html.json", "r") as f:
data = json.load(f)
html = data[0]["html"]
def html_to_text(html: str) -> str:
tree = HTMLParser(html)
tree.unwrap_tags(DEFAULT_UNWRAP_TAGS)
tree.strip_tags(DEFAULT_REMOVE_TAGS)
body = tree.body
if body is None:
raise Exception("No body")
body.merge_text_nodes()
return body.text(separator="\n", strip=True)
for i in range(300):
print(i)
html_to_text(html)
I think I fixed it. Can you please double check by clonning the git repo and installing dev version?
I think I fixed it. Can you please double check by clonning the git repo and installing dev version?
I agree. It works !