selectolax Node.text() does not respect changes from Node.unwrap

It seems that node.text() does not respect the mutated node after calling node.unwrap_tags. I'd expect the following to pass

from selectolax.parser import HTMLParser


tree = HTMLParser("<div><p><strong>J</strong>ohn</p><p>Doe</p></div>")
tree.unwrap_tags(["strong"])
node = tree.css_first("div", strict=True)
assert node.html == "<div><p>John</p><p>Doe</p></div>"
text = tree.text(deep=True, separator=" ", strip=True)
assert text == "John Doe", f"{text} != John Doe"

but instead I get

AssertionError: J ohn Doe  != John Doe

Sep 06 '22 15:09 NixBiks

I have quite the ugly workaround currently (parsing the html twice)

pre_tree = tree = HTMLParser("<div><p><strong>J</strong>ohn</p><p>Doe</p></div>")
pre_tree.unwrap_tags(["strong"])

tree = HTMLParser(pre_tree.html)
node = tree.css_first("div", strict=True)
assert node.html == "<div><p>John</p><p>Doe</p></div>"
text = tree.text(deep=True, separator=" ", strip=True)
assert text == "John Doe", f"{text} != John Doe"

Sep 06 '22 15:09 NixBiks

That happens because you get two text nodes close to each other when removing the strong tag.

for node in tree.root.traverse(include_text=True):
    print(node.text(deep=False), node.tag)
 html
 head
 body
 div
John p
J -text
ohn -text
Doe p
Doe -text

I'm not sure what to do about it yet. Technically, I think this behaviour is correct, but for the majority of users it would be unexpected. To make it work, we need to merge two text nodes.

Sep 07 '22 04:09 rushter

I get the same behavior in Chrome:

Sep 07 '22 05:09 rushter

Ahh I see. I guess I'd need to iterate through the node and merge all -text siblings before. Not sure if this should be part of node.text?

Sep 07 '22 07:09 NixBiks

I think adding a separate merge_text_nodes method will be sufficient.

Sep 07 '22 13:09 rushter

Added a PoC: https://github.com/rushter/selectolax/blob/abc9a3c2115c4d86108be1df182cdf4d38460da1/tests/test_nodes.py#L537

It needs more tests.

Sep 07 '22 19:09 rushter

I've made a new release that supports merge_text_nodes.

Sep 20 '22 06:09 rushter

@rushter you're a legend !

Sep 20 '22 06:09 NixBiks

Shouldn't it be added to parser.pyi as well?

Sep 20 '22 07:09 NixBiks

Yes, I've updated it. Thank you for checking it.

Sep 20 '22 08:09 rushter

I'm afraid there is a memory issue with this thing:

python(94018,0x101508580) malloc: Incorrect checksum for freed object 0x129e29870: probably modified after being freed. Corrupt value: 0xb0000000129e3f00 python(94018,0x101508580) malloc: *** set a breakpoint in malloc_error_break to debug

This is what I apply to some html

def html_to_text(html: str):
    tree = HTMLParser(html)
    tree.unwrap_tags(DEFAULT_UNWRAP_TAGS)
    tree.strip_tags(DEFAULT_REMOVE_TAGS)
    body = tree.body
    if body is None:
        raise Exception("No body")
    body.merge_text_nodes()
    return body.text(separator="\n", strip=True)

I'm running on Macbook Pro M1 - not sure if that is relevant

Sep 20 '22 08:09 NixBiks

I'm afraid there is a memory issue with this thing:

python(94018,0x101508580) malloc: Incorrect checksum for freed object 0x129e29870: probably modified after being freed. Corrupt value: 0xb0000000129e3f00 python(94018,0x101508580) malloc: *** set a breakpoint in malloc_error_break to debug

This is what I apply to some html
def html_to_text(html: str):
    tree = HTMLParser(html)
    tree.unwrap_tags(DEFAULT_UNWRAP_TAGS)
    tree.strip_tags(DEFAULT_REMOVE_TAGS)
    body = tree.body
    if body is None:
        raise Exception("No body")
    body.merge_text_nodes()
    return body.text(separator="\n", strip=True)
I'm running on Macbook Pro M1 - not sure if that is relevant

Is it possible to provide the HTML that crashes it?

Sep 20 '22 08:09 rushter

Is it possible to provide the HTML that crashes it?

Yes but it is inconstent. I can send you a JSON file with 300'ish HTML. When I loop through then it crashes at random times.

Sep 20 '22 09:09 NixBiks

Ok, please send HTML to me.

Sep 20 '22 10:09 rushter

I've just sent you a JSON with a array with only one object that has html key. If I run the code on that html multiple times then it breaks. It's random how many times. In the script I sent it is running over the html 300 times. Script here for reference

import json
from selectolax.parser import HTMLParser


DEFAULT_UNWRAP_TAGS = [
    "a",
    "abbr",
    "acronym",
    "b",
    "bdo",
    "big",
    "br",
    "button",
    "cite",
    "code",
    "dfn",
    "em",
    "i",
    "img",
    "input",
    "kbd",
    "label",
    "map",
    "object",
    "output",
    "q",
    "samp",
    "script",
    "select",
    "small",
    "span",
    "strong",
    "textarea",
    "time",
    "tt",
    "var",
]
DEFAULT_REMOVE_TAGS = ["sub", "sup", "table"]


with open("html.json", "r") as f:
    data = json.load(f)

html = data[0]["html"]

def html_to_text(html: str) -> str:
    tree = HTMLParser(html)
    tree.unwrap_tags(DEFAULT_UNWRAP_TAGS)
    tree.strip_tags(DEFAULT_REMOVE_TAGS)
    body = tree.body
    if body is None:
        raise Exception("No body")
    body.merge_text_nodes()
    return body.text(separator="\n", strip=True)

for i in range(300):
    print(i)
    html_to_text(html)

Sep 20 '22 10:09 NixBiks

I think I fixed it. Can you please double check by clonning the git repo and installing dev version?

Sep 20 '22 19:09 rushter

I think I fixed it. Can you please double check by clonning the git repo and installing dev version?

I agree. It works !

Sep 21 '22 09:09 NixBiks

selectolax
selectolax copied to clipboard

Node.text() does not respect changes from Node.unwrap_tags

selectolax selectolax copied to clipboard

Node.text() does not respect changes from Node.unwrap_tags

selectolax
selectolax copied to clipboard