selectolax icon indicating copy to clipboard operation
selectolax copied to clipboard

Node.text() does not respect changes from Node.unwrap_tags

Open NixBiks opened this issue 3 years ago • 6 comments

It seems that node.text() does not respect the mutated node after calling node.unwrap_tags. I'd expect the following to pass

from selectolax.parser import HTMLParser


tree = HTMLParser("<div><p><strong>J</strong>ohn</p><p>Doe</p></div>")
tree.unwrap_tags(["strong"])
node = tree.css_first("div", strict=True)
assert node.html == "<div><p>John</p><p>Doe</p></div>"
text = tree.text(deep=True, separator=" ", strip=True)
assert text == "John Doe", f"{text} != John Doe"

but instead I get

AssertionError: J ohn Doe  != John Doe

NixBiks avatar Sep 06 '22 15:09 NixBiks

I have quite the ugly workaround currently (parsing the html twice)

pre_tree = tree = HTMLParser("<div><p><strong>J</strong>ohn</p><p>Doe</p></div>")
pre_tree.unwrap_tags(["strong"])

tree = HTMLParser(pre_tree.html)
node = tree.css_first("div", strict=True)
assert node.html == "<div><p>John</p><p>Doe</p></div>"
text = tree.text(deep=True, separator=" ", strip=True)
assert text == "John Doe", f"{text} != John Doe"

NixBiks avatar Sep 06 '22 15:09 NixBiks

That happens because you get two text nodes close to each other when removing the strong tag.

for node in tree.root.traverse(include_text=True):
    print(node.text(deep=False), node.tag)
 html
 head
 body
 div
John p
J -text
ohn -text
Doe p
Doe -text

I'm not sure what to do about it yet. Technically, I think this behaviour is correct, but for the majority of users it would be unexpected. To make it work, we need to merge two text nodes.

rushter avatar Sep 07 '22 04:09 rushter

I get the same behavior in Chrome: image

image

rushter avatar Sep 07 '22 05:09 rushter

Ahh I see. I guess I'd need to iterate through the node and merge all -text siblings before. Not sure if this should be part of node.text?

NixBiks avatar Sep 07 '22 07:09 NixBiks

I think adding a separate merge_text_nodes method will be sufficient.

rushter avatar Sep 07 '22 13:09 rushter

Added a PoC: https://github.com/rushter/selectolax/blob/abc9a3c2115c4d86108be1df182cdf4d38460da1/tests/test_nodes.py#L537

It needs more tests.

rushter avatar Sep 07 '22 19:09 rushter

I've made a new release that supports merge_text_nodes.

rushter avatar Sep 20 '22 06:09 rushter

@rushter you're a legend !

NixBiks avatar Sep 20 '22 06:09 NixBiks

Shouldn't it be added to parser.pyi as well?

NixBiks avatar Sep 20 '22 07:09 NixBiks

Yes, I've updated it. Thank you for checking it.

rushter avatar Sep 20 '22 08:09 rushter

I'm afraid there is a memory issue with this thing:

python(94018,0x101508580) malloc: Incorrect checksum for freed object 0x129e29870: probably modified after being freed. Corrupt value: 0xb0000000129e3f00 python(94018,0x101508580) malloc: *** set a breakpoint in malloc_error_break to debug

This is what I apply to some html

def html_to_text(html: str):
    tree = HTMLParser(html)
    tree.unwrap_tags(DEFAULT_UNWRAP_TAGS)
    tree.strip_tags(DEFAULT_REMOVE_TAGS)
    body = tree.body
    if body is None:
        raise Exception("No body")
    body.merge_text_nodes()
    return body.text(separator="\n", strip=True)

I'm running on Macbook Pro M1 - not sure if that is relevant

NixBiks avatar Sep 20 '22 08:09 NixBiks

I'm afraid there is a memory issue with this thing:

python(94018,0x101508580) malloc: Incorrect checksum for freed object 0x129e29870: probably modified after being freed. Corrupt value: 0xb0000000129e3f00 python(94018,0x101508580) malloc: *** set a breakpoint in malloc_error_break to debug

This is what I apply to some html

def html_to_text(html: str):
    tree = HTMLParser(html)
    tree.unwrap_tags(DEFAULT_UNWRAP_TAGS)
    tree.strip_tags(DEFAULT_REMOVE_TAGS)
    body = tree.body
    if body is None:
        raise Exception("No body")
    body.merge_text_nodes()
    return body.text(separator="\n", strip=True)

I'm running on Macbook Pro M1 - not sure if that is relevant

Is it possible to provide the HTML that crashes it?

rushter avatar Sep 20 '22 08:09 rushter

Is it possible to provide the HTML that crashes it?

Yes but it is inconstent. I can send you a JSON file with 300'ish HTML. When I loop through then it crashes at random times.

NixBiks avatar Sep 20 '22 09:09 NixBiks

Ok, please send HTML to me.

rushter avatar Sep 20 '22 10:09 rushter

I've just sent you a JSON with a array with only one object that has html key. If I run the code on that html multiple times then it breaks. It's random how many times. In the script I sent it is running over the html 300 times. Script here for reference

import json
from selectolax.parser import HTMLParser


DEFAULT_UNWRAP_TAGS = [
    "a",
    "abbr",
    "acronym",
    "b",
    "bdo",
    "big",
    "br",
    "button",
    "cite",
    "code",
    "dfn",
    "em",
    "i",
    "img",
    "input",
    "kbd",
    "label",
    "map",
    "object",
    "output",
    "q",
    "samp",
    "script",
    "select",
    "small",
    "span",
    "strong",
    "textarea",
    "time",
    "tt",
    "var",
]
DEFAULT_REMOVE_TAGS = ["sub", "sup", "table"]


with open("html.json", "r") as f:
    data = json.load(f)

html = data[0]["html"]

def html_to_text(html: str) -> str:
    tree = HTMLParser(html)
    tree.unwrap_tags(DEFAULT_UNWRAP_TAGS)
    tree.strip_tags(DEFAULT_REMOVE_TAGS)
    body = tree.body
    if body is None:
        raise Exception("No body")
    body.merge_text_nodes()
    return body.text(separator="\n", strip=True)

for i in range(300):
    print(i)
    html_to_text(html)

NixBiks avatar Sep 20 '22 10:09 NixBiks

I think I fixed it. Can you please double check by clonning the git repo and installing dev version?

rushter avatar Sep 20 '22 19:09 rushter

I think I fixed it. Can you please double check by clonning the git repo and installing dev version?

I agree. It works !

NixBiks avatar Sep 21 '22 09:09 NixBiks