
Return an error when trying to decompose a node with `html` tag

Open HugoLaurencon opened this issue 1 year ago • 4 comments

Hello, I believe it would be beneficial to generate an error message when we execute the command node.decompose() on a node that has the html tag. Currently, the procedure freezes and does not produce any error messages.

When processing a large number of HTML files, a few atypical ones carry the attribute we use to select nodes for removal directly on the root html node. Without an error message, it is hard to identify the source of the problem, which makes debugging difficult.

HugoLaurencon • Apr 28 '23 00:04

Please provide an example that hangs.

import selectolax.parser

html = "<body><div></div></body>"
html_parser = selectolax.parser.HTMLParser(html)
print(html_parser.root.decompose())

Works fine for me (.root is the <html> element).

rushter • Apr 29 '23 16:04

Sure, here is my example that results in an infinite loop

from selectolax.parser import HTMLParser

html_str = """
<!DOCTYPE html>
<html class="site-info">
</html>
"""

def _remove_nodes_matching_css_rules(selectolax_tree):
    modification = True
    while modification:
        found_a_node = False
        for node in selectolax_tree.css("[class~='site-info']"):
            node.decompose()
            found_a_node = True
            break
        if not found_a_node:
            modification = False
    return selectolax_tree


selectolax_tree = HTMLParser(html_str)

selectolax_tree = _remove_nodes_matching_css_rules(
    selectolax_tree=selectolax_tree,
)

Actually, you're right that we can decompose the html node. The infinite loop comes afterwards: I think the attributes of the html node are kept after calling decompose, so the selector keeps matching it.

HugoLaurencon • Apr 29 '23 22:04
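
Until the library handles this case, the infinite loop described above could be avoided on the caller's side. The following is only a minimal sketch, not part of selectolax: `remove_nodes_matching` and its `protected_tags` parameter are hypothetical names introduced here. It assumes only that the tree exposes `css()` returning a list of nodes and that each node has a `tag` attribute and a `decompose()` method, as in the example above. It keeps the one-removal-per-query pattern but raises an error when the selector matches the html node.

```python
def remove_nodes_matching(tree, selector, protected_tags=("html",)):
    """Decompose every node matching `selector`, one node per query.

    Hypothetical helper: raises ValueError instead of looping forever
    when a match has a protected tag such as <html>.
    Returns the number of nodes removed.
    """
    removed = 0
    while True:
        # Re-run the query after each removal, as in the original loop,
        # so we never touch a node that was removed along with its parent.
        matches = tree.css(selector)
        if not matches:
            return removed
        node = matches[0]
        if node.tag in protected_tags:
            # Assumption: decomposing this node would not make the
            # selector stop matching, so fail loudly instead.
            raise ValueError(
                f"selector {selector!r} matches <{node.tag}>; "
                "refusing to decompose it"
            )
        node.decompose()
        removed += 1
```

Called as `remove_nodes_matching(HTMLParser(html_str), "[class~='site-info']")` on the document above, this should raise a `ValueError` naming the html tag rather than hang, which makes the offending file easy to spot in a large batch.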

Your code will work if you switch to the lexbor backend.

Why do you need to remove the html tag? It's essential for any document and gets automatically created even if you don't provide it.

@lexborisov What's the best way to handle this? I'm using myhtml_tree_node_remove. Can I just set myhtml_tree_t->node_html manually? The result of remove is not propagated to the myhtml_tree_t structure because we are trying to remove the root node and myhtml_tree_t points to the root node already.

rushter • Apr 30 '23 10:04

@rushter

> What's the best way to handle this?

I don't understand why the html node needs to be removed at all, but never mind.

> Can I just set myhtml_tree_t->node_html manually?

Yes, you can.

lexborisov • Apr 30 '23 13:04