selectolax
Return an error when trying to decompose a node with `html` tag
Hello, I believe it would be beneficial to raise an error when `node.decompose()` is called on a node that has the `html` tag. Currently, the procedure freezes and does not produce any error message.
When processing a large number of HTML files, there are often some atypical ones where the attribute we use to select nodes for removal sits directly on the main `html` node. Without any error message, it is hard to identify the source of the problem, which makes debugging difficult.
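Until the library raises such an error itself, a caller-side guard could surface the problem. This is only a minimal sketch: `decompose_or_raise` is a hypothetical helper, not part of selectolax, and it assumes nodes expose a `tag` attribute and a `decompose()` method.

```python
def decompose_or_raise(node):
    """Decompose `node`, but refuse to remove the root <html> element.

    Hypothetical helper: assumes `node` has a `tag` attribute and a
    `decompose()` method, as selectolax nodes do.
    """
    if node.tag == "html":
        # Fail loudly instead of silently entering a state that may hang later.
        raise ValueError("refusing to decompose the root <html> node")
    node.decompose()
```

Calling it in place of `node.decompose()` would make the atypical files fail with a traceback that points at the offending document.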
Please provide an example that hangs.
import selectolax.parser
html = "<body><div></div></body>"
html_parser = selectolax.parser.HTMLParser(html)
print(html_parser.root.decompose())
Works fine for me (`.root` is the `<html>` element).
Sure, here is my example that results in an infinite loop:
from selectolax.parser import HTMLParser
html_str = """
<!DOCTYPE html>
<html class="site-info">
</html>
"""
def _remove_nodes_matching_css_rules(selectolax_tree):
    modification = True
    while modification:
        found_a_node = False
        for node in selectolax_tree.css("[class~='site-info']"):
            node.decompose()
            found_a_node = True
            break
        if not found_a_node:
            modification = False
    return selectolax_tree
selectolax_tree = HTMLParser(html_str)
selectolax_tree = _remove_nodes_matching_css_rules(
    selectolax_tree=selectolax_tree,
)
Actually, you're right that we can decompose the `html` node. But then there is an infinite loop, because I think the attributes of the `html` node are kept after calling `decompose()`, so the selector keeps matching it.
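As a stopgap, the loop from the example above can be bounded so that it raises instead of spinning forever. This is a rough sketch under assumptions: `remove_matching` and `max_passes` are hypothetical names, and the only tree and node methods it relies on are `css()` and `decompose()`, as used in the example.

```python
def remove_matching(tree, selector, max_passes=10_000):
    """Decompose matching nodes one per pass, re-querying the tree each time.

    Hypothetical helper: raises RuntimeError if the selector still matches
    after `max_passes` passes, which is what happens when a decomposed node
    (such as the root <html>) keeps being returned by `css()`.
    """
    for _ in range(max_passes):
        # Take the first match only, mirroring the `break` in the original loop.
        node = next(iter(tree.css(selector)), None)
        if node is None:
            return tree
        node.decompose()
    raise RuntimeError(
        f"selector {selector!r} still matches after {max_passes} passes; "
        "a decomposed node is probably still attached to the tree"
    )
```

With the document from the example, `remove_matching(selectolax_tree, "[class~='site-info']")` would be expected to raise rather than hang, assuming the behaviour described above.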
Your code will work if you switch to the `lexbor` backend.
Why do you need to remove the `html` tag? It's essential for any document and gets created automatically even if you don't provide it.
@lexborisov What's the best way to handle this? I'm using `myhtml_tree_node_remove`. Can I just set `myhtml_tree_t->node_html` manually? The result of the removal is not propagated to the `myhtml_tree_t` structure, because we are trying to remove the root node and `myhtml_tree_t` already points to it.
@rushter
> What's the best way to handle this?
I don't understand why the `html` node has to be removed at all. But never mind.
> Can I just set `myhtml_tree_t->node_html` manually?
Yes, you can.