selectolax
Text nodes not displayed with `deep=True`
Hello, I am testing this toy example:
from selectolax.parser import HTMLParser
html_str = """
<html>
<body>
<div>this is a test
<h1>Heading</h1>
</div>
</body>
</html>
"""
selectolax_tree = HTMLParser(html_str)
for node in selectolax_tree.root.traverse(include_text=True):
    print(f"Node tag: {node.tag}")
    if node.tag == "-text":
        print(f"Node text: {node.text(deep=True)}")
    print("-------")
which outputs
Node tag: html
-------
Node tag: head
-------
Node tag: body
-------
Node tag: -text
Node text:
-------
Node tag: div
-------
Node tag: -text
Node text:
-------
Node tag: h1
-------
Node tag: -text
Node text: Heading
-------
Node tag: -text
Node text:
-------
Node tag: -text
Node text:
-------
so the text node `this is a test` is not displayed.
If I instead write `node.text(deep=False)`, the text `this is a test` is displayed.
This behavior does not occur if I remove the `h1` tag: the text `this is a test` is then displayed either way, with `deep=True` or `deep=False`.
Any idea why?
I've pushed a fix, but it needs more tests since it can alter behaviour for other use cases. Basically, deep extraction was not working when all of the following hold:
- The extraction starts from a text node
- That node has no child node
- That node has a next node (the `h1` tag in your case).
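To make the three conditions concrete, here is a minimal pure-Python sketch of the behaviour one would expect from deep extraction in that situation. It is a toy model only: `ToyNode` and `deep_text` are hypothetical names made up for illustration, and this is not selectolax's actual implementation or the code of the fix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a parsed node; not selectolax's Node class."""
    tag: str
    data: str = ""                      # text payload, used only by "-text" nodes
    child: Optional["ToyNode"] = None   # first child
    next: Optional["ToyNode"] = None    # next sibling


def deep_text(node: ToyNode) -> str:
    """Return the node's own text plus the text of all of its descendants."""
    parts = []
    if node.tag == "-text":
        parts.append(node.data)
    # Walk the children only. The start node's own siblings (node.next)
    # are deliberately not visited, so a sibling cannot affect the result.
    child = node.child
    while child is not None:
        parts.append(deep_text(child))
        child = child.next
    return "".join(parts)


# Mirror the <div> from the example: a text node followed by an <h1>.
heading = ToyNode("h1", child=ToyNode("-text", data="Heading"))
text = ToyNode("-text", data="this is a test\n", next=heading)
div = ToyNode("div", child=text)

print(repr(deep_text(text)))  # 'this is a test\n'
print(repr(deep_text(div)))   # 'this is a test\nHeading'
```

Here `text` matches all three conditions (a text node, no child, a next node), and the expected result is still its own text.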
I've made a new release.
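For anyone pinned to a release without the fix, a possible workaround follows from the observation above that `deep=False` does return the text when called on a text node. The helper below is only a sketch: `node_text` is a hypothetical name, not part of selectolax, and it assumes nothing about the node beyond the `tag` attribute and the `text(deep=...)` method used in the example.

```python
def node_text(node, deep=True):
    """Hypothetical helper: return a node's text, avoiding deep extraction
    when starting from a text node.

    A text node has no children, so deep=False loses nothing for it.
    """
    if node.tag == "-text":
        return node.text(deep=False)
    return node.text(deep=deep)
```

In the loop from the example, `node_text(node)` would take the place of `node.text(deep=True)`.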