Text nodes not displayed with `deep=True`

Open HugoLaurencon opened this issue 2 years ago • 1 comments

Hello, I am testing this toy example:

from selectolax.parser import HTMLParser

html_str = """
<html>
<body>
<div>this is a test
    <h1>Heading</h1>
</div>
</body>
</html>
"""

selectolax_tree = HTMLParser(html_str)
for node in selectolax_tree.root.traverse(include_text=True):
    print(f"Node tag: {node.tag}")
    if node.tag == "-text":
        print(f"Node text: {node.text(deep=True)}")
    print("-------")

which outputs:

Node tag: html
-------
Node tag: head
-------
Node tag: body
-------
Node tag: -text
Node text: 
-------
Node tag: div
-------
Node tag: -text
Node text: 
-------
Node tag: h1
-------
Node tag: -text
Node text: Heading
-------
Node tag: -text
Node text: 

-------
Node tag: -text
Node text: 



-------

so the text node `this is a test` is not displayed.

If I call `node.text(deep=False)` instead, the text `this is a test` is displayed.

The problem disappears if I remove the `h1` tag: the text `this is a test` is then displayed with both `deep=True` and `deep=False`.
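
A possible workaround, based only on the observation above that `deep=False` returns the expected text: a text node has no children, so shallow extraction should be enough for it. The helper name `node_text` and the `FakeTextNode` class are hypothetical. The stand-in only mimics the behaviour reported here so that the sketch runs without selectolax installed; with the real library you would pass the nodes yielded by `traverse(include_text=True)`.

```python
def node_text(node):
    """Return a node's text, using shallow extraction for text nodes."""
    if node.tag == "-text":
        # Text nodes have no children, so deep=False should lose nothing.
        return node.text(deep=False)
    return node.text(deep=True)


class FakeTextNode:
    """Hypothetical stand-in that mimics the behaviour reported in this issue."""

    tag = "-text"

    def text(self, deep=True):
        # Assumption: deep=True yields an empty string, as in the output above,
        # while deep=False yields the expected text.
        return "" if deep else "this is a test"


print(repr(node_text(FakeTextNode())))
```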

Any idea why?

HugoLaurencon Jul 12 '22 20:07

I've pushed a fix, but it needs more tests since it can alter behaviour for other use cases. Basically, deep extraction was not working when:

  1. We start from a text node
  2. There is no child node
  3. There is a next node (the h1 tag in your case).
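
The three conditions above can be sketched as a small predicate. This is only an illustration, not the code of the fix: `StubNode` and `hits_deep_text_case` are hypothetical names, and the stub assumes the attribute names `tag`, `child` and `next` that selectolax nodes expose, so the same check should also work on real nodes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubNode:
    """Hypothetical stand-in using the attribute names of a selectolax node."""

    tag: str
    child: Optional["StubNode"] = None
    next: Optional["StubNode"] = None


def hits_deep_text_case(node) -> bool:
    """Return True when a node matches all three conditions listed above."""
    return (
        node.tag == "-text"        # 1. we start from a text node
        and node.child is None     # 2. there is no child node
        and node.next is not None  # 3. there is a next node
    )


h1 = StubNode(tag="h1")
text_before_h1 = StubNode(tag="-text", next=h1)  # "this is a test" followed by <h1>
lone_text = StubNode(tag="-text")                # the same text with <h1> removed

print(hits_deep_text_case(text_before_h1))  # True
print(hits_deep_text_case(lone_text))       # False
```

This matches the report: the text node is affected only while the `h1` element follows it.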

rushter Jul 26 '22 14:07

I've made a new release.

rushter Aug 04 '22 11:08