Get text of element without children

Open latot opened this issue 1 year ago • 4 comments

Hi all, I was trying to get some text from an element, and I noticed that there seems to be no way to get the inner HTML of an element excluding its children, for example:

<div>
  <p>Hi</p>
  Bye
</div>

Getting the Bye is pretty hard: children are only HTML tags, so Bye can't be detected easily, nor excluded from an iterator.

I don't think the best solution is to just return a String, because:

<div>
  Bye2
  <p>Hi</p>
  Bye
</div>

It would be hard to know how to handle cases like this one. I don't know which solution would be good and clear here. Maybe an iterator over the children in which text nodes are also children, so we can work with them.

Thx!

latot, Aug 12 '24 19:08

<div>
  <p>Hi</p>
  Bye
</div>

parses into the following tree:

Fragment
└── Element(<html>)
    ├── Text("\n        ")
    ├── Element(<div>)
    │   ├── Text("\n        ")
    │   ├── Element(<p>)
    │   │   └── Text("Hi")
    │   └── Text("\n            Bye\n        ")
    └── Text("\n    ")

So if you want Bye, you can just iterate over the children of the <div> element and keep those of type Text.

Something like this should work:

    use scraper::{Html, Node};

    let html = r#"
        <div>
        <p>Hi</p>
            Bye
        </div>
    "#;

    let fragment = Html::parse_fragment(html);
    let bye: Vec<&Node> = fragment.tree
        // Navigate to the node of your interest
        .root()
        .first_child().unwrap()
        .first_child().unwrap()
        .next_sibling().unwrap()
        // Iterate over the children
        .children()
        .filter(|child| matches!(child.value(), Node::Text(_)))
        .map(|noderef| noderef.value())
        .collect();
    println!("{:#?}", bye);

LoZack19, Aug 23 '24 20:08

Weird. Let me look into this sometime next week; when I iterated over the children, there were no Text nodes.

latot, Aug 23 '24 20:08

Of course, don't worry. If you need anything, just write. By the way, the code I gave you is tested, so it should work.

LoZack19, Aug 23 '24 20:08

Note that you can combine the filter and map calls into filter_map(|child| child.value().as_text()) if you only want the Text without the surrounding Node.

adamreichold, Aug 24 '24 06:08

Since this issue seems solved, I will close it.

LoZack19, Sep 07 '24 11:09

@LoZack19 Sorry, I was very busy; I have now been able to check this.

Here is a reprex where this happens:

use scraper::Html;
use scraper::Selector;

fn main() {
    let html = r#"
        <div>
        <p>Hi</p>
            Bye
        </div>
    "#;

    let fragment = Html::parse_fragment(html);
    let selector = Selector::parse("div").unwrap();

    let div = fragment.select(&selector).last().unwrap();

    let childs: Vec<scraper::ElementRef<'_>> = div.child_elements().collect();

    for element in childs {
        println!("{}", element.html())
    }
}

The output is:

<p>Hi</p>

The Text nodes are missing from the children. Maybe I'm confused about something, but I would expect all the nodes to be among the children.

latot, Sep 23 '24 15:09

let childs: Vec<scraper::ElementRef<'_>> = div.child_elements().collect();

You are explicitly collecting only child elements, not nodes. Text is represented by Node::Text, which is explicitly not an Element.

adamreichold, Sep 23 '24 15:09

What would be the actual definition of each one?

If we follow the docs, an Element is "An HTML element.", so Text nodes are not elements, while the relation between Element and tags is not defined. Maybe updating the docs to define Element as an HTML tag element would make this clear.

I think it would be good to clarify these concepts in the main scraper docs, just to give an idea of what each of them represents. Ideally, also include the deref part; it is good to know but not intuitive that *element = node.

These details would help a lot :)

latot, Sep 23 '24 15:09