
How to do a strict elements deserialization

Open latot opened this issue 8 months ago • 3 comments

Hi! This project is nice!

I have been looking through it, and I can't find a way (maybe one doesn't exist) to perform deserialization more strictly.

Currently we can use selectors and other features, but sometimes we also want to verify that the actual structure, or even the attributes, are the expected ones, so we can tell when something in a page has changed.

Picking the basic example:

#[macro_use]
extern crate unhtml_derive;
extern crate unhtml;
use unhtml::{self, FromHtml};

#[derive(FromHtml)]
#[html(selector = "#test")]
struct SingleUser {
    #[html(selector = "p:nth-child(1)", attr = "inner")]
    name: String,

    #[html(selector = "p:nth-child(2)", attr = "inner")]
    age: u8,

    #[html(selector = "p:nth-child(3)", attr = "inner")]
    like_lemon: bool,
}

let user = SingleUser::from_html(r#"<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <div id="test">
        <div>
            <p>Hexilee</p>
            <p>20</p>
            <p>true</p>
        </div>
    </div>
</body>
</html>"#).unwrap();
assert_eq!("Hexilee", &user.name);
assert_eq!(20, user.age);
assert!(user.like_lemon);

It would be nice to be able to deserialize something like:

#[derive(FromHtml)]
#[html(selector = "div div.#test")]
// This requires that exactly one element matches, and that it is a div
#[html(tag = "div")]
struct SingleUser {
    // The first element must be an <a>; read it
    #[html(tag = "a")]
    // Save this attribute in the field
    #[html(attr = "inner")]
    name: String,

    // The second element must be an <a>; read it
    #[html(tag = "a")]
    // Read the inner attribute and check that the value is 20
    #[html(attr = "inner", expect = 20)]
    age: u8,

    // The third element must be an <a>
    // If there is a fourth <a>, this should fail
    #[html(tag = "a")]
    #[html(attr = "inner")]
    like_lemon: bool,
}

As you can see, I invented some parts. The idea is to be able to deserialize in a way where the format we specify must match the HTML. Not everything has to be strict, but it would be nice to be able to do something like this when needed.

Some way to check which elements must be present, with which tags, which attributes, and so on.

Thx!

latot avatar Apr 17 '25 19:04 latot

Thanks for your proposal!

Validation is an important feature; it would improve the stability of our parser. However, validating metadata or data inside the proc macro is not flexible enough. Checking only the tag is simple, but checking values or other attributes is complex.

I think this crate could provide metadata in the future, so that users can write their own validators, like this:

#[derive(FromHtml)]
#[html(selector = "#test")]
struct SingleUser {
    #[html(metadata)]
    meta: DomMeta,

    #[html(selector = "p:nth-child(1)")]
    name: DomNode<String>,

    #[html(selector = "p:nth-child(2)")]
    age: DomNode<u8>,

    #[html(selector = "p:nth-child(3)")]
    like_lemon: DomNode<bool>,
}

// Custom validator written by the user
// (assumes `use anyhow::{bail, Result};`)
impl SingleUser {
    fn validate(&self) -> Result<(&str, u8, bool)> {
        // `bail!` returns early with an `Err`; a bare `anyhow!` only builds the error value
        if self.meta.tag != "div" {
            bail!("user is not a 'div'");
        }
        if self.name.tag != "a" {
            bail!("name is not an 'a'");
        }
        if self.age.tag != "a" || *self.age != 20 {
            bail!("age is not an 'a', or its inner value is not 20");
        }
        if self.like_lemon.tag != "a" {
            bail!("like_lemon is not an 'a'");
        }
        Ok((self.name.as_str(), *self.age, *self.like_lemon))
    }
}

let user = SingleUser::from_html(r#"<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <div id="test">
        <div>
            <p>Hexilee</p>
            <p>20</p>
            <p>true</p>
        </div>
    </div>
</body>
</html>"#).unwrap();
let (name, age, like_lemon) = user.validate().unwrap();
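
To make the idea more concrete, here is a rough sketch of how a wrapper like `DomNode<T>` could behave. Neither `DomNode` nor `DomMeta` exists in the crate yet, so the field names (`tag`, `value`) and the `expect_tag` helper are assumptions made purely for illustration. The wrapper keeps the tag name next to the parsed value and derefs to the value, which is what makes expressions like `*self.age` in `validate` work.

```rust
use std::ops::Deref;

/// Hypothetical wrapper: the parsed value plus the tag name of the
/// element it was read from. Not part of unhtml today.
#[derive(Debug)]
struct DomNode<T> {
    tag: String,
    value: T,
}

impl<T> Deref for DomNode<T> {
    type Target = T;

    // Lets callers write `*node` to reach the parsed value.
    fn deref(&self) -> &T {
        &self.value
    }
}

/// Hypothetical helper: returns the parsed value only if the element
/// had the expected tag name, otherwise an error message.
fn expect_tag<'a, T>(node: &'a DomNode<T>, tag: &str) -> Result<&'a T, String> {
    if node.tag == tag {
        Ok(&**node) // deref through the wrapper to the value
    } else {
        Err(format!("expected <{}>, found <{}>", tag, node.tag))
    }
}
```

With something like this, each check in `validate` reduces to one `expect_tag(&self.age, "a")?` call followed by any value comparison the user wants.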

Hexilee avatar Apr 18 '25 03:04 Hexilee

:O Thx!

That helps to validate some structures. There are two aspects I'm still wondering about.

How can I make it fail with:

<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <div id="test">
        <div>
            <p>Hexilee</p>
            <p>20</p>
            <p>true</p>
            <p>other thing</p>
        </div>
    </div>
</body>
</html>

The current solution picks children 1 to 3, which is right, but how can we require that only those three exist? The example above has 4, so we need a way to validate the structure, at least roughly.

Another thing, which IIRC is not in the docs: how can we handle strings? Bare strings like:

<p>true</p>
String
<p>other thing</p>

These are not elements, they are text nodes, and today there is no way to use selectors on them with scraper. This may be related to the README not explaining how #[html()] works without a selector: will it pick a node and then continue to the next one?

Thx!

latot avatar Apr 18 '25 20:04 latot

Sorry for my late reply.

For the first case, the additional p may not be a direct child of our target div, and it's complex to provide a general validator that also covers grandchildren or great-grandchildren. So I think we should add an additional field, like:

#[derive(FromHtml)]
#[html(selector = "#test")]
struct SingleUser {
    #[html(metadata)]
    meta: DomMeta,

    #[html(selector = "p", attr = "inner")]
    all_text_nodes: Vec<String>,

    #[html(selector = "p:nth-child(1)")]
    name: DomNode<String>,

    #[html(selector = "p:nth-child(2)")]
    age: DomNode<u8>,

    #[html(selector = "p:nth-child(3)")]
    like_lemon: DomNode<bool>,
}

Then we can validate the length of all_text_nodes.
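
As a minimal sketch of that length check: the `expect_count` helper below is hypothetical and not part of the crate. It only assumes that the `Vec<String>` field ends up holding one entry per matched `p`. Since the `p` selector has no child combinator, I would expect it to count matches at any depth under `#test`, not only direct children.

```rust
/// Hypothetical helper: fails unless exactly `expected` elements were matched.
fn expect_count(matched: &[String], expected: usize) -> Result<(), String> {
    if matched.len() == expected {
        Ok(())
    } else {
        Err(format!(
            "expected {} <p> elements, found {}",
            expected,
            matched.len()
        ))
    }
}
```

Inside `validate` this would be called as `expect_count(&self.all_text_nodes, 3)`, so the page with the fourth `p` is rejected.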

For the second case, a bare text part belongs to the inner value of its parent node, so we should select the parent and use attr = "inner" to get its value.

Hexilee avatar May 05 '25 09:05 Hexilee