Function to extract every tag available

Open noctera opened this issue 4 years ago • 1 comments

Hello, my problem is that I don't know the exact structure of the HTML files in advance, so I can't spell out the whole path with .back().

Let's say I have this HTML file:

<!DOCTYPE html>
<html>
    <body>
        <h1>My First Heading</h1>
        <div>
            <p>My first paragraph.</p>
        </div>
        <p>Another p-Tag</p>
    </body>
</html>

I want to extract every p tag. This works as long as they are all inside the same parent tag. But when I use

for (auto& p : doc.node()["body"].back()["p"])
{
    std::cout << p.front().text() << std::endl;
}

it does not find the p tag inside the div element. So my question is whether this library has a function that finds every p tag regardless of how deeply it is nested. Unfortunately there is almost no documentation, which makes the library a little hard to use :)

Have a nice day, Noctera

Jul 31 '21 09:07 noctera

Hello @noctera, I'm extremely happy to learn that someone is trying to use my projects. 🤣

Unfortunately, there is no such feature at the moment. This library was originally written for my personal use, but even I have given up on parsing HTML with C++, so the library is no longer actively maintained.

However, if you still want to use this library, I suggest implementing the feature yourself, since the library gives you the freedom to do that. The simplest implementation would be

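// Call `f` on `node` and then, recursively, on every node below it (pre-order).
template <typename F>
void for_each(html_node const& node, F&& f)
{
    f(node);
    // Recurse into the children only for nodes of type html_node_type::node.
    if (node.type() == html_node_type::node)
    {
        for (auto& n : node)
        {
            // Pass `f` as an lvalue; forwarding it again on every iteration is unnecessary.
            for_each(n, f);
        }
    }
}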

and then

for_each(doc.node()["body"].back(), [](html_node const& n) {
    if (n.tag() == "p")
        std::cout << n.front().text() << std::endl;
});
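
For anyone who wants to try the recursive idea without the library, here is a minimal, self-contained sketch. It assumes a hypothetical `Node` struct in place of the library's `html_node`. All the names in it (`Node`, `text_node`, `element`, `visit_all`, `collect_texts`, `sample_body`, `print_all_paragraphs`) are invented for this illustration and are not part of the library's API. The tree is built by hand to mirror the body of the HTML file from the question.

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parsed HTML node; NOT the library's html_node.
struct Node {
    enum class Kind { element, text };
    Kind kind;
    std::string tag;             // element name, empty for text nodes
    std::string text;            // text content, empty for elements
    std::vector<Node> children;  // child nodes, empty for text nodes
};

// Small helpers (hypothetical) to build a tree by hand.
Node text_node(std::string s) {
    return Node{Node::Kind::text, "", std::move(s), {}};
}
Node element(std::string tag, std::vector<Node> children) {
    return Node{Node::Kind::element, std::move(tag), "", std::move(children)};
}

// Pre-order traversal: call f on the node, then on every descendant.
template <typename F>
void visit_all(const Node& node, F&& f) {
    f(node);
    for (const Node& child : node.children) {
        visit_all(child, f);  // f is passed on as an lvalue
    }
}

// Collect the direct text of every element named `tag`, at any depth.
std::vector<std::string> collect_texts(const Node& root, const std::string& tag) {
    std::vector<std::string> out;
    visit_all(root, [&](const Node& n) {
        if (n.kind == Node::Kind::element && n.tag == tag) {
            std::string s;
            for (const Node& c : n.children) {
                if (c.kind == Node::Kind::text) s += c.text;
            }
            out.push_back(s);
        }
    });
    return out;
}

// The <body> of the HTML file from the question, built manually.
Node sample_body() {
    return element("body", {
        element("h1", {text_node("My First Heading")}),
        element("div", {element("p", {text_node("My first paragraph.")})}),
        element("p", {text_node("Another p-Tag")}),
    });
}

// Example usage: print the text of every p element, nested or not.
void print_all_paragraphs() {
    std::vector<std::string> found = collect_texts(sample_body(), "p");
    assert(found.size() == 2);  // one p inside the div, one directly in body
    for (const std::string& s : found) {
        std::cout << s << '\n';
    }
}
```

Because the traversal visits a node before its children, the matches come out in document order, so the p inside the div is reported before the one that follows it in body.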

Sep 24 '21 09:09 Berrysoft