nokolexbor

Update Lexbor

Open zyc9012 opened this issue 1 month ago • 4 comments

There are two issues with updating the Lexbor commit in our repo:

  • Performance is degraded in newer versions of Lexbor, see benchmarks
  • It becomes difficult to implement the ::text selector in newer versions of Lexbor

zyc9012, Nov 12 '25 09:11

Hi @zyc9012

> There are two issues with updating the Lexbor commit in our repo:
>
>   • Performance is degraded in newer versions of Lexbor, see benchmarks

This needs to be looked into; a slowdown like that is strange.

>   • It becomes difficult to implement the ::text selector in newer versions of Lexbor

And what does the pseudo selector ::text do?

lexborisov, Nov 12 '25 11:11

> And what does the pseudo selector ::text do?

Basically, it selects the text node.

For example:

<div>text1<span>text2</span></div>

div > ::text selects the "text1" node.

In previous versions, I was able to patch Lexbor code to support this. But with the new version, I haven't figured out the right way to patch it.
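For readers unfamiliar with the idea, here is a minimal sketch of what `div > ::text` is meant to return, written against a hypothetical toy DOM rather than Lexbor's real types. All names here (`toy_node_t`, `toy_child_text`, `toy_example_div`) are assumptions made for illustration only; the point is that only direct text children of the matched element are collected, while text nested in child elements is skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy DOM, not Lexbor's types: just enough to show the idea. */
typedef enum { TOY_ELEMENT, TOY_TEXT } toy_type_t;

typedef struct toy_node {
    toy_type_t       type;
    const char      *value;        /* tag name for elements, content for text */
    struct toy_node *first_child;
    struct toy_node *next;         /* next sibling */
} toy_node_t;

/*
 * Emulate "parent > ::text": store the content of the direct text children
 * of `parent` in `out` (at most `max` entries) and return how many were
 * stored. Text nested inside child elements is deliberately skipped.
 */
static size_t
toy_child_text(const toy_node_t *parent, const char **out, size_t max)
{
    size_t n = 0;

    assert(parent != NULL && out != NULL);

    for (const toy_node_t *c = parent->first_child; c != NULL; c = c->next) {
        if (c->type == TOY_TEXT && n < max) {
            out[n++] = c->value;
        }
    }

    return n;
}

/* Build <div>text1<span>text2</span></div> and return the <div>. */
static const toy_node_t *
toy_example_div(void)
{
    static toy_node_t text2 = { TOY_TEXT,    "text2", NULL,   NULL  };
    static toy_node_t span  = { TOY_ELEMENT, "span",  &text2, NULL  };
    static toy_node_t text1 = { TOY_TEXT,    "text1", NULL,   &span };
    static toy_node_t div   = { TOY_ELEMENT, "div",   &text1, NULL  };

    return &div;
}
```

Calling `toy_child_text(toy_example_div(), out, 4)` stores only "text1", because "text2" belongs to the nested span and is not a direct child of the div.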

zyc9012, Nov 12 '25 11:11

> Basically, it selects the text node.
>
> For example:
>
> <div>text1<span>text2</span></div>
>
> div > ::text selects the "text1" node.
>
> In previous versions, I was able to patch Lexbor code to support this. But with the new version, I haven't figured out the right way to patch it.

In general, this contradicts the specification. Selectors work with elements (ELEMENT_NODE), and text nodes have a different type (TEXT_NODE). Text nodes cannot have attributes. This means that such a pseudo-selector breaks the logic of the query itself.

For example:

div > ::text.myhome

or

div > ::text span

In the first example, we are trying to check the class="myhome" attribute on a text node, which is contradictory. In the second, we are looking for a span descendant of a text node, but text nodes cannot have children. If we follow this logic, then such contradictions have to be identified at the selector parsing stage and reported as parsing errors. This would not only contradict the specification, but also complicate the code and significantly affect performance.
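The contradiction can be shown with a small sketch. It uses a hypothetical toy node type, not Lexbor's structures, and it simplifies the class attribute to a single token (real class matching splits the attribute on whitespace). The names `toy_item_t` and `toy_matches_class` are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy node kinds mirroring ELEMENT_NODE and TEXT_NODE. */
typedef enum { TOY_ELEMENT_NODE, TOY_TEXT_NODE } toy_kind_t;

typedef struct {
    toy_kind_t  kind;
    const char *class_attr;  /* meaningful for elements only; NULL if absent */
    const char *text;        /* meaningful for text nodes only */
} toy_item_t;

/*
 * One class selector step (".name"). Only elements carry attributes, so a
 * text node can never satisfy it, which is why "::text.myhome" cannot match.
 */
static bool
toy_matches_class(const toy_item_t *node, const char *name)
{
    assert(node != NULL && name != NULL);

    if (node->kind != TOY_ELEMENT_NODE || node->class_attr == NULL) {
        return false;
    }

    return strcmp(node->class_attr, name) == 0;
}
```

With an element whose class is "myhome" the step matches; with any text node it returns false regardless of the name, so a compound such as `::text.myhome` is a selector that can never select anything.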

lexborisov, Nov 12 '25 14:11

> In general, this contradicts the specification.

Yes, we know. We are not requiring it to be implemented in Lexbor, but we patched Lexbor to support it in Nokolexbor. This is a solid need in web scraping; without this feature, many extractions won't work.

Previously, we were able to patch Lexbor like this: https://github.com/serpapi/nokolexbor/blob/master/patches/0001-lexbor-support-text-pseudo-element.patch

But with the latest Lexbor code, it doesn't seem easy to do so.

zyc9012, Nov 13 '25 01:11

@zyc9012

Here is a patch that adds ::text. However, I strongly recommend using other approaches.

Update (new patch version):

v2-0001-Selectos-experiment-addition-text.patch

#include <stdio.h>   /* printf */
#include <stdlib.h>  /* EXIT_SUCCESS, EXIT_FAILURE */

#include <lexbor/html/html.h>
#include <lexbor/css/css.h>
#include <lexbor/selectors/selectors.h>


lxb_status_t
callback(const lxb_char_t *data, size_t len, void *ctx)
{
    printf("%.*s", (int) len, (const char *) data);

    return LXB_STATUS_OK;
}

lxb_status_t
find_callback(lxb_dom_node_t *node, lxb_css_selector_specificity_t spec,
              void *ctx)
{
    unsigned *count = ctx;

    (*count)++;

    printf("%u) ", *count);
    (void) lxb_html_serialize_cb(node, callback, NULL);
    printf("\n");

    return LXB_STATUS_OK;
}

int
main(int argc, const char *argv[])
{
    unsigned count = 0;
    lxb_status_t status;
    lxb_dom_node_t *body;
    lxb_selectors_t *selectors;
    lxb_html_document_t *document;
    lxb_css_parser_t *parser;
    lxb_css_selector_list_t *list;

    /* HTML Data. */

    static const lxb_char_t html[] =
    "<p>"
    "    <span id=s1 span=1>A</span>"
    "    <span id=s2 span=2>B</span>"
    "    <span id=s3 span=3><span>X</span></span>"
    "    <span id=s4 span=4>C</span>"
    "    <span id=s5 span=5>D</span>"
    "</p>";

    /* CSS Data. */

    static const lxb_char_t slctrs[] = "p > span > ::text";

    /* Create HTML Document. */

    document = lxb_html_document_create();
    status = lxb_html_document_parse(document, html,
                                     sizeof(html) / sizeof(lxb_char_t) - 1);
    if (status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Create CSS parser. */

    parser = lxb_css_parser_create();
    status = lxb_css_parser_init(parser, NULL);
    if (status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Selectors. */
    selectors = lxb_selectors_create();
    status = lxb_selectors_init(selectors);
    if (status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Parse the selector list. */

    list = lxb_css_selectors_parse(parser, slctrs,
                                   sizeof(slctrs) / sizeof(lxb_char_t) - 1);
    if (parser->status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Selector List Serialization. */

    printf("Selectors: ");
    (void) lxb_css_selector_serialize_list_chain(list, callback, NULL);
    printf("\n");

    /* Find HTML nodes by CSS Selectors. */

    body = lxb_dom_interface_node(document);

    printf("Found:\n");

    status = lxb_selectors_find(selectors, body, list, find_callback, &count);
    if (status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Destroy Selectors object. */
    (void) lxb_selectors_destroy(selectors, true);

    /* Destroy resources for CSS Parser. */
    (void) lxb_css_parser_destroy(parser, true);

    /*
     * Destroy the CSS selector list together with all memory allocated
     * for it. Alternatively, use lxb_css_memory_destroy(list->memory, true);
     */
    lxb_css_selector_list_destroy_memory(list);

    /* Destroy HTML Document. */
    lxb_html_document_destroy(document);

    return EXIT_SUCCESS;
}

Result:

Selectors: p > span > ::text
Found:
1) A
2) B
3) C
4) D

lexborisov, Nov 19 '25 18:11