nokolexbor Update Lexbor

There are two issues with updating the Lexbor commit in our repo:

1. Performance is degraded in newer versions of Lexbor; see benchmarks.
2. It becomes difficult to implement the ::text selector in newer versions of Lexbor.

Nov 12 '25 09:11 zyc9012

Hi @zyc9012

> There are two issues with updating the Lexbor commit in our repo
>
> Performance is degraded in newer versions of Lexbor, see benchmarks

This needs to be investigated; the slowdown is unexpected.

> It becomes difficult to implement the ::text selector in newer versions of Lexbor

And what does the pseudo selector ::text do?

Nov 12 '25 11:11 lexborisov

> And what does the pseudo selector ::text do?

Basically, it selects the text node.

For example:

<div>text1<span>text2</span></div>

div > ::text selects the "text1" node.

In previous versions, I was able to patch Lexbor code to support this. But with the new version, I haven't figured out the right way to patch it.
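
To make the intended semantics concrete, here is a minimal sketch in plain C of what `div > ::text` is expected to return. It uses a toy node structure rather than Lexbor's DOM; `node_t`, `direct_text_children`, and the `example_*` nodes are hypothetical names introduced only for illustration and are not part of Lexbor or Nokolexbor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy DOM (an assumption for illustration, not Lexbor's node type). */
typedef enum { NODE_ELEMENT, NODE_TEXT } node_type_t;

typedef struct node {
    node_type_t  type;
    const char  *data;         /* tag name for elements, characters for text */
    struct node *first_child;
    struct node *next;         /* next sibling */
} node_t;

/* The markup from the thread: <div>text1<span>text2</span></div> */
node_t example_text2 = { NODE_TEXT,    "text2", NULL,           NULL };
node_t example_span  = { NODE_ELEMENT, "span",  &example_text2, NULL };
node_t example_text1 = { NODE_TEXT,    "text1", NULL,           &example_span };
node_t example_div   = { NODE_ELEMENT, "div",   &example_text1, NULL };

/*
 * Collect the text nodes that are direct children of `parent`, i.e. the
 * nodes a hypothetical `parent > ::text` would select. Nested text nodes
 * are skipped on purpose. Returns how many nodes were stored in `out`
 * (never more than `cap`).
 */
size_t
direct_text_children(const node_t *parent, const node_t **out, size_t cap)
{
    size_t n = 0;

    assert(parent != NULL && parent->type == NODE_ELEMENT);

    for (const node_t *c = parent->first_child; c != NULL; c = c->next) {
        if (c->type == NODE_TEXT && n < cap) {
            out[n++] = c;
        }
    }

    return n;
}
```

Calling `direct_text_children(&example_div, found, 4)` stores only the text1 node; text2 is skipped because it is a child of the span, not of the div.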

Nov 12 '25 11:11 zyc9012

> Basically, it selects the text node.
>
> For example:
> <div>text1<span>text2</span></div>
> div > ::text selects the "text1" node.
>
> In previous versions, I was able to patch Lexbor code to support this. But with the new version, I haven't figured out the right way to patch it.

In general, this contradicts the specification. Selectors work with elements (ELEMENT_NODE), and text nodes have a different type (TEXT_NODE). Text nodes cannot have attributes. This means that such a pseudo-selector breaks the logic of the query itself.

For example:

div > ::text.myhome

or

div > ::text span

In the first example, we are trying to check the class="myhome" attribute on a text node, which is contradictory because text nodes have no attributes. In the second, we are looking for a span descendant of a text node, which cannot have children. If we follow this logic, such contradictions would have to be detected at the selector parsing stage and reported as parsing errors. This would not only contradict the specification, but also complicate the code and significantly affect performance.
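
As a rough illustration of the parse-time check described above, the sketch below flags selectors in which anything other than a comma or the end of the input follows `::text`. It works on the raw string, which is a simplifying assumption: a real CSS parser would do this on its token stream, and `text_pseudo_is_terminal` is a hypothetical helper, not a Lexbor function.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical check, not part of Lexbor: returns true when every "::text"
 * in `selector` is the last component of its selector, i.e. it is followed
 * only by optional whitespace and then either a comma or the end of input.
 * Selectors without "::text" are accepted unchanged.
 */
bool
text_pseudo_is_terminal(const char *selector)
{
    static const char pseudo[] = "::text";
    const size_t plen = sizeof(pseudo) - 1;
    const char *p = selector;

    assert(selector != NULL);

    while ((p = strstr(p, pseudo)) != NULL) {
        p += plen;

        /* Skip whitespace after the pseudo-element. */
        const char *q = p;
        while (*q != '\0' && isspace((unsigned char) *q)) {
            q++;
        }

        /* A class, attribute, combinator, or type selector follows. */
        if (*q != '\0' && *q != ',') {
            return false;
        }
    }

    return true;
}
```

With this check, `div > ::text` is accepted, while `div > ::text.myhome` and `div > ::text span` are both rejected.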

Nov 12 '25 14:11 lexborisov

> In general, this contradicts the specification.

Yes, we know. We are not asking for it to be implemented in Lexbor; we patched Lexbor ourselves to support it in Nokolexbor. This is a real need in web scraping: without this feature, many extractions won't work.

Previously, we were able to patch Lexbor like this: https://github.com/serpapi/nokolexbor/blob/master/patches/0001-lexbor-support-text-pseudo-element.patch

But with the latest Lexbor code, it doesn't seem easy to do so.

Nov 13 '25 01:11 zyc9012

@zyc9012

Here is a patch that adds ::text. However, I strongly recommend using other approaches.

Update (new patch version):

v2-0001-Selectos-experiment-addition-text.patch

#include <stdio.h>   /* printf */
#include <stdlib.h>  /* EXIT_SUCCESS, EXIT_FAILURE */

#include <lexbor/html/html.h>
#include <lexbor/css/css.h>
#include <lexbor/selectors/selectors.h>


lxb_status_t
callback(const lxb_char_t *data, size_t len, void *ctx)
{
    printf("%.*s", (int) len, (const char *) data);

    return LXB_STATUS_OK;
}

lxb_status_t
find_callback(lxb_dom_node_t *node, lxb_css_selector_specificity_t spec,
              void *ctx)
{
    unsigned *count = ctx;

    (*count)++;

    printf("%u) ", *count);
    (void) lxb_html_serialize_cb(node, callback, NULL);
    printf("\n");

    return LXB_STATUS_OK;
}

int
main(int argc, const char *argv[])
{
    unsigned count = 0;
    lxb_status_t status;
    lxb_dom_node_t *body;
    lxb_selectors_t *selectors;
    lxb_html_document_t *document;
    lxb_css_parser_t *parser;
    lxb_css_selector_list_t *list;

    /* HTML Data. */

    static const lxb_char_t html[] =
    "<p>"
    "    <span id=s1 span=1>A</span>"
    "    <span id=s2 span=2>B</span>"
    "    <span id=s3 span=3><span>X</span></span>"
    "    <span id=s4 span=4>C</span>"
    "    <span id=s5 span=5>D</span>"
    "</p>";

    /* CSS Data. */

    static const lxb_char_t slctrs[] = "p > span > ::text";

    /* Create HTML Document. */

    document = lxb_html_document_create();
    status = lxb_html_document_parse(document, html,
                                     sizeof(html) / sizeof(lxb_char_t) - 1);
    if (status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Create CSS parser. */

    parser = lxb_css_parser_create();
    status = lxb_css_parser_init(parser, NULL);
    if (status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Selectors. */
    selectors = lxb_selectors_create();
    status = lxb_selectors_init(selectors);
    if (status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Parse the selector list. */

    list = lxb_css_selectors_parse(parser, slctrs,
                                   sizeof(slctrs) / sizeof(lxb_char_t) - 1);
    if (parser->status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Selector List Serialization. */

    printf("Selectors: ");
    (void) lxb_css_selector_serialize_list_chain(list, callback, NULL);
    printf("\n");

    /* Find HTML nodes by CSS Selectors. */

    body = lxb_dom_interface_node(document);

    printf("Found:\n");

    status = lxb_selectors_find(selectors, body, list, find_callback, &count);
    if (status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Destroy Selectors object. */
    (void) lxb_selectors_destroy(selectors, true);

    /* Destroy resources for CSS Parser. */
    (void) lxb_css_parser_destroy(parser, true);

    /* Destroy all memory allocated for the CSS selector list. */
    lxb_css_selector_list_destroy_memory(list);
    /*
     * This destroys all allocated memory.
     * Alternatively, use lxb_css_memory_destroy(list->memory, true);
     */

    /* Destroy HTML Document. */
    lxb_html_document_destroy(document);

    return EXIT_SUCCESS;
}

Result:

Selectors: p > span > ::text
Found:
1) A
2) B
3) C
4) D

Nov 19 '25 18:11 lexborisov