lexbor Feature request: Option to skip whitespace nodes

I compared time and memory usage against myhtml by parsing this page 10 times: https://www.dropbox.com/s/cq3zonfmsvrcg4k/5.html.gz?dl=0

myhtml: 10.52s, 200.9Mb
lexbor: 8.75s, 179.9Mb
myhtml(TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN): 6.47s, 140.0Mb

So it would be nice if lexbor could support this option as well.

Dec 05 '19 02:12 kostya

@kostya

I’ll think about how to do it right.

I will update lexbor soon (the internal handling of tag attributes will change), and it will become even faster.

Dec 05 '19 08:12 lexborisov

What about adding a benchmark test to the source?

Dec 05 '19 11:12 vtorri

I will add benchmarks later.

Dec 05 '19 16:12 lexborisov

By the way, why are whitespace nodes needed? I see no reason for them other than serialization.

Dec 09 '19 06:12 kostya

@kostya

For example (try in browser):

text<span> </span>space

or

text<div style="display: inline;"> </div>space

or

text<span>
</span>space

UPDATE: Also, please see textContent.
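
To make the textContent point concrete, here is a minimal sketch in plain C. It does not use lexbor; it just models the three text nodes from the first example as strings and joins them the way a text-content walk would. The names `only_whitespace` and `join_text` are hypothetical helpers made up for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True when s consists only of HTML ASCII whitespace (or is empty). */
static bool
only_whitespace(const char *s)
{
    return s[strspn(s, " \t\n\f\r")] == '\0';
}

/*
 * Concatenate text nodes into out (capacity cap, NUL-terminated),
 * optionally dropping whitespace-only nodes. Stops early if the next
 * node would not fit. Returns out.
 */
static char *
join_text(const char *const *nodes, size_t count, bool skip_ws,
          char *out, size_t cap)
{
    size_t used = 0;

    if (cap == 0) {
        return out;
    }
    out[0] = '\0';

    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(nodes[i]);

        if (skip_ws && only_whitespace(nodes[i])) {
            continue;
        }
        if (used + len >= cap) {
            break;
        }
        memcpy(out + used, nodes[i], len + 1);
        used += len;
    }

    return out;
}
```

With the nodes "text", " ", "space", joining with `skip_ws` set to false keeps the separator between the two words, while setting it to true fuses them into one word, which is the difference visible in the browser.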

Dec 09 '19 07:12 lexborisov

If I understand correctly, any whitespace-only string can be replaced with a single space (which could be a separate node type with no text storage).

Dec 17 '19 18:12 kostya

This will break many tests. But even that doesn’t matter. The bottom line is that this will not give an increase in speed. The cost is in creating the token and going through the "circle of hell" of building the tree.

Dec 17 '19 18:12 lexborisov

Each user can implement this themselves by modifying the tokenizer callback.

Something like:

lxb_html_tokenizer_callback_token_done_set(tkz, blah_blah_callback, tree);

static lxb_html_token_t *
blah_blah_callback(lxb_html_tokenizer_t *tkz,
                   lxb_html_token_t *token, void *ctx)
{
    lxb_status_t status;

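    if (token->tag_id == LXB_TAG__TEXT) {
        /*
         * Here we check whether the text is whitespace-only; `ws` stands
         * for the result of that check and must be computed by the user.
         */
        if (ws == true) {
            /* Skip the token: it never reaches the tree builder. */
            return token;
        }
    }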

    status = lxb_html_tree_insertion_mode(ctx, token);
    if (status != LXB_STATUS_OK) {
        tkz->status = status;
        return NULL;
    }

    return token;
}
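
The whitespace check itself is left open in the snippet above. One possible way to fill it in is sketched below. The helper name `is_whitespace_only` is hypothetical and not part of lexbor; it treats the five HTML ASCII whitespace characters (TAB, LF, FF, CR, SPACE) as whitespace and works on a plain byte range, so it can be compiled and tested on its own.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true when every byte in [begin, end) is HTML ASCII whitespace
 * (TAB, LF, FF, CR, or SPACE). An empty range also counts as
 * whitespace-only. Hypothetical helper, not a lexbor function.
 */
static bool
is_whitespace_only(const unsigned char *begin, const unsigned char *end)
{
    for (const unsigned char *p = begin; p < end; p++) {
        switch (*p) {
            case 0x09: /* TAB */
            case 0x0A: /* LF */
            case 0x0C: /* FF */
            case 0x0D: /* CR */
            case 0x20: /* SPACE */
                break;

            default:
                return false;
        }
    }

    return true;
}
```

Inside the callback, `ws` could then be set from the token's text range, for example `ws = is_whitespace_only(token->text_start, token->text_end);`. This assumes the token exposes its text as a start/end pointer pair under those field names, so check the `lxb_html_token_t` definition in your lexbor version first. Keep in mind that dropping these tokens changes what textContent returns, as shown earlier in the thread.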

Aug 19 '23 21:08 lexborisov