Feature request: Option to skip whitespace nodes

Open kostya opened this issue 6 years ago • 7 comments

I compared time and memory usage against myhtml, parsing this page 10 times: https://www.dropbox.com/s/cq3zonfmsvrcg4k/5.html.gz?dl=0

  • myhtml: 10.52s, 200.9Mb
  • lexbor: 8.75s, 179.9Mb
  • myhtml(TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN): 6.47s, 140.0Mb

So it would be nice if lexbor supported this option too.

kostya avatar Dec 05 '19 02:12 kostya

@kostya

I’ll think about how to do it right.

I will update lexbor soon (the internal handling of tag attributes will change), and it will become even faster.

lexborisov avatar Dec 05 '19 08:12 lexborisov

What about adding a benchmark test to the source?

vtorri avatar Dec 05 '19 11:12 vtorri

I will add them later.

lexborisov avatar Dec 05 '19 16:12 lexborisov

Btw, why are whitespace nodes needed? I see no reason for them other than serialization.

kostya avatar Dec 09 '19 06:12 kostya

@kostya

For example (try in browser):

text<span> </span>space

or

text<div style="display: inline;"> </div>space

or

text<span>
</span>space

UPDATE: and please, see textContent.

lexborisov avatar Dec 09 '19 07:12 lexborisov

If I understand correctly, every kind of whitespace string can be replaced with a single space (which could be a separate node type, without text storage).

kostya avatar Dec 17 '19 18:12 kostya

This will break many tests. But even that doesn’t matter. The bottom line is that this will not give an increase in speed. The cost is in creating the token and going through the "circle of hell" of building the tree.

lexborisov avatar Dec 17 '19 18:12 lexborisov

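Each user can implement this themselves by replacing the tokenizer's token-done callback.

Something like this:

static lxb_html_token_t *
blah_blah_callback(lxb_html_tokenizer_t *tkz,
                   lxb_html_token_t *token, void *ctx)
{
    lxb_status_t status;

    if (token->tag_id == LXB_TAG__TEXT) {
        /*
         * Placeholder: set ws to true when the token's text consists
         * only of whitespace. The check itself is up to the user.
         */
        bool ws = false;

        if (ws == true) {
            /* Skip the token: it never reaches the tree builder. */
            return token;
        }
    }

    /* Every other token goes through normal tree construction. */
    status = lxb_html_tree_insertion_mode(ctx, token);
    if (status != LXB_STATUS_OK) {
        tkz->status = status;
        return NULL;
    }

    return token;
}

/* During setup, before parsing: register the callback, with the tree as ctx. */
lxb_html_tokenizer_callback_token_done_set(tkz, blah_blah_callback, tree);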

lexborisov avatar Aug 19 '23 21:08 lexborisov
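
The whitespace check that the callback above leaves open could be sketched roughly as follows. This is only an illustration, not part of lexbor: the helper name is_html_whitespace_only is hypothetical, and it assumes the token text is available as a begin/end pair of byte pointers. It treats the five ASCII whitespace characters defined by the HTML standard (tab, line feed, form feed, carriage return, space) as whitespace.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a lexbor function).
 *
 * Returns true if every byte in [begin, end) is HTML ASCII whitespace:
 * tab, line feed, form feed, carriage return, or space.
 * An empty range also counts as whitespace-only.
 */
static bool
is_html_whitespace_only(const unsigned char *begin, const unsigned char *end)
{
    for (const unsigned char *p = begin; p < end; p++) {
        switch (*p) {
            case '\t':
            case '\n':
            case '\f':
            case '\r':
            case ' ':
                /* Whitespace byte: move on to the next one. */
                continue;

            default:
                /* Any other byte means the text has real content. */
                return false;
        }
    }

    return true;
}
```

Inside the callback, the result of this function would take the place of ws. As far as I know, recent lexbor versions expose the token text through the text_start and text_end fields of lxb_html_token_t, but that is an assumption worth checking against token.h for the version in use. Keep in mind the examples earlier in the thread: dropping these tokens changes rendering and textContent, so this only suits cases where that loss is acceptable.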