Feature request: Option to skip whitespace nodes
I compared time and memory usage against myhtml by parsing this page 10 times: https://www.dropbox.com/s/cq3zonfmsvrcg4k/5.html.gz?dl=0
- myhtml: 10.52 s, 200.9 MB
- lexbor: 8.75 s, 179.9 MB
- myhtml (TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN): 6.47 s, 140.0 MB
So it would be nice if lexbor supported this option as well.
@kostya
I’ll think about how to do it right.
I will update lexbor soon (the internal handling of tag attributes will change), and it will become even faster.
What about adding a benchmark test to the source?
I will add them later.
By the way, why are whitespace nodes needed? I see no reason for them except serialization.
@kostya
For example (try in browser):
text<span> </span>space
or
text<div style="display: inline;"> </div>space
or
text<span>
</span>space
UPDATE: also, please see textContent.
If I understand correctly, any run of whitespace could be replaced with a single space (which could be a separate node type with no text storage).
This will break many tests. But even that doesn’t matter. The bottom line is that this will not give an increase in speed. The cost lies in creating the token and going through the "circle of hell" of building the tree.
Each user can implement this themselves by modifying the tokenizer callback.
Something like,
lxb_html_tokenizer_callback_token_done_set(tkz, blah_blah_callback, tree);

static lxb_html_token_t *
blah_blah_callback(lxb_html_tokenizer_t *tkz,
                   lxb_html_token_t *token, void *ctx)
{
    lxb_status_t status;

    if (token->tag_id == LXB_TAG__TEXT) {
        /* Check here whether the text is empty or whitespace-only, and set ws accordingly. */
        if (ws == true) {
            /* Skip the token: it is never passed to the tree builder. */
            return token;
        }
    }

    status = lxb_html_tree_insertion_mode(ctx, token);
    if (status != LXB_STATUS_OK) {
        tkz->status = status;

        return NULL;
    }

    return token;
}
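
The `ws` check in the callback above is left open. A minimal sketch of one way to compute it, assuming the token's text is available as a byte buffer with a known length (how to get that pointer and length from `lxb_html_token_t` depends on the lexbor version and is not shown here). The helper name `text_is_whitespace_only` is hypothetical and not part of lexbor; it treats the five ASCII whitespace characters from the HTML Standard as whitespace.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of lexbor.
 *
 * Returns true if the buffer contains only HTML ASCII whitespace
 * (TAB, LF, FF, CR, SPACE). An empty buffer also returns true.
 */
static bool
text_is_whitespace_only(const unsigned char *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        switch (data[i]) {
            case 0x09: /* TAB */
            case 0x0A: /* LF */
            case 0x0C: /* FF */
            case 0x0D: /* CR */
            case 0x20: /* SPACE */
                break;

            default:
                /* Any other byte means the text carries real content. */
                return false;
        }
    }

    return true;
}
```

Inside the callback, `ws` would then be the result of calling this helper on the text token's data. Note that dropping such tokens changes what textContent returns for cases like `text<span> </span>space`, so this is only suitable when the tree is not used for rendering or serialization.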