Update Lexbor
There are two issues with updating the Lexbor commit in our repo:
- Performance is degraded in newer versions of Lexbor, see benchmarks
- It becomes difficult to implement the ::text selector in newer versions of Lexbor
Hi @zyc9012
There are two issues with updating the Lexbor commit in our repo
- Performance is degraded in newer versions of Lexbor, see benchmarks
This needs to be investigated; it is a strange slowdown.
- It becomes difficult to implement the ::text selector in newer versions of Lexbor
And what does the pseudo selector ::text do?
Basically, it selects the text node.
For example:
<div>text1<span>text2</span></div>
div > ::text selects the "text1" node.
In previous versions, I was able to patch Lexbor code to support this. But with the new version, I haven't figured out the right way to patch it.
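To make the intended behavior concrete, here is a minimal sketch in C. It uses a toy node structure rather than Lexbor's real types: every `toy_*` name is hypothetical and exists only for illustration. It shows what "parent > ::text" is expected to return, namely the direct text children of the parent and nothing nested deeper.

```c
#include <stddef.h>

/* Toy DOM model for illustration only; these are NOT Lexbor types. */
typedef enum { TOY_ELEMENT, TOY_TEXT } toy_node_type_t;

typedef struct toy_node {
    toy_node_type_t  type;
    const char      *value;        /* tag name for elements, data for text */
    struct toy_node *first_child;
    struct toy_node *next;         /* next sibling */
} toy_node_t;

/*
 * Emulates "parent > ::text": stores up to `cap` direct text children of
 * `parent` in `out` and returns how many direct text children exist.
 */
size_t
toy_select_child_text(const toy_node_t *parent, const toy_node_t **out,
                      size_t cap)
{
    size_t found = 0;

    for (const toy_node_t *n = parent->first_child; n != NULL; n = n->next) {
        if (n->type != TOY_TEXT) {
            continue;   /* elements such as <span> are skipped, not entered */
        }
        if (found < cap) {
            out[found] = n;
        }
        found++;
    }

    return found;
}
```

For the `<div>text1<span>text2</span></div>` tree, calling this on the div yields only the "text1" node; "text2" is a child of the span, so it is reached only when the span itself is the parent.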
In general, this contradicts the specification. Selectors work with elements (ELEMENT_NODE), and text nodes have a different type (TEXT_NODE). Text nodes cannot have attributes. This means that such a pseudo-selector breaks the logic of the query itself.
For example:
div > ::text.myhome
or
div > ::text span
In the examples above, the first tries to check the class="myhome" attribute on a text node, and the second looks for a span descendant of a text node, which cannot have children. Both are contradictory.
If we follow this logic, then even at the selector parsing stage, it is necessary to identify such contradictions and throw a parsing error. This would not only contradict the specification, but also complicate the code and significantly affect performance.
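As a rough illustration of the kind of check described above, the sketch below rejects any selector in which `::text` is followed by something other than a list separator or the end of the string. The function `text_pseudo_is_terminal` is a hypothetical helper, not Lexbor code, and it works on the raw selector string for brevity; a real implementation would have to do this on the parsed selector structure, which is where the added complexity and cost would come from.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, NOT part of Lexbor: returns true when every "::text"
 * in the selector string is the last item of its complex selector, i.e. it
 * is followed only by optional whitespace and then a comma or the end of
 * the string.
 */
bool
text_pseudo_is_terminal(const char *selector)
{
    static const char needle[] = "::text";
    const char *p = selector;

    while ((p = strstr(p, needle)) != NULL) {
        p += sizeof(needle) - 1;

        /* Skip whitespace after the pseudo-element. */
        while (*p != '\0' && isspace((unsigned char) *p)) {
            p++;
        }

        /* Anything but a list separator or the end is a contradiction. */
        if (*p != '\0' && *p != ',') {
            return false;
        }
    }

    return true;
}
```

With this check, "div > ::text" is accepted, while "div > ::text.myhome" and "div > ::text span" are both rejected.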
In general, this contradicts the specification.
Yes, we know. We are not asking for it to be implemented in Lexbor itself; we patch Lexbor to support it in Nokolexbor. This is a real need in web scraping: without this feature, many extractions won't work.
Previously, we were able to patch Lexbor like this: https://github.com/serpapi/nokolexbor/blob/master/patches/0001-lexbor-support-text-pseudo-element.patch
But with the latest Lexbor code, it doesn't seem easy to do so.
@zyc9012
Here is a patch that adds ::text. However, I strongly recommend using other approaches.
Update (new patch version):
v2-0001-Selectos-experiment-addition-text.patch
#include <stdio.h>
#include <stdlib.h>

#include <lexbor/html/html.h>
#include <lexbor/css/css.h>
#include <lexbor/selectors/selectors.h>

/* Serialization callback: prints each chunk of serialized output. */
lxb_status_t
callback(const lxb_char_t *data, size_t len, void *ctx)
{
    printf("%.*s", (int) len, (const char *) data);

    return LXB_STATUS_OK;
}

/* Called for every node matched by the selector list. */
lxb_status_t
find_callback(lxb_dom_node_t *node, lxb_css_selector_specificity_t spec,
              void *ctx)
{
    unsigned *count = ctx;

    (*count)++;

    printf("%u) ", *count);
    (void) lxb_html_serialize_cb(node, callback, NULL);
    printf("\n");

    return LXB_STATUS_OK;
}

int
main(int argc, const char *argv[])
{
    unsigned count = 0;
    lxb_status_t status;
    lxb_dom_node_t *body;
    lxb_selectors_t *selectors;
    lxb_html_document_t *document;
    lxb_css_parser_t *parser;
    lxb_css_selector_list_t *list;

    /* HTML Data. */
    static const lxb_char_t html[] =
        "<p>"
        " <span id=s1 span=1>A</span>"
        " <span id=s2 span=2>B</span>"
        " <span id=s3 span=3><span>X</span></span>"
        " <span id=s4 span=4>C</span>"
        " <span id=s5 span=5>D</span>"
        "</p>";

    /* CSS Data. */
    static const lxb_char_t slctrs[] = "p > span > ::text";

    /* Create HTML Document. */
    document = lxb_html_document_create();
    status = lxb_html_document_parse(document, html,
                                     sizeof(html) / sizeof(lxb_char_t) - 1);
    if (status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Create CSS parser. */
    parser = lxb_css_parser_create();
    status = lxb_css_parser_init(parser, NULL);
    if (status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Selectors. */
    selectors = lxb_selectors_create();
    status = lxb_selectors_init(selectors);
    if (status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Parse and get the log. */
    list = lxb_css_selectors_parse(parser, slctrs,
                                   sizeof(slctrs) / sizeof(lxb_char_t) - 1);
    if (parser->status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Selector List Serialization. */
    printf("Selectors: ");
    (void) lxb_css_selector_serialize_list_chain(list, callback, NULL);
    printf("\n");

    /* Find HTML nodes by CSS Selectors. */
    body = lxb_dom_interface_node(document);

    printf("Found:\n");
    status = lxb_selectors_find(selectors, body, list, find_callback, &count);
    if (status != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    /* Destroy Selectors object. */
    (void) lxb_selectors_destroy(selectors, true);

    /* Destroy resources for CSS Parser. */
    (void) lxb_css_parser_destroy(parser, true);

    /*
     * Destroy all objects of the CSS Selector List, freeing all memory
     * allocated for it.
     * Alternatively, use lxb_css_memory_destroy(list->memory, true);
     */
    lxb_css_selector_list_destroy_memory(list);

    /* Destroy HTML Document. */
    lxb_html_document_destroy(document);

    return EXIT_SUCCESS;
}
Result:
Selectors: p > span > ::text
Found:
1) A
2) B
3) C
4) D