myhtml
Inner text of node?
Hi, I am trying to get the inner text of a node.
<a href="http://example-com">Link Name</a>
I tried different means to get the 'Link Name' part, but I always get NULL back.
myhtml_node_text(); // Returns NULL
myhtml_node_string(); // Returns an object with length == 0
myhtml_token_node_text(); // Returns NULL
myhtml_token_node_string(); // Returns an object with length == 0
Ah, never mind.
I had to first get the child node and then get the text with myhtml_node_text().
I am basing my program on some C# code which is why I thought that the node with the tag contained the link name.
But myhtml works a bit differently, I guess 😄
A C++ wrapper would be nice... just saying.
@Randshot Yes. The markup
<a href="http://example-com">Link Name</a>
creates this tree:
<a href="http://example-com">
-text: Link Name
To get the text from the <a> node, use myhtml_node_child and myhtml_node_text,
or use a collection:
myhtml_collection_t *nodes = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG_A, NULL);
myhtml_node_text(myhtml_node_child(nodes->list[0]), NULL);
or use the serialization functions (roughly innerText in JS):
myhtml_serialization_tree_callback(a_node->child, callback, NULL);
// or buffer
mycore_string_raw_t str = {0};
myhtml_serialization_tree_buffer(a_node->child, &str);
or get all the text nodes at once:
myhtml_collection_t *text_nodes = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG__TEXT, NULL);
myhtml_node_text(text_nodes->list[0], NULL);
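The point made above, that the text of <a> lives in a child text node rather than on the element itself, can be shown with a minimal self-contained sketch. The node_t struct and the first_child_text helper are hypothetical stand-ins invented for illustration; they are not myhtml's real types. They only mirror the idea behind calling myhtml_node_child and then myhtml_node_text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a parsed node; NOT myhtml's real structure. */
typedef struct node {
    const char  *tag;    /* "a" for an element, "-text" for a text node */
    const char  *text;   /* set only on text nodes, NULL on elements */
    struct node *child;  /* first child, or NULL */
    struct node *next;   /* next sibling, or NULL */
} node_t;

/* Returns the text of the first child if it is a text node, else NULL.
 * Same idea as myhtml_node_text(myhtml_node_child(node), NULL). */
static const char *first_child_text(const node_t *node)
{
    assert(node != NULL);
    if (node->child == NULL)
        return NULL;           /* element has no children at all */
    return node->child->text;  /* NULL when the child is not a text node */
}
```

With a text node holding "Link Name" attached as the child of an `a` element, first_child_text(&a) yields "Link Name", while the element's own text field stays NULL. That matches the NULL results reported at the top of this thread.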
Use Modest to search for nodes by CSS selectors (see the example); it's much easier than walking the tree by hand.
P.S.: Yes, a C++ wrapper is needed. Who would write one?!
I have started working on one. My C++ skills aren't the best, but it should be sufficient in most cases. For more intense usage, the C API should be used.
Thanks! Once it's done, could you send me a link to your wrapper?
@lexborisov Yeah sure. I plan to implement it as a single header wrapper which has various classes for myhtml. I am still unsure about some design aspects though.
For example, I have a Node class which contains a protected pointer to the myhtml node struct and various methods for reading and modifying the node.
Should I read all node properties when the Node object is initialized, or only fetch each property on demand through the provided functions (e.g. myhtml_node_text)?
@Randshot You do not need to store data in the class. It may become stale, which can cause confusion later. I think it should look like this, for example:
node->next();

/* class Node ... */
myhtml_tree_node_t *next() {
    /* read from the C structure on every call; nothing is cached */
    return myhtml_node_next(node);
}
@Randshot any updates on your wrapper?
@hbakhtiyor I haven't had any time for it lately. I will update you when I have some progress.
Hi, I have a similar issue, I cannot extract text from a
Hi, can you show me the HTML page (the HTML code)?
dump.log
This is the Google page I got, exactly what I passed to myhtml_parse:
myhtml_parse(pCtx->tree, MyENCODING_UTF_8, (char*)html_buffer, html_buffer_size);
No error is returned. Thanks.
It works fine for me. Code:
myhtml_parse(tree, MyENCODING_UTF_8, res.html, res.size);
myhtml_collection_t *collection = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG_SCRIPT, NULL);
for (size_t i = 0; i < collection->length; i++) {
    mycore_string_raw_t str = {0};
    if (collection->list[i]->child == NULL) {
        printf("Oh, God! This not work, I can't believe this is not working\n");
        exit(1);
    }
    myhtml_serialization_tree_buffer(collection->list[i]->child, &str);
    printf("%s\n", str.data);
    mycore_string_raw_destroy(&str, false);
}
Also, there is no get_child_node() function; the function is myhtml_node_child().
Thanks...
Yes, myhtml_node_child(), not get_child_node() (that was a typo). Strange... I'm not using a collection, and tokenizer_colorize_high_level() seems to work. I just do the following:
myhtml_parse()
node = myhtml_node_child()
Verify that the tag is TAG_HTML.
node = myhtml_node_child(node)
Verify that the tag is TAG_HEAD.
node = myhtml_node_child(node)
while(node)
    parse_node(node)
    node = myhtml_node_next(node)
At some point, my parse_node() function reaches a TAG_SCRIPT node, and that is where myhtml_node_child(node) returns NULL.
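The walk described above (descend with a child call, then iterate with a next-sibling call, and check for a missing child before reading it) can be sketched in a self-contained form. The node_t struct, find_child and count_childless are hypothetical names made up for this illustration and are not part of myhtml; in real code the corresponding steps would go through myhtml_node_child and myhtml_node_next, as named in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a tree node; NOT myhtml's real structure. */
typedef struct node {
    const char  *tag;    /* e.g. "html", "head", "script", "-text" */
    struct node *child;  /* first child, or NULL */
    struct node *next;   /* next sibling, or NULL */
} node_t;

/* Returns the first direct child of `parent` whose tag equals `tag`,
 * or NULL if there is none. */
static const node_t *find_child(const node_t *parent, const char *tag)
{
    assert(parent != NULL && tag != NULL);
    for (const node_t *n = parent->child; n != NULL; n = n->next)
        if (strcmp(n->tag, tag) == 0)
            return n;
    return NULL;
}

/* Counts direct children of `parent` that have no child of their own.
 * These are the nodes for which a "get child" call would return NULL,
 * so the caller must check before dereferencing. */
static size_t count_childless(const node_t *parent)
{
    assert(parent != NULL);
    size_t count = 0;
    for (const node_t *n = parent->child; n != NULL; n = n->next)
        if (n->child == NULL)
            count++;
    return count;
}
```

The sketch does not explain why a script node would lose its child under a particular parse flag; it only shows the traversal pattern and the NULL check that the loop needs.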
This may be linked to myhtml_tree_parse_flags_set(tree, MyHTML_TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN | MyHTML_TREE_PARSE_FLAGS_WITHOUT_DOCTYPE_IN_TREE);
I just tried parse_without_whitespace example, and I see that
I confirm that this is because of MyHTML_TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN.
Is a script considered whitespace?
I think there's a bug with the MyHTML_TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN flag.
myhtml_collection_t *text = myhtml_get_nodes_by_tag_id_in_scope(tree, NULL, classname_list->list[i]->child, MyHTML_TAG__TEXT, NULL);
const char *title = myhtml_node_text(text->list[0], NULL);
printf("%s\n", title);
If you want a "true" analog of innerText (which is not the same as textContent), I have an example written according to the spec: https://github.com/Azq2/perl-html5-dom/blob/f57c11343a3c8ab77a5162083791560de7d6746b/DOM.xs#L282
If you want the simpler textContent: https://github.com/Azq2/perl-html5-dom/blob/f57c11343a3c8ab77a5162083791560de7d6746b/DOM.xs#L252
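For readers who only need the simpler textContent behaviour, the idea is to concatenate every descendant text node in document order. Below is a minimal self-contained sketch of that idea. The node_t struct and the collect and text_content functions are hypothetical and written only for this illustration; they are neither myhtml's API nor the linked Perl XS code. A spec-accurate innerText is more involved, because it depends on how the content is rendered.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a tree node; NOT myhtml's real structure. */
typedef struct node {
    const char  *text;   /* non-NULL only for text nodes */
    struct node *child;  /* first child, or NULL */
    struct node *next;   /* next sibling, or NULL */
} node_t;

/* Depth-first walk: appends the text of `node` and of all its descendants
 * to `out` starting at offset `len`. Never writes past cap - 1 bytes, so
 * one byte always stays free for the terminator. Returns the new length. */
static size_t collect(const node_t *node, char *out, size_t cap, size_t len)
{
    if (node->text != NULL) {
        size_t n = strlen(node->text);
        if (n > cap - 1 - len)
            n = cap - 1 - len;  /* truncate when the buffer is full */
        memcpy(out + len, node->text, n);
        len += n;
    }
    for (const node_t *c = node->child; c != NULL; c = c->next)
        len = collect(c, out, cap, len);
    return len;
}

/* Writes the concatenated text under `node` into `out` as a
 * NUL-terminated string and returns the number of bytes written. */
static size_t text_content(const node_t *node, char *out, size_t cap)
{
    assert(node != NULL && out != NULL && cap > 0);
    size_t len = collect(node, out, cap, 0);
    out[len] = '\0';
    return len;
}
```

The fixed-size buffer with silent truncation is a simplification chosen to keep the sketch short; real code would more likely grow the buffer or report the required size.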