myhtml

Inner text of node?

Open no-realm opened this issue 8 years ago • 19 comments

Hi, I am trying to get the inner text of a node.

<a href="http://example-com">Link Name</a>

I tried different means to get the 'Link Name' part, but I always get NULL back.

myhtml_node_text(); // Returns NULL
myhtml_node_string(); // Returns an object with length == 0
myhtml_token_node_text(); // Returns NULL
myhtml_token_node_string(); // Returns an object with length == 0

no-realm avatar Apr 15 '17 01:04 no-realm

Ah, never mind. I had to first get the child node and then get the text with myhtml_node_text(). I am basing my program on some C# code which is why I thought that the node with the tag contained the link name.

But myhtml works a bit differently, I guess 😄 A C++ wrapper would be nice... just saying.

no-realm avatar Apr 15 '17 01:04 no-realm

@Randshot Yea,

<a href="http://example-com">Link Name</a>

creates this tree:

<a href="http://example-com">
    -text: Link Name

To get the text of the <a> node, use myhtml_node_child and myhtml_node_text, or use a collection:

myhtml_collection_t *nodes = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG_A, NULL);
/* the text lives in the <a> node's child; the second argument (length) may be NULL */
myhtml_node_text( myhtml_node_child(nodes->list[0]), NULL );
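
To make the tree shape above concrete, here is a minimal sketch in plain C. The toy_node struct and both functions are hypothetical stand-ins invented for illustration, not myhtml's real types or API. It only shows why asking the element for its text gives NULL while asking its first child gives the link name.

```c
#include <stddef.h>

/* Toy stand-in for a parsed node; an assumption, NOT myhtml's real struct. */
typedef struct toy_node {
    const char      *tag;    /* e.g. "a" or "-text" */
    const char      *text;   /* non-NULL only for text nodes */
    struct toy_node *child;  /* first child */
    struct toy_node *next;   /* next sibling */
} toy_node;

/* Text stored directly on a node: NULL for element nodes. */
const char *toy_node_text(const toy_node *node)
{
    return node ? node->text : NULL;
}

/* Text of the first child, which is where "Link Name" lives in the tree above. */
const char *toy_first_child_text(const toy_node *node)
{
    return (node && node->child) ? toy_node_text(node->child) : NULL;
}
```

With an "a" node whose only child is a "-text" node holding "Link Name", toy_node_text() on the "a" node yields NULL and toy_first_child_text() yields the link name.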

Or see the serialization functions (== innerText in JS):

myhtml_serialization_tree_callback(a_node->child, callback, NULL);
// or buffer
mycore_string_raw_t str = {0};
myhtml_serialization_tree_buffer(a_node->child, &str);

see example

or get all the text nodes at once

myhtml_collection_t *nodes = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG__TEXT, NULL);
myhtml_node_text( nodes->list[0], NULL );

Use Modest to search for nodes by CSS selectors (see example); it's much easier than walking the tree by hand.

P.S.: Yes, a C++ wrapper is needed. Who would do it?!

lexborisov avatar Apr 15 '17 07:04 lexborisov

I have started working on one. My C++ skills aren't the best, but it should be sufficient in most cases. For more intense usage, the C API should be used.

no-realm avatar Apr 15 '17 18:04 no-realm

Thanks! Once it's done, could you send me a link to your wrapper?

lexborisov avatar Apr 15 '17 18:04 lexborisov

@lexborisov Yeah sure. I plan to implement it as a single header wrapper which has various classes for myhtml. I am still unsure about some design aspects though.

For example, I have a Node class which contains a protected pointer to the myhtml node struct and various methods for reading and modifying the node. Should I read all node properties when the Node object is initialized, or only fetch a property on demand through the provided functions (e.g. myhtml_node_text)?

no-realm avatar Apr 15 '17 20:04 no-realm

@Randshot You do not need to store data in the class. It may become stale, which can cause confusion later. I think it should look like this, for example:

node->next();
/* class node... */
next() {
    return node->next; /* read from the C structure, or call myhtml_node_next(node) */
}
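
The trade-off between storing properties and reading them on demand can be sketched in plain C (the same idea carries over to a C++ class). Everything here is hypothetical: raw_node stands in for the library's node, and node_handle / node_snapshot are invented names for the two designs, not part of myhtml.

```c
#include <stddef.h>

/* Hypothetical stand-in for the library's node; not myhtml's real layout. */
typedef struct raw_node {
    const char      *text;
    struct raw_node *next;
} raw_node;

/* Design 1: keep only the pointer and read through on every call. */
typedef struct {
    raw_node *raw;
} node_handle;

/* Design 2: copy a property when the wrapper is created. */
typedef struct {
    raw_node   *raw;
    const char *cached_text;
} node_snapshot;

node_handle handle_from(raw_node *raw)
{
    node_handle h = { raw };
    return h;
}

node_snapshot snapshot_from(raw_node *raw)
{
    node_snapshot s = { raw, raw ? raw->text : NULL };
    return s;
}

/* Always reflects the current state of the underlying node. */
const char *handle_text(node_handle h)
{
    return h.raw ? h.raw->text : NULL;
}

/* Wraps the next sibling on demand; the handle's raw pointer is NULL at the end. */
node_handle handle_next(node_handle h)
{
    return handle_from(h.raw ? h.raw->next : NULL);
}

/* Returns whatever was copied at creation time, even if the node changed since. */
const char *snapshot_text(node_snapshot s)
{
    return s.cached_text;
}
```

If the underlying node's text is changed after both wrappers are created, handle_text() sees the new value while snapshot_text() still returns the old one, which is the staleness problem described above.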

lexborisov avatar Apr 15 '17 20:04 lexborisov

@Randshot any updates of your wrapper?

hbakhtiyor avatar May 03 '17 03:05 hbakhtiyor

@hbakhtiyor I haven't had any time for it lately. I will update you when I have some progress.

no-realm avatar May 03 '17 09:05 no-realm

Hi, I have a similar issue, I cannot extract text from a

fariouche avatar Jan 12 '18 09:01 fariouche

Hi, can you show me the HTML page (the HTML code)?

lexborisov avatar Jan 12 '18 09:01 lexborisov

dump.log

This is the Google page I've got, exactly what I've pushed to myhtml_parse:

myhtml_parse(pCtx->tree, MyENCODING_UTF_8, (char*)html_buffer, html_buffer_size);

No error returned. Thanks

fariouche avatar Jan 12 '18 10:01 fariouche

Works fine. Code:

    myhtml_parse(tree, MyENCODING_UTF_8, res.html, res.size);
    myhtml_collection_t *collection = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG_SCRIPT, NULL);
    
    for (size_t i = 0; i < collection->length; i++) {
        mycore_string_raw_t str = {0};
        if(collection->list[i]->child == NULL) {
            printf("Oh, God! This not work, I can't believe this is not working\n");
            exit(1);
        }
        
        myhtml_serialization_tree_buffer(collection->list[i]->child, &str);
        
        printf("%s\n", str.data);
        
        mycore_string_raw_destroy(&str, false);
    }

lexborisov avatar Jan 12 '18 10:01 lexborisov

Also, there is no get_child_node() function; the function is myhtml_node_child().

lexborisov avatar Jan 12 '18 10:01 lexborisov

Thanks...

Yes, myhtml_node_child(), not get_child_node() (typo). Strange... I'm not using a collection. And tokenizer_colorize_high_level() seems to work. I just do the following:

myhtml_parse()
node = myhtml_node_child()
Verify that the tag is TAG_HTML.
node = myhtml_node_child(node)
Verify that the tag is TAG_HEAD.
node = myhtml_node_child(node)
while(node)
    parse_node(node)
    node = myhtml_node_next(node)

At some point, my parse_node() function reaches a TAG_SCRIPT node, and that is where myhtml_node_child(node) returns NULL.

fariouche avatar Jan 12 '18 11:01 fariouche

This may be linked to:

myhtml_tree_parse_flags_set(tree, MyHTML_TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN | MyHTML_TREE_PARSE_FLAGS_WITHOUT_DOCTYPE_IN_TREE);

I just tried parse_without_whitespace example, and I see that

fariouche avatar Jan 12 '18 11:01 fariouche

I confirm that this is because of MyHTML_TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN.

Is a script a whitespace?
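
For reference, a rough sketch of what a "whitespace-only" check would normally mean, assuming the flag is meant to drop only text that consists entirely of HTML whitespace (space, tab, LF, FF, CR). is_whitespace_only is a hypothetical helper written for this comment; it says nothing about how myhtml implements the flag internally.

```c
#include <stdbool.h>
#include <stddef.h>

/* True only if every byte in data[0..length) is HTML whitespace.
   An empty run counts as whitespace-only. Hypothetical helper, not myhtml API. */
bool is_whitespace_only(const char *data, size_t length)
{
    for (size_t i = 0; i < length; i++) {
        switch (data[i]) {
            case ' ': case '\t': case '\n': case '\f': case '\r':
                break;          /* still whitespace, keep scanning */
            default:
                return false;   /* any other byte means real content */
        }
    }
    return true;
}
```

Under that reading, the indentation between tags would be skipped, but the body of a script would not, since it contains non-whitespace bytes.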

fariouche avatar Jan 12 '18 11:01 fariouche

I think there's a bug with the MyHTML_TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN flag.

lexborisov avatar Jan 12 '18 11:01 lexborisov

myhtml_collection_t *text = myhtml_get_nodes_by_tag_id_in_scope(tree, NULL, classname_list->list[i]->child, MyHTML_TAG__TEXT, NULL);
const char *title = myhtml_node_text(text->list[0], NULL);
printf("%s\n", title);

donglu avatar Apr 04 '18 16:04 donglu

If you want a "true" analog of innerText (!= textContent), I have an example written according to the spec: https://github.com/Azq2/perl-html5-dom/blob/f57c11343a3c8ab77a5162083791560de7d6746b/DOM.xs#L282

If you want the simpler textContent: https://github.com/Azq2/perl-html5-dom/blob/f57c11343a3c8ab77a5162083791560de7d6746b/DOM.xs#L252
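
The textContent idea itself is small: concatenate every descendant text node in document order. Below is a minimal sketch on a toy tree; toy_node and toy_text_content are assumptions made up for illustration and are not taken from the linked code or from myhtml. innerText is more involved because it also takes rendering into account, so it is not covered here.

```c
#include <stddef.h>
#include <string.h>

/* Toy node for illustration only; not myhtml's struct. */
typedef struct toy_node {
    const char      *text;   /* non-NULL only for text nodes */
    struct toy_node *child;  /* first child */
    struct toy_node *next;   /* next sibling */
} toy_node;

/* Append this node's text, then recurse into its children in document order.
   `used` counts the bytes needed so far, even when they no longer fit. */
static size_t collect(const toy_node *node, char *out, size_t cap, size_t used)
{
    if (node->text != NULL) {
        size_t len = strlen(node->text);
        if (used < cap) {
            size_t room = cap - used - 1;  /* keep one byte for the NUL */
            memcpy(out + used, node->text, len < room ? len : room);
        }
        used += len;
    }
    for (const toy_node *c = node->child; c != NULL; c = c->next)
        used = collect(c, out, cap, used);
    return used;
}

/* Writes the NUL-terminated (possibly truncated) text into out and returns
   the full length, so a caller can detect truncation like with snprintf. */
size_t toy_text_content(const toy_node *node, char *out, size_t cap)
{
    size_t used = (node != NULL) ? collect(node, out, cap, 0) : 0;
    if (cap > 0)
        out[used < cap ? used : cap - 1] = '\0';
    return used;
}
```

For a toy tree modelling a paragraph with the text "Hello ", a bold child containing "big", and the text " world", this collects "Hello big world".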

Azq2 avatar May 23 '18 21:05 Azq2