myhtml
SEGFAULT when parsing CDATA in single threading mode.
When using the parser in MyHTML_OPTIONS_PARSE_MODE_SINGLE mode, it is initialized in myhtml_init like this:
case MyHTML_OPTIONS_PARSE_MODE_SINGLE:
    if((status = myhtml_create_stream_and_batch(myhtml, 0, 0)))
        return status;
Since this call requests 0 streams, myhtml->thread_stream is initialized to NULL:
myhtml->thread_stream = NULL;
But then, when parsing CDATA (in myhtml_tokenizer_state_markup_declaration_open()), the parser calls myhtml_tree_wait_for_last_done_token(), which unconditionally accesses tree->myhtml->thread_stream->timespec and crashes, because thread_stream is NULL.
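To make the failure mode concrete, here is a minimal sketch of the kind of guard that would avoid the dereference. It does not use the real myhtml types: `sketch_thread_t`, `sketch_myhtml_t`, `sketch_tree_t`, the `wait_nsec` field and both functions are hypothetical stand-ins, and the actual fix in the library may look quite different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the library's internal structures.
 * They only model the pointer chain tree -> myhtml -> thread_stream. */
typedef struct { long wait_nsec; } sketch_thread_t;
typedef struct { sketch_thread_t *thread_stream; } sketch_myhtml_t;
typedef struct { sketch_myhtml_t *myhtml; } sketch_tree_t;

/* True only when a worker stream exists, i.e. when there is
 * something to wait for. In single mode thread_stream is NULL. */
bool sketch_needs_wait(const sketch_tree_t *tree)
{
    assert(tree != NULL && tree->myhtml != NULL);
    return tree->myhtml->thread_stream != NULL;
}

/* Guarded access: returns the wait interval of the stream,
 * or 0 when no stream was created (single mode). */
long sketch_wait_interval(const sketch_tree_t *tree)
{
    if (!sketch_needs_wait(tree))
        return 0; /* nothing to wait for, and nothing to dereference */
    return tree->myhtml->thread_stream->wait_nsec;
}
```

With a check of this shape, a tree whose parser was created without a stream returns early instead of reading through a NULL pointer.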
Backtrace:
myhtml_tree_wait_for_last_done_token(tree=., token_for_wait=.) at tree.c:2457
myhtml_tokenizer_state_markup_declaration_open(tree=., token_node=., html="…", html_offset=413, html_size=378555) at tokenizer.c:943
myhtml_tokenizer_chunk_process(tree=., html="…", html_length=378555) at tokenizer.c:88
myhtml_tokenizer_chunk(tree=., html="…", html_length=378555) at tokenizer.c:104
myhtml_tokenizer_begin(tree=., html="…", html_length=378555) at tokenizer.c:42
myhtml_parse_fragment(tree=., encoding=MyENCODING_DEFAULT, html="…") at main.c
Hi @Jean-Daniel. In single mode the tokens will always be equal, so the program will not enter the loop. Do you have an example HTML for which the program enters this loop in single mode?
I saw and corrected another problem. Please try the code from master.
Thanks for the report!
Sorry, I didn't give you enough info. I'm actually using the parser to extract some data from HTML fragments (I only have the body content), and I don't really need a full tree. So I'm using the 'after token done' callback and disabling tree building with MyHTML_TREE_PARSE_FLAGS_WITHOUT_BUILD_TREE.
A quick test reveals that it is the latter flag that triggers the bug. Without it, the parser works flawlessly, but when I set this flag, it crashes on CDATA.
#include <string.h>
#include <myhtml/api.h>

int main(int argc, char **argv) {
    const char *bytes = "<div><![CDATA[ foo ]]></div>";
    size_t length = strlen(bytes);

    // single-threaded parser
    myhtml_t* myhtml = myhtml_create();
    myhtml_init(myhtml, MyHTML_OPTIONS_PARSE_MODE_SINGLE, 1, 0);

    myhtml_tree_t* tree = myhtml_tree_create();
    myhtml_tree_init(tree, myhtml);

    // the WITHOUT_BUILD_TREE flag is the one that triggers the crash
    myhtml_tree_parse_flags_set(tree, MyHTML_TREE_PARSE_FLAGS_WITHOUT_BUILD_TREE | MyHTML_TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN);

    // parse html (we only have the body)
    myhtml_parse_fragment(tree, MyENCODING_UTF_8, bytes, length, MyHTML_TAG_BODY, MyHTML_NAMESPACE_HTML);

    myhtml_tree_destroy(tree);
    myhtml_destroy(myhtml);
    return 0;
}