
Chunked parsing issue

Open: skapix opened this issue 4 years ago • 1 comment

The document is parsed in chunks with the following code:

myhtml_tree_t* Parse(myhtml_t* myhtml, const std::string& body,
                     size_t chunk_sz) {
  myhtml_tree_t* tree = myhtml_tree_create();
  myhtml_tree_init(tree, myhtml);
  size_t body_chunk_pos = 0;
  while (body_chunk_pos < body.size()) {
    size_t current_chunk_sz = std::min(chunk_sz, body.size() - body_chunk_pos);
    mystatus_t parse_status = myhtml_parse_chunk_single(
        tree, body.c_str() + body_chunk_pos, current_chunk_sz);
    if (parse_status != MyHTML_STATUS_OK) {
      myhtml_tree_destroy(tree);
      return nullptr;
    }
    body_chunk_pos += current_chunk_sz;
  }
  return tree;
}

It is called with the following arguments:

myhtml_t* myhtml = myhtml_create();
myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);
std::string body = "<html><head><style>a</style></head><body>f</body></html>";
size_t chunk_sz = 13;
myhtml_tree_t* tree = Parse(myhtml, body, chunk_sz);

The result varies with the build options. In some cases the serialized tree looks like this:

<html><head><style>a</style></head><body>f</body></html></style></head><body></body></html>

In other cases it looks like this:

<html><head><style></style></head></html>

While it should be:

<html><head><style>a</style></head><body>f</body></html>

After some investigation, I found that the issue is inside myhtml_tokenizer_state_rawtext_end_tag_name and involves token_node->raw_begin.
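
One observation that may help narrow this down: with chunk_sz = 13, the closing </style> tag is split across two chunks. The sketch below does not use myhtml at all. SplitIntoChunks is a hypothetical helper that only reproduces the slicing arithmetic from Parse, so that the chunk boundaries can be inspected. Whether this split is the actual trigger is an assumption, though it would be consistent with the tokenizer state named above.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of myhtml): splits `body` into consecutive
// chunks of at most `chunk_sz` bytes, using the same arithmetic as Parse().
std::vector<std::string> SplitIntoChunks(const std::string& body,
                                         std::size_t chunk_sz) {
  std::vector<std::string> chunks;
  if (chunk_sz == 0) {
    return chunks;  // a zero chunk size would never advance; return nothing
  }
  for (std::size_t pos = 0; pos < body.size(); pos += chunk_sz) {
    // The last chunk may be shorter than chunk_sz.
    const std::size_t len = std::min(chunk_sz, body.size() - pos);
    chunks.push_back(body.substr(pos, len));
  }
  return chunks;
}
```

For the body above and a chunk size of 13, this gives five chunks: "<html><head><", "style>a</styl", "e></head><bod", "y>f</body></h" and "tml>". The second chunk ends in the middle of the </style> end tag name, and the third chunk starts with the rest of it.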

skapix, Jul 06 '21 09:07

It looks like the Lexbor project does not have a similar issue. Still, it would be nice to have this fixed here, since myhtml is a standalone HTML5 parser.

skapix, Jul 06 '21 09:07