Element wrong location level error handling

Open elekt opened this issue 6 years ago • 11 comments

I am working on a project that parses HTML and replaces href attributes. If the HTML is invalid because an <a> tag appears where a table cell (e.g. <td>) is expected, myhtml_insertion_mode_in_table handles the parse error by "foster parenting": it calls myhtml_insertion_mode_in_body with the <a> token.

The problem is that when I then loop through the tree's nodes, the node appears to have been added twice. The clone is added in myhtml_tree_active_formatting_reconstruction.

See the minimal HTML to reproduce: testminimal_github.txt

In my application I throw away the copy of the node, but for some reason when this happens the href (link 1 in the example) remains unchanged. It also messes up the order in which I get the nodes with node = myhtml_node_next(node). I would like to fix this bug in myhtml, and I would appreciate some help.

I am not looking to fix the invalid HTML, but to make sure every href link is changed and the structure stays the same.

elekt avatar Mar 19 '18 14:03 elekt

Hi! I'll deal with this soon. Thanks!

lexborisov avatar Mar 19 '18 15:03 lexborisov

I'm trying to understand the problem, but I don't see it yet. The specification actually requires this behavior. Try checking how a modern browser handles this example.

lexborisov avatar Mar 26 '18 16:03 lexborisov

Elekt is my colleague. Our use case is the following:

  1. Parse input HTML
  2. Modify some attributes
  3. Regenerate the HTML, but keep it as close to the original input HTML as possible (so without fixing it, adding more nodes, et cetera)

Is there a way to find out whether a node was artificially added by myhtml? We currently check whether position.length == 0, but this does not work in the example given above.

EmielBruijntjes avatar Mar 28 '18 07:03 EmielBruijntjes

There is a "flags" member in myhtml_tree_node_t, but it does not seem to be used much. It would be nice if this flag could be set to a special value that user-space programs can inspect, to check whether a node was (for example):

  • a real node that comes from the input HTML
  • an artificial node that was created by myhtml to fix a broken tree
  • a node that was moved to a different location in the tree to fix things
  • a node that was duplicated and added to the tree to fix things (like the links in the above example)
  • a node that was later modified by the user space program (like having a modified attribute)
  • a mismatched node (like not linked to a closing node)
  • a node that was opened and closed in one tag (like <br/>)
  • et cetera

For our own use case it would already be very helpful if we could recognize "artificial" nodes, so that we can skip them when we regenerate the source code.
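
To make the idea concrete, here is a minimal sketch of what such provenance bits could look like. Everything in it is hypothetical: the enum, its values, the origin_node struct and origin_node_is_artificial are illustrative stand-ins and not part of myhtml.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical provenance bits; none of these exist in myhtml. */
enum origin_flags {
    ORIGIN_SOURCE    = 0,       /* came straight from the input HTML */
    ORIGIN_SYNTHETIC = 1 << 0,  /* created by the parser to repair the tree */
    ORIGIN_MOVED     = 1 << 1,  /* relocated, e.g. by foster parenting */
    ORIGIN_CLONED    = 1 << 2,  /* duplicated while repairing the tree */
    ORIGIN_MODIFIED  = 1 << 3   /* changed later by the user-space program */
};

/* Minimal stand-in for a tree node; a real node has many more fields. */
struct origin_node {
    unsigned flags;
};

/* Returns 1 when the node has no text of its own in the input and
 * should be skipped when regenerating the source, 0 otherwise.
 * Moved nodes are kept: they still correspond to input text. */
int origin_node_is_artificial(const struct origin_node *node)
{
    assert(node != NULL);
    return (node->flags & (ORIGIN_SYNTHETIC | ORIGIN_CLONED)) != 0;
}
```

Using separate bits rather than one enumerated value lets a node be, say, both cloned and later modified, which a single "kind" field could not express.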

EmielBruijntjes avatar Mar 28 '18 09:03 EmielBruijntjes

I found a bug. For a cloned element we need pos.len = 0, but right now it contains garbage. This needs to be fixed.

lexborisov avatar Apr 03 '18 14:04 lexborisov

Can you elaborate a bit more? I assume it needs to be set in myhtml_tree_node_clone.

elekt avatar Apr 10 '18 08:04 elekt

It seems not. I will try to deal with this today. We need to understand at what point the cloned nodes end up with garbage in their position values. Position values in cloned nodes must be zero.
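
As a rough illustration of that invariant (not the actual myhtml code), a clone routine could copy the node and then reset its positions explicitly. The pos_position and pos_node types and both functions below are hypothetical stand-ins; where the garbage really comes from in myhtml is still unknown at this point in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a source position: offset and length in the input buffer. */
struct pos_position {
    size_t begin;
    size_t length;
};

/* Stand-in for a node; only the fields needed for the illustration. */
struct pos_node {
    int tag_id;
    struct pos_position raw_pos;
    struct pos_position element_pos;
};

/* Copy a node for reinsertion elsewhere in the tree. The clone has no
 * text of its own in the input, so both positions are reset to zero. */
struct pos_node pos_node_clone(const struct pos_node *src)
{
    struct pos_node copy;

    assert(src != NULL);
    copy = *src;
    copy.raw_pos.begin = 0;
    copy.raw_pos.length = 0;
    copy.element_pos.begin = 0;
    copy.element_pos.length = 0;
    return copy;
}

/* Consumer-side check: a zero length marks a node with no source text. */
int pos_node_has_source(const struct pos_node *node)
{
    assert(node != NULL);
    return node->element_pos.length != 0;
}
```

With positions zeroed this way, the existing position.length == 0 check on the consumer side would be enough to recognize clones.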

lexborisov avatar Apr 10 '18 10:04 lexborisov

Hello @lexborisov, do you need more info or help in any form?

EmielBruijntjes avatar Apr 17 '18 08:04 EmielBruijntjes

@EmielBruijntjes I understand the task, but it will take time. In enum myhtml_tree_node_flags we need to add something like MyHTML_TREE_NODE_CLONE and MyHTML_TREE_NODE_MOVED.

Usage:

if (node->flags & (MyHTML_TREE_NODE_CLONE|MyHTML_TREE_NODE_MOVED)) {
    ...
}
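
A self-contained sketch of how a consumer might use such flags while walking a tree is shown below. The flag names are shortened stand-ins for the proposed ones, their bit values are assumptions, and walk_node is a hypothetical struct that only mimics a first-child / next-sibling layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values; the thread only proposes the names. */
enum walk_node_flags {
    WALK_NODE_UNDEF = 0,
    WALK_NODE_CLONE = 1 << 0,
    WALK_NODE_MOVED = 1 << 1
};

/* Stand-in node with first-child and next-sibling links. */
struct walk_node {
    unsigned flags;
    struct walk_node *child;
    struct walk_node *next;
};

/* Returns 1 if the node was cloned or moved by the parser. */
int walk_node_is_flagged(const struct walk_node *node)
{
    assert(node != NULL);
    return (node->flags & (WALK_NODE_CLONE | WALK_NODE_MOVED)) != 0;
}

/* Depth-first count of nodes that carry neither flag. Subtrees under a
 * flagged node are still visited, since their children may be original. */
size_t count_unflagged(const struct walk_node *node)
{
    size_t total = 0;

    for (; node != NULL; node = node->next) {
        if (!walk_node_is_flagged(node))
            total++;
        total += count_unflagged(node->child);
    }
    return total;
}
```

Whether children of a cloned node should be visited or skipped depends on how the parser attaches them, so that choice is an assumption here.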

lexborisov avatar Apr 18 '18 18:04 lexborisov

@lexborisov Is there anything I can do to help you here? It's a feature that we would really like to have.

EmielBruijntjes avatar May 15 '18 06:05 EmielBruijntjes

Sorry, but in the current project I cannot do anything about this. The only option would be to somehow mark the cloned elements, and I would rather not spend the time on that.

lexborisov avatar Aug 17 '18 09:08 lexborisov