html_editor icon indicating copy to clipboard operation
html_editor copied to clipboard

Improve error handling in parsing

Open xeniagda opened this issue 2 years ago • 2 comments

This pull request does two things: introduce an error enum for possible failures during parsing, as well as track locations for all tokens to give information for errors. A few tests have been added to ensure the validity of the error tracking.

I made the decision to track locations as char indices rather than bytes in the source. This is mainly because this makes the tracking easier to write — we can simply call .enumerate() on the HTML in the html_to_stack function. I have tried to ensure that location gathering will never be O(n²), which could occur if you need to "count backwards" to see how long a thing you've kept in memory is in chars.

xeniagda avatar Apr 19 '23 20:04 xeniagda

1e2aa7d also fixes a slight bug in the Token::from(tag) function, where a tag consisting of only spaces would not be caught by the check because the suffix >//> in the tag would count to the name of the tag.

xeniagda avatar Apr 19 '23 20:04 xeniagda

We could trivially forward the location information to the DOM itself. I think this could be useful for any application which manipulates HTML provided by the user. Any time the user provides a DOM which parses but is considered invalid by the application, they would want to give the user information about what node caused the problem. Being able to point back into the original file would be quite useful here. However, for applications where this isn't the case (such as programs generating or modifying HTML), this could be an added complexity which might be inconvenient for the user. To store the location for each node type, we would probably have to change the Node type to a struct containing an Option<SourceLocation> as well as an instance of the old enum. This extra layer of types could be slightly inconvenient to manipulate, especially if the SourceLocation isn't needed.

Another, more convenient way would be to add the Option<SourceLocation> to the Element struct. This already has a few fields and could easily be ignored for applications which do not need the information. However, this would mean things like text nodes, comments and doctypes would lose the information.

A third, slightly more radical way would be to change all instances of String in the Node type and its descendents to a struct containing the String as well as the Option<SourceLocation>. This type could maybe Deref to a String. This would mean every single aspect of a Node, such as attribute keys and values, would be tracked.

I will not make any choice here, but I might implement one of these approaches in my fork to use in an application I'm developing.

xeniagda avatar Apr 19 '23 20:04 xeniagda