Handling serialisation overflow
I've been looking through external scanners, and was trying to work out how the serialisation worked in this package. Originally, I thought it would try its best and discard scopes deeper than it could hold, and that seems to be the case.
However, it does not record whether it discarded tags or not. This leads to closing tags being marked as erroneous even when they match. E.g., the following has a middle tag that's too long to be stored properly; the current behaviour is to mark its closing tag and `</foo>` as invalid.
```html
<html>
<bar>
<abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij>
<foo>
</foo>
</abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij>
</bar>
</html>
```
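For concreteness, here's a minimal sketch of the kind of truncating serialise function I'm describing. The `Scanner` struct and `tag_names` field are my own invention rather than this package's actual state; only the serialize signature and `TREE_SITTER_SERIALIZATION_BUFFER_SIZE` come from `tree_sitter/parser.h`:

```c
#include <string.h>
#include "tree_sitter/parser.h"  // defines TREE_SITTER_SERIALIZATION_BUFFER_SIZE

// Hypothetical scanner state: the names of the currently-open tags,
// outermost first.
typedef struct {
  char **tag_names;
  unsigned tag_count;
} Scanner;

unsigned tree_sitter_html_external_scanner_serialize(void *payload, char *buffer) {
  Scanner *scanner = payload;
  unsigned size = 0;
  for (unsigned i = 0; i < scanner->tag_count; i++) {
    unsigned name_length = (unsigned)strlen(scanner->tag_names[i]);
    // Stop once a name (plus its one-byte length prefix) no longer fits;
    // everything deeper is silently discarded, leaving no trace it existed.
    if (size + 1 + name_length > TREE_SITTER_SERIALIZATION_BUFFER_SIZE) break;
    buffer[size++] = (char)name_length;
    memcpy(&buffer[size], scanner->tag_names[i], name_length);
    size += name_length;
  }
  return size;
}
```

On deserialisation the stack comes back shorter than it really was, which is why the close tags above get flagged even though they match.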
This seems worse than marking invalid endings as valid. Could it use a bit to indicate if all the tags were saved, or if some were discarded? Could this be used to improve the behaviour?
It gets worse when the same tag name is nested:
```html
<html>
<bar>
<abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij>
<foo>
<bar>
</bar>
</foo>
</abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij>
</bar>
</html>
```
Because `bar` was the only name that made it into the buffer, it sees the inner closing `</bar>` and thinks that closes everything, so it marks the rest as invalid. I don't even know if it's possible to account for this one, considering some tags can be self-closing (so we can't just count tag ends to work out how many were discarded).
What about marking the "serialisation" as invalid or something, and requesting a full reparse from some point? (I was going to reference the TextMate parser here, but then saw that it doesn't even try to match anything, so no more suggestions.)
I guess there are two main parts to this question: how to handle arbitrary depth and size, and how to keep it fast.
Would it be possible (or a good idea) to allocate memory for each custom tag, and make the serialised buffer point to this memory? That way, the size wouldn't matter and it could consistently handle a certain depth.
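Roughly what I have in mind (a sketch only; it assumes the library would tolerate raw pointers sitting in the buffer, and it glosses over who owns and frees the names):

```c
#include <string.h>
#include "tree_sitter/parser.h"

typedef struct {
  char **tag_names;   // each name separately heap-allocated
  unsigned tag_count;
} Scanner;

unsigned tree_sitter_html_external_scanner_serialize(void *payload, char *buffer) {
  Scanner *scanner = payload;
  unsigned size = 0;
  // Store one pointer per open tag instead of the characters themselves, so
  // every tag costs a fixed sizeof(char *) bytes however long its name is.
  for (unsigned i = 0; i < scanner->tag_count; i++) {
    if (size + sizeof(char *) > TREE_SITTER_SERIALIZATION_BUFFER_SIZE) break;
    memcpy(&buffer[size], &scanner->tag_names[i], sizeof(char *));
    size += sizeof(char *);
  }
  return size;
}
```

With the default 1024-byte buffer and 8-byte pointers, that's a consistent depth of 128 tags regardless of their names.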
> Would it be possible (or a good idea) to allocate memory for each custom tag, and make the serialised buffer point to this memory? That way, the size wouldn't matter and it could consistently handle a certain depth.
We could do that, but the question is when to clean up those separately-allocated custom tag names. In the process of editing, there would be a lot of intermediate states where you had a different custom tag name, like:
at, ato, atom, atom-, atom-t, atom-te, atom-tex, ... atom-text-editor
Maybe we could only allocate them separately once you had a matching close tag? It's definitely possible but I'm just not sure it's worth the complexity. Do people ever have very very long tag names? I think they're always going to be around 20 characters or less in practice.
> However, it does not record whether it discarded tags or not. This leads to closing tags being marked as erroneous even when they match. This seems worse than marking invalid endings as valid.
Yeah, that's true. Maybe we should store a "discarded tag count" so that we could always handle that case?
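Sticking with the hypothetical scanner from the sketches above, that could be as cheap as reserving the first byte of the buffer (capped at 255 discards here; a wider integer would lift that):

```c
#include <string.h>
#include "tree_sitter/parser.h"

typedef struct { char **tag_names; unsigned tag_count; } Scanner;  // as above

unsigned tree_sitter_html_external_scanner_serialize(void *payload, char *buffer) {
  Scanner *scanner = payload;
  unsigned size = 1;  // byte 0 is reserved for the discarded-tag count
  unsigned serialized = 0;
  for (unsigned i = 0; i < scanner->tag_count; i++) {
    unsigned name_length = (unsigned)strlen(scanner->tag_names[i]);
    if (size + 1 + name_length > TREE_SITTER_SERIALIZATION_BUFFER_SIZE) break;
    buffer[size++] = (char)name_length;
    memcpy(&buffer[size], scanner->tag_names[i], name_length);
    size += name_length;
    serialized++;
  }
  // A non-zero count tells deserialize that the stack is incomplete, so a
  // close tag that matches nothing need not be reported as an error.
  unsigned discarded = scanner->tag_count - serialized;
  buffer[0] = (char)(discarded > 255 ? 255 : discarded);
  return size;
}
```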
> Do people ever have very very long tag names?
I don't even use HTML (currently); I'm just trying to make the parser as general as possible. I agree that long names are very unlikely to be used, but I also think this parser should have the capacity to support arbitrarily nested and tagged HTML, without a hard limit.
If memory allocation is allowed, then another idea is to treat the buffer as a linked list: mark at the end whether it stops there or continues at another location (potentially repeated, though a limit could be enforced). My concern there is whether a serialised buffer is always deserialised; if it isn't, this could leak memory.
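Sketched out, with a made-up chunk layout (these helpers are hypothetical, not anything the library provides):

```c
#include <stdlib.h>
#include <string.h>
#include "tree_sitter/parser.h"

// Hypothetical overflow chunk: a heap allocation the fixed buffer points at,
// which can itself chain onward to another chunk.
typedef struct OverflowChunk {
  struct OverflowChunk *next;  // NULL when the chain stops here
  unsigned length;             // number of bytes in data[]
  char data[];                 // more serialised tag names
} OverflowChunk;

// The last pointer-sized slot of the buffer holds either NULL ("everything
// fit") or the head of the chain.
#define LINK_OFFSET (TREE_SITTER_SERIALIZATION_BUFFER_SIZE - sizeof(OverflowChunk *))

static void write_link(char *buffer, OverflowChunk *head) {
  memcpy(&buffer[LINK_OFFSET], &head, sizeof head);
}

static OverflowChunk *read_link(const char *buffer) {
  OverflowChunk *head;
  memcpy(&head, &buffer[LINK_OFFSET], sizeof head);
  return head;
}

// The leak in question: if a serialised buffer can be thrown away without
// ever being deserialised, nobody walks the chain to free it.
static void free_chain(OverflowChunk *head) {
  while (head) {
    OverflowChunk *next = head->next;
    free(head);
    head = next;
  }
}
```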
On the Tree-sitter side, maybe it could look for a method to request a larger buffer size, up to some limit? That way, sane HTML works as it does currently, but deeper nesting could be accommodated on demand.
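Purely hypothetical, since no such method exists today, but it could be as simple as one extra symbol for the library to probe for alongside the existing scanner functions:

```c
#include "tree_sitter/parser.h"

typedef struct { char **tag_names; unsigned tag_count; } Scanner;  // as above

// Hypothetical: if a grammar exports this, the library asks it how much
// space to allocate (clamping the answer to some ceiling of its own).
unsigned tree_sitter_html_external_scanner_buffer_size(void *payload) {
  Scanner *scanner = payload;
  unsigned needed = 64 * scanner->tag_count;  // crude per-tag estimate
  return needed > TREE_SITTER_SERIALIZATION_BUFFER_SIZE
             ? needed
             : TREE_SITTER_SERIALIZATION_BUFFER_SIZE;
}
```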