Molten icon indicating copy to clipboard operation
Molten copied to clipboard

Alternative AST design

Open matklad opened this issue 6 years ago • 7 comments

Hi!

I'd like to share a possible alternative design for the lossless AST. I think the current Molten design is great, but it's always better to have more options available :) I maintain TOML plugin for the IntelliJ platform which uses this AST representation, and such representation is pretty typical for IntelliJ, and works great if you care about all trivia.

The idea is to create a two-layered tree structure. As the base, there's a homogeneous "untuped" tree which stores only spans in the original file, and the typed, "Abstract" representation is layered on top of it.

In code, the first layer look like this (super simplified)

// An "enum", representing all of the syntactic constructs of a particular language, 
// both simple tokens and composite elements.
struct NodeType(u32); 

//For Toml, there would be a bunch of constants like 
// const WHITESPACE: NodeType = NodeType(0);
// const EQALS_SIGN = NodeType(1);
// ....
// const KEY_VALUE_PAIR = NodeType(92);

// The *single* node type, which has a type, a text range, and a list of children. 
struct Node {
    type: NodeType,
    text_range: (usize, usize),
    children: Vec<Node>,
}

So, the parse function looks like fn parse(input: &str) -> Node. The Node representation is naturally lossless, because it only contains spans. Moreover, you can now write functions easily to traverse and edit the tree without the need to solve the expression problem. Its really nice to write code formatters against this representation, because you can make the rule-based( .around(NodeType).ensureSpaces(u32)).

However, this representation is not really convenient to work with because it is untyped. But you can add types! The idea is to create a separate struct for each important NodeType, such that the struct holds only Node with this NodeType. So, something like this:

trait AstNode {
    const TYPE: NodeType;
    fn new(node: Node) -> Self;
    fn node(&self) -> Node
}

struct TomlKeyValuePair(Node);

impl AstNode for TomlKeyValuePair {
    fn new(node: Node) {
        assert_eq!(node.type, KEY_VALUE_PAIR); // here's the transition between  untyped and typed worlds
        TomlKeyValuePair(node)
    }
}

struct TomlKey(Node);
impl AstNode for TomlKey { ... }

struct TomlValue(Node);
impl AstNode for TomlValue { ... }

impl TomlKeyValuePair {
    fn key(&self) -> TomlKey {
        TomlKey(self.0.children().find(|n| n.type == KEY).unwrap())
    }

    fn value(&self) -> TomlValue {
        TomlValue(self.0.children().find(|n| n.type == VALUE).unwrap())
    }
}

That is, abstract tree is a zero-cost wrapper around bare syntax tree which adds proofs that a particular node indeed holds a particular construct.

Hope this was interesting for you! :)

matklad avatar Nov 17 '17 18:11 matklad

Hey, thank you for the thorough insight!

Currently I want to get the crate feature-complete before making internal changes, but if I hit limitations of my current design in doing so, this is where I'll start.

Question: if the current Items could report their range, would this be a tangible benefit, or is that not useful to you on its own in the current architecture? That would be a fairly trivial change.

LeopoldArkham avatar Nov 22 '17 16:11 LeopoldArkham

Question: if the current Items could report their range, would this be a tangible benefit, or is that not useful to you on its own in the current architecture?

Yeah, I think getting a text range from an element is a pretty fundamental stuff, because you can't express this API via something else, and it can be pretty important for some uses: for example, to display line/column information, or to check if something is at the beginning of the line, or to check the relative order of items in the source.

However, these all seem to be important, but niche use-cases, so I feel that releasing 1.0 without text offsets is totally OK. Especially given that exposing ranges needs some design probably:

  • usize vs u32
  • is it OK to expose byte indices in the API? (some clients really want utf16)
  • For ranges, it may make sense to design a separate TextRange struct, which is also not entirely trivial (I have one here, and I bet the design is wrong).

matklad avatar Nov 22 '17 18:11 matklad

Well we'll probably implement that at some point then.

Regarding design:

  • I want to say usize, since the caller will need that to use it for indexing anyways. Any other concerns?
  • I guess we want to use the same level of abstraction as the Chars iterator the parser uses. Not certain what that translates into Unicode-wise.
  • Your version looks pretty straightforward and clean, I think it would be a good way to go, at a glance.

LeopoldArkham avatar Nov 24 '17 18:11 LeopoldArkham

I guess we want to use the same level of abstraction as the Chars iterator the parser uses. Not certain what that translates into Unicode-wise.

You mean, exposing range of elements in terms of "number of chars"? This has a drawback that you won't be able to slice a &str in O(1). Exposing raw utf8 lengths seems more appropriate, you rarely work at the level of chars anyway.

matklad avatar Nov 25 '17 10:11 matklad

Hi, @LeopoldArkham!

As someone who has tried to implement a lossless toml parser/editor and failed miserably, may I ask you a couple of questions about your AST design?

Toml currently put no restrictions on table ordering, e.g. this is a valid toml document:

[[bin]] # bin 1
[a.b.c.d]
[a]
[other.table]
[a.b.c.e]
[[bin]] # bin 2
[a.b.c]
[[bin]] # bin 3

How does Molten handle this document? If you ask for a key a, does the returned container contains all of the a's subtables? How scattered arrays of tables are stored?

In toml-edit I decided to enforce the constraints (not in how toml is parsed, but in how it is stored and printed). What are your thoughts on this?

ordian avatar Dec 05 '17 22:12 ordian

@matklad Agreed, raw utf8 is the way to go; (I don't "speak" UTF very well)

@ordian The general strategy is to try to keep the AST faithful to the document as much as possible, since that's what's driving the problem, and then glue together anything that needs it in the editing/indexing logic. I may need to make some changes to the AST down the line, but I wanted to try the simplest approach first since TOML is meant to be optimized for ease of prasing (Hint: it isn't :p)

LeopoldArkham avatar Dec 10 '17 14:12 LeopoldArkham

I've tried to implement my design: https://users.rust-lang.org/t/tom-yet-another-format-preserving-toml-parser/18765.

Overall, it seems to work, and I am quite happy with the fact that it allows both lossless parsing as well as lossless unrestricted editing. The problem with that is of course that you probably want some form of restricted editing, so as to prevent, by default, creation of invalid toml documents. Hopefully it should be possible to layer restrictions on top of the unrestricted API!

matklad avatar Jul 13 '18 23:07 matklad