RFC: Use a flat structure for storing nodes instead of an AST

Open DavidHancu opened this issue 7 months ago • 0 comments

Describe the feature

This RFC (Request for Comments) is meant as a suggestion for an improvement to the memory structure of the current syntax tree based on data-oriented patterns.

This RFC also aims to provide full compatibility with the Visit trait from swc_ecma_visit. The VisitMut trait implementation details are still up for debate.

1. Introduction to Data-Oriented Design

Processors load data by cache lines (usually 64 or 128 bytes) which makes contiguous arrays perfect for data similar in semantics. Because of cache lines, the processor will load the next few elements from the array which has the possibility of improving the performance of visiting the tree.

2. The New Format

To continue iterating on this RFC, we must define the new structs and enums that compose our new format. To do this, let's start by defining the Node struct, which is the core item inside of our new format.

struct Node {
    kind: NodeKind,
    data: NodeData,
    id: NodeId
}

Each Node will represent an item in the tree. These can be expressions, declarations, and others.

Now let's define the enums that Node uses.

2.1. `NodeKind`

In Rust, the size of an enum is the same as the size of the largest variant in the enum. This brings some performance hits when attempting to match the enum variant. To fix this, we will implement a new NodeKind enum that is 1-bit sized and holds the type of a Node.

The definition for NodeKind would be as follows:

pub enum NodeKind {
    ArrayLit,
    ArrayPat,
    // ...
}

Every struct and enum inside the current version of swc_ecma_ast will be used to create the NodeKind enum.

2.2. `NodeData`

An AST without data would be pretty useless. That's why we will also have the NodeData enum, which will contain the current structs inside swc_ecma_ast, with some modifications to fit them into our new format.

Most notably, references to other Nodes will be replaced with NodeId.

pub enum NodeData {
    ArrayLit(ArrayLit),
    ArrayPat(ArrayPat),
    // ...
}

2.3. `NodeId`

To be able to reference Nodes, we will use IDs, which are indexes into our Vec.

pub struct NodeId(u64);

2.4. `Module`

The new Module struct will have the following form:

pub struct Module {
    nodes: Vec<Node>
}

3. Understanding `Vec<Node>`

To understand the way that Vec<Node> is constructed via the parser, we have to make sure that children always come after their parents.

To make this easier to grasp, I've provided a tree structure to flatten format transformation example.

item1
  item2
    item3
  item4

The tree above will be organized as:

item1, item2, item3, item4

4. Compatibility with `Visit`

Due to the ordering of the nodes inside of Vec<Node>, the Visit trait is trivial to implement. A naive implementation would look like this:

for node in module.nodes {
  // Based on the node kind, call the correct function on the visitor
}

This does not include any control flow but that can be easily implemented later.

5. Closing Thoughts

Please let me know if you have any ideas or questions regarding this RFC. Also, solutions for VisitMut are needed before attempting to implement this, otherwise we'd be introducing breaking changes in the API.

Babel plugin or link to the feature description

No response

Additional context

No response

Jul 20 '24 07:07 DavidHancu

swc swc copied to clipboard

RFC: Use a flat structure for storing nodes instead of an AST

Describe the feature

1. Introduction to Data-Oriented Design

2. The New Format

2.1. NodeKind

2.2. NodeData

2.3. NodeId

2.4. Module

3. Understanding Vec<Node>

4. Compatibility with Visit

5. Closing Thoughts

Babel plugin or link to the feature description

Additional context

swc
swc copied to clipboard

2.1. `NodeKind`

2.2. `NodeData`

2.3. `NodeId`

2.4. `Module`

3. Understanding `Vec<Node>`

4. Compatibility with `Visit`