
V2 documentation and design

Open GregRos opened this issue 1 year ago • 14 comments

I've been doing a lot of design work for v2, which in my case basically means writing documentation. It's a bit weird to write documentation as a design document, but it seems to work for me, and I think the result is pretty good.

I haven't implemented anything, though I have made various experiments to make sure some of the ideas can be implemented.

I'm using this PR more as a tool to talk about the changes, features, etc, so I'm not really expecting a review of the actual content. There are a lot of articles that kind of go nowhere, the SCSS is not great, and there are problems with the visual theme.

If you open parjs.vault with Obsidian, it will render a table that shows progress on each article, so you can go straight to ones with content.

Alternatively, you can just go to https://publish.obsidian.md/parjs/%E2%9C%A8+START+HERE+%E2%9C%A8 to see the published version. There are some issues with the theme over there, it doesn't have a few of the visual gimmicks you get if you open it locally, and it doesn't render the progress report, but the articles are the same.

It also makes sense to check out https://publish.obsidian.md/parjs/breaking+changes

GregRos avatar Jun 25 '24 22:06 GregRos

Coverage Status

coverage: 93.945% (-5.9%) from 99.813% when pulling b88bfe2a99ceedf643e0352be0b0a91345cb1a19 on v2/draft-docs into 2eedef1e20d343566d55cd40773d13b0907095dc on master.

coveralls avatar Jun 25 '24 22:06 coveralls

Hmm, I’ve decided I should write a bit about why I’ve made these design decisions and changes.

I’ve recently worked on #100 and #98 and generally improving support for parsing Unicode characters. This required a rewrite of the char-info package with the following goals:

  1. Support characters outside the BMP
  2. Improve performance by using a more efficient data structure
  3. Create an automatic process to construct the data structure based on a given Unicode version, where previously it was just some stuff I copied and pasted from another library.
  4. Do the same for type definitions involving the names of Unicode entities, such as Scripts, Blocks, Properties, and Categories.

I also had some ideas for improving the package’s performance by an order of magnitude.

As part of that project, I decided to use parjs to parse the UCD, the Unicode Character Database. To see what that looks like, you can check out this branch.

I thought this would be a really good dogfooding exercise, as well as a usability and performance test (the UCD includes text files several megabytes in size).

I haven’t used the library in a while, and honestly I found using it quite frustrating and not up to my standards today. There were many issues:

  1. Debugging was very frustrating and did not create a connection between the code I was writing and the error I was seeing
  2. Visualizations were frequently misleading or incorrect
  3. Failure reasons were confusing and unclear

Moreover, performance was frankly quite bad compared to more primitive string operations, and in some cases even to regular expressions. The UCD doesn’t actually contain Unicode characters, only ASCII, so this wasn’t related to the char-info package at all.

As I was using it, I came up with many ideas for improving the package’s design and performance. Some of them are based on the proposal #101, while others had been cooking in my brain over the past year or so, including the list of ideas I conveyed to @mikavilpas back in October.

Some concepts:

  1. Formalizing the concept of a “parse graph” that has lived in my head since I started working on the library.
    1. A trace that points to a position in the parse graph, which can be formatted as a string similar to a stack trace.
  2. “Fatter” parser nodes constructed incrementally using tuners and boosters, instance methods that don’t create parent parsers but instead return a new parser based on the current one.
  3. This is especially true for char parsers, which use an entirely different model that supports parsing arbitrarily complex character classes without any change to performance.
  4. Making the “character” be parser-specific and not related to JavaScript characters.
  5. A lot of precomputation when creating parsers in order to improve performance when executing them.
    1. Capturing stack traces when parsers are created to link parsing failures to user code.
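
To make (2) and (3) concrete, here's a minimal sketch of what a tuner on a char parser could look like. All the names (`CharClass`, `CharParser`, `anyCharOf`, `or`) are hypothetical, and a real implementation would compile the merged character class into an efficient structure rather than chaining predicates:

```typescript
// Hypothetical sketch of a "tuner" on a char parser: `or` doesn't wrap
// the current parser in a parent combinator; it merges the character
// classes and returns a flat parser of the same shape.
type CharClass = (codePoint: number) => boolean;

class CharParser {
    constructor(private readonly cls: CharClass) {}

    // Tuner: a new CharParser with a merged class, no extra parse node.
    or(other: CharParser): CharParser {
        const a = this.cls;
        const b = other.cls;
        return new CharParser(cp => a(cp) || b(cp));
    }

    matches(char: string): boolean {
        return this.cls(char.codePointAt(0)!);
    }
}

const anyCharOf = (chars: string) => {
    const set = new Set([...chars].map(c => c.codePointAt(0)!));
    return new CharParser(cp => set.has(cp));
};

const digitOrDash = anyCharOf("0123456789").or(anyCharOf("-"));
```

The point is that however many tuners are applied, the result is still a single char parser node, so the parse graph stays flat.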

GregRos avatar Jun 26 '24 11:06 GregRos

Wow, this seems like the start of a great new adventure 🙂

There's a ton of stuff to read and digest here, so I think it's best I do it in small parts. I think I'll disregard any notion of structure and go full ADD mode right from the start - so here's some random thoughts:

Debugging was very frustrating and did not create a connection between the code I was writing and the error I was seeing ... Formalizing the concept of a “parse graph” that lived in my head since I started working on the library.

Yeah, I have been bitten by debugging challenges myself. I think there is definitely room for improvement here.

In my previous job, I worked on a parser combinator system that was implemented in Scala. Whenever it parsed a token in the language, it also recorded metadata about the parsing context.

Although I think it was mostly the position in the source text, I think using this approach for a more detailed parse graph could work. It could also make it possible to provide multiple ways of rendering an error: maybe different for testing, a runtime (debug) error, or a totally custom implementation. As an example, at my job we showed parse errors as "red squiggly lines" with https://github.com/microsoft/monaco-editor.

mikavilpas avatar Jun 26 '24 16:06 mikavilpas

Related to performance, do you have a hunch about what things are the cause of performance issues/limitations right now?

mikavilpas avatar Jun 26 '24 17:06 mikavilpas

There's a ton of stuff to read and digest here

Yup, haha. One thing led to another and it just came out this way. Please tell me if there is anything I can do to make it more comprehensible

There's a ton of stuff to read and digest here, so I think it's best I do it in small parts. I think I'll disregard any notion of structure and go full ADD mode right from the start - so here's some random thoughts:

Haha damn, you too? That’s great! Go full ADD, it’s the best.

In my previous job, I worked on a parser combinator system that was implemented in scala. Whenever it parsed a token in the language, it also recorded metadata about the parsing context.

Although I think it was mostly the position in the source text, I think using this approach for a more detailed parse graph could work. It could also make it possible to provide multiple ways of rendering an error: maybe different for testing, a runtime (debug) error, or a totally custom implementation. As an example, at my job we showed parse errors as "red squiggly lines" with https://github.com/microsoft/monaco-editor.

That’s a really good idea! I think we can combine a parse graph together with capture information, and maybe even the user state.

A “capture graph” would be a structure similar to a parse graph, but each node would be annotated with a set of captures. A lot of the time, errors are actually caused by successes that weren’t supposed to happen, and this would let you see if some parsers were a bit greedier than you expected.
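
As a rough sketch of what such a capture graph could look like as a data structure (all names here are made up for illustration):

```typescript
// Hypothetical capture graph node: each node records which parser ran
// and what it consumed, so "greedier than expected" becomes a query.
interface Capture {
    start: number; // offset into the input
    end: number;
    text: string;
}

interface CaptureNode {
    parser: string;          // name or id of the parser that ran
    captures: Capture[];     // what this node consumed
    children: CaptureNode[]; // nested parser runs
}

// Example query: find parsers that captured more than `max` characters.
function capturesLongerThan(node: CaptureNode, max: number): CaptureNode[] {
    const hits = node.captures.some(c => c.end - c.start > max) ? [node] : [];
    return hits.concat(...node.children.map(c => capturesLongerThan(c, max)));
}
```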

Some ways to visualize it without an IDE frontend:

  1. Using ANSI colors and formatting.
  2. Using combining Unicode characters like underlines and overlines.

Here is a toy example of how option 1 might look, which I just made using chalk:

(screenshot: example terminal output colorized with chalk)

The numbers would be matched to names of parsers in a legend below the visualization.

In practice, this visualization is a little tricky to implement because parsers might be large with a lot of nesting. Any text captured by a child parser is also captured by the parent parser, and if we want to visualize that, we would need to use many different layers.

Maybe instead of visualizing all parser captures, we can visualize only ones deemed important by the user. If we only associate some parsers with capture metadata, it would also be cheaper in terms of performance.

There’s not just successes, of course. There are also failures. A more complete graph would have failed nodes, or entire subgraphs of captures that were reverted due to a panic (Hard fail).

I’m at a loss about how to visualize this more complete version, except for literally drawing a graph using something like [graphviz](https://graphviz.org/).

GregRos avatar Jun 28 '24 10:06 GregRos

About performance, I know general things that slow down JavaScript code:

  1. Allocations and resulting GC
  2. Function call overhead
  3. When the JS engine decides to disable optimizations for bits of code or specific objects. In that case, absolutely any code can be a bottleneck, if some really important optimization is disabled in it. There are lots of different optimizations and they're all kind of interrelated but can occur independently too.
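
As a tiny illustration of point 1, here's the generic trick of reusing one mutable result object instead of allocating a fresh one per call. This is a sketch, not parjs code:

```typescript
// Illustration of allocation pressure: a fresh result object per parse
// step creates GC work; reusing one mutable result avoids it, at the
// cost of the caller promising not to hold on to the result.
interface Result { ok: boolean; value: string; }

// Allocating version: one new object per call.
function parseAlloc(input: string, pos: number): Result {
    return { ok: pos < input.length, value: input[pos] ?? "" };
}

// Reusing version: the same object is mutated and returned every time.
const shared: Result = { ok: false, value: "" };
function parseReuse(input: string, pos: number): Result {
    shared.ok = pos < input.length;
    shared.value = input[pos] ?? "";
    return shared;
}
```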

I'm going to profile it and see if I can see anything.


A bit more about debugging. Right now, if you're stopped on a breakpoint in a projection function, you can't really know where you are within the parse graph, since the parse stack isn't part of the call stack, and there is generally no relationship between them.

It might be possible to use various tricks so that frames linked to the location where you constructed the parser will show up in the call stack you see when you're in a breakpoint. I have some ideas. It probably won't be cross-platform and will work only in V8.

Besides that, there could be some members added to the global scope that can tell you where you are in the parse graph, let you examine the parse stack, run queries on the capture graph, etc.

GregRos avatar Jun 28 '24 16:06 GregRos

Honestly, thinking about it, (2) can just be described as a special case of (3), since one of those optimizations is eliminating function calls by inlining. So if that optimization were on all the time (that's impossible, and it's the same with the other optimizations), you wouldn't have any function call overhead.

I don't think V8 has any optimizations that help with (1), though I think it's possible in principle: if the engine can guarantee a specific object isn't used anywhere outside a bit of code, it can be stack allocated instead of heap allocated. JS engines don't do this (yet?).

GregRos avatar Jun 28 '24 17:06 GregRos

Maybe instead of visualizing all parser captures, we can visualize only ones deemed important by the user. If we only associate some parsers with capture metadata, it would also be cheaper in terms of performance.

I think this is a very good idea. Most of the time working on a parser is focused on one small area of it rather than the whole. I also love the example with the colors! That's an insane step up from what is currently available 👍🏻

mikavilpas avatar Jun 28 '24 18:06 mikavilpas

I’m at a loss about how to visualize this more complete version, except for literally drawing a graph using something like [graphviz](https://graphviz.org/).

Well, why not! As long as visualization helps the user, this sounds good. An intermediate representation would nicely allow rendering in many formats, such as the terminal, an image or other static visualization, or an interactive one.

Another idea: the visualization could even be interactive. I sometimes spend time staring at JS bundle analyses that do this, e.g. https://esbuild.github.io/analyze/ (click on "load an example")

mikavilpas avatar Jun 28 '24 18:06 mikavilpas

Related to performance, do you have a hunch about what things are the cause of performance issues/limitations right now?

So yeah, it turned out to be mostly function call overhead, at least for the parser I was running.

That just means that the fatter each parse node is (i.e. the more logic it's responsible for), the better, since fewer functions need to be called. It also looks like the current method of having a function called apply that does stuff and then calls _apply is bad, since it basically doubles the number of calls.

I got rid of apply, and with a few more tweaks got it to run twice as fast.
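
Here's a simplified before/after sketch of that change (the names are illustrative, not the actual parjs internals):

```typescript
// Before: the public `apply` forwards to the real logic in `_apply`,
// doubling the number of calls per parse step.
interface State { input: string; position: number; }

class WrappedCharParser {
    constructor(private readonly expected: string) {}
    apply(state: State): boolean {
        return this._apply(state); // second call for the actual logic
    }
    private _apply(state: State): boolean {
        if (state.input[state.position] !== this.expected) return false;
        state.position++;
        return true;
    }
}

// After: one function per parse step, nothing to forward to.
const charParser = (expected: string) =>
    (state: State): boolean => {
        if (state.input[state.position] !== expected) return false;
        state.position++;
        return true;
    };
```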

Another idea: the visualization could even be interactive. I sometimes spend time staring at JS bundle analyses that do this, e.g. https://esbuild.github.io/analyze/ (click on "load an example")

That would be really cool!

GregRos avatar Jun 30 '24 21:06 GregRos

I took a look at the deoptimization logs. The main takeaway is that a lot of functions become what V8 calls "megamorphic", because they're called with arguments that have different "types" (the different parser types).

If we pass around identically-typed, streamlined parser core objects instead of the entire wrapper, code working with them will become monomorphic, which will make function calls execute faster. That's part of the plan for v2.
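
A minimal sketch of that idea, with hypothetical names: the user-facing wrappers can stay as polymorphic as they like, as long as the hot loop only ever touches cores of one identical shape.

```typescript
// Every core has this exact shape, so the call site `c(state)` in the
// hot loop only ever sees one "type" and stays monomorphic, no matter
// which wrapper class produced each core.
interface State { input: string; position: number; }
type Core = (state: State) => boolean;

class AnyCharWrapper {
    // Rich user-facing API would live on the wrapper, off the hot path.
    readonly core: Core = s =>
        s.position < s.input.length && (s.position++, true);
}

class EofWrapper {
    readonly core: Core = s => s.position >= s.input.length;
}

// The hot loop works purely with `Core` values.
function runSequence(cores: Core[], state: State): boolean {
    for (const c of cores) if (!c(state)) return false;
    return true;
}
```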

There is a cool extension called deoptexplorer-vscode that lets you investigate V8 optimization logs, and the launcher package dexnode generates those logs. I just run it with npx dexnode file.js and open the resulting .log file with the extension.

dexnode just starts node with a ton of CLI arguments that enable optimization logging, like --trace-opt.

GregRos avatar Jul 01 '24 11:07 GregRos

I finally figured out how to do the Unicode char parser. I've written a custom immutable trie data structure inspired by https://unicode-org.github.io/icu/design/struct/utrie that's optimized for storing individual bits per Unicode character, with support for sparse and range-based data.

Still needs testing, but I will probably create a separate PR for it soonish.
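
For the curious, here's a very rough sketch of the two-stage lookup idea behind UTrie-style structures: a code point is split into a high index and a low offset, so a flat bit-per-character table can be compressed by sharing identical blocks. The real structure differs (block deduplication, sparse ranges, etc.); this sketch doesn't deduplicate at all.

```typescript
// Two-stage bit trie sketch: index[high] picks a block, then the low
// bits of the code point pick a bit within that block.
const BLOCK_BITS = 8;
const BLOCK_SIZE = 1 << BLOCK_BITS;

class BitTrie {
    constructor(
        private readonly index: Uint16Array, // high bits -> block number
        private readonly blocks: Uint8Array  // packed bits, 1 per char
    ) {}

    has(codePoint: number): boolean {
        const block = this.index[codePoint >> BLOCK_BITS];
        const offset = codePoint & (BLOCK_SIZE - 1);
        const byte = this.blocks[block * (BLOCK_SIZE >> 3) + (offset >> 3)];
        return ((byte >> (offset & 7)) & 1) === 1;
    }
}

// Naive builder for the sketch: one block per high index, no dedup.
function buildTrie(codePoints: number[]): BitTrie {
    const maxHigh = Math.max(...codePoints) >> BLOCK_BITS;
    const index = new Uint16Array(maxHigh + 1);
    const blocks = new Uint8Array((maxHigh + 1) * (BLOCK_SIZE >> 3));
    for (let i = 0; i <= maxHigh; i++) index[i] = i;
    for (const cp of codePoints) {
        const offset = cp & (BLOCK_SIZE - 1);
        blocks[(cp >> BLOCK_BITS) * (BLOCK_SIZE >> 3) + (offset >> 3)] |=
            1 << (offset & 7);
    }
    return new BitTrie(index, blocks);
}
```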

GregRos avatar Jul 07 '24 09:07 GregRos

The idea of the parser core objects has evolved into just passing around closures. I asked in one of the V8 engine groups, and it seems this method will probably work for avoiding megamorphism.

My current design is having three different representations of the parse tree, each with its own type of node.

Logic

The "logic" node is just a closure with all the parameters baked in. It takes a ParjsState object that returns a ParjsResult. So purely functional.

  • This avoids the engine trying to resolve a parse method.
  • The parse nodes are always the same type (function).

The downside is that this version of a parse node is opaque. It can't be inspected from the outside and can't be manipulated by users.

Here is some snipped code I'm working on:

function _read(
    node: ParseNode, // see below
    length: number
): ParjsLogic<string> {
    const result = _createLogicResult({
        reason: msg_parsed_all_chars,
        signal: ParjsSignal.Ok,
        value: ""
    })
    return (state: ParjsState) => {
        const start = state.position
        for (let i = 0; i < length; i++) {
            const char = state.input[state.position]
            if (char === undefined) {
                result.reason = msg_reached_eof
                result.signal = ParjsSignal.UnexpectedEOF
                return result
            }
            state.position++
        }
        result.value = state.input.slice(start, state.position)
        result.reason = msg_parsed_all_chars
        result.signal = ParjsSignal.Ok
        return result
    }
}
  • The result.value can be polymorphic in combinators like many. I think it will be okay provided we don't try to access it.
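
To show how combinators could fit this purely functional model, here's a self-contained sketch of composing two logic closures (using simplified stand-in types, not the snippet's real helpers):

```typescript
// Simplified stand-ins for ParjsState / ParjsResult / ParjsLogic.
interface State { input: string; position: number; }
interface Result<T> { ok: boolean; value?: T; }
type Logic<T> = (state: State) => Result<T>;

// A trivial logic node: consume one character.
const readOne: Logic<string> = state => {
    const char = state.input[state.position];
    if (char === undefined) return { ok: false };
    state.position++;
    return { ok: true, value: char };
};

// A combinator bakes both children into a new closure; no classes and
// no method lookup at parse time.
function then<A, B>(a: Logic<A>, b: Logic<B>): Logic<[A, B]> {
    return state => {
        const ra = a(state);
        if (!ra.ok) return { ok: false };
        const rb = b(state);
        if (!rb.ok) return { ok: false };
        return { ok: true, value: [ra.value!, rb.value!] };
    };
}
```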

Traversable

A "traversable" version that shows you the connections between parse nodes, which will be used to generate stuff like debug information, could be used to render the parse tree as a graph and stuff like that.

Node objects here will have identical structure, with a name, children, etc. All information about the parsers will be encoded as values, even using maps instead of objects:

class ParseNode {
    kind: "Parser" | "Combinator"
    type: string
    kids: Map<string, ParseNode>
    options: Map<string, Primitive>
}

This similarly ensures monomorphism, and allows for this type of node to be embedded inside the logic node as part of the closure.

Traversing these objects will happen at most once during parsing, and only if there is an error that needs to be communicated to the user.
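
As an example of that one-time traversal, here's a sketch that renders such nodes as an indented outline for an error report (the `ParseNode` shape is from the snippet above; the formatter itself is illustrative):

```typescript
type Primitive = string | number | boolean;

// Same shape as the traversable node above, made constructible.
class ParseNode {
    constructor(
        readonly kind: "Parser" | "Combinator",
        readonly type: string,
        readonly kids: Map<string, ParseNode>,
        readonly options: Map<string, Primitive>
    ) {}
}

// Render the tree as an indented outline, one line per node.
function outline(node: ParseNode, depth = 0): string {
    const pad = "  ".repeat(depth);
    const opts = [...node.options].map(([k, v]) => `${k}=${v}`).join(" ");
    const line = `${pad}${node.type}${opts ? ` (${opts})` : ""}`;
    const kids = [...node.kids.values()].map(k => outline(k, depth + 1));
    return [line, ...kids].join("\n");
}
```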

API Node

This kind of node is what users will interact with. Users pass these objects around and build parsers with them, but when the parser is actually constructed, each API node is used to construct a logic node and a traversable node, and it's not embedded in the parse tree at all.

This is because API nodes are very polymorphic, having different methods, properties, classes, and so on. So they can't be used during execution.

You can view them as handy tools to construct the more opaque, streamlined logic and traversable nodes.

Pretty much every bit of documentation I've written involves these kinds of nodes. So they're the ones that support tuners, have the pipe method, and so on. They're the ones that are returned from parser factories too.
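
Here's a toy sketch of that relationship: the API node is just a construction-time recipe, and compiling it produces the streamlined pieces, so the API node itself never appears in the executed parse tree. All names here are illustrative.

```typescript
interface State { input: string; position: number; }
type Logic = (state: State) => boolean;

// The traversable description, kept alongside the logic closure.
interface Describe { type: string; kids: Describe[]; }
interface Compiled { logic: Logic; describe: Describe; }

// The polymorphic, user-facing node: it only holds a recipe.
class ApiNode {
    constructor(private readonly build: () => Compiled) {}

    // Rich user-facing API lives here, e.g. a pipe method.
    pipe(wrap: (inner: Compiled) => Compiled): ApiNode {
        return new ApiNode(() => wrap(this.build()));
    }

    // Compiling yields the logic + traversable pair used at parse time.
    compile(): Compiled { return this.build(); }
}

// A parser factory returns an API node, as described above.
const char = (c: string) =>
    new ApiNode(() => ({
        logic: s => s.input[s.position] === c && (s.position++, true),
        describe: { type: `char(${c})`, kids: [] }
    }));
```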

GregRos avatar Apr 14 '25 21:04 GregRos