
Implement packrat parsing

Open poteat opened this issue 3 years ago • 5 comments

Packrat Parsing Implementation

As discussed in #74, this pull request implements packrat parsing using an internally specified set of memoization functions in `./src/cache.ts`.

  • 1. Checklist
  • 2. Discussion
    • 2.1. Race Condition Risk
    • 2.2. Incremental Parsing
  • 3. Suggested release notes
  • 4. Questions to resolve

1. Checklist

  • [ ] This PR only introduces changes to documentation.
    • Please include a summary of changes and an explanation
  • [X] This PR adds new functionality
    • [X] Is there a related issue?
      • Please note the related issue below, and how this PR relates to the issue if appropriate
    • [X] Does the code style reasonably match the existing code?
    • [ ] Are the changes tested (using the existing format, as far as is possible)?
    • [ ] Are the changes documented in the readme with a suitable example?
    • [ ] Is the table of contents updated?
  • [ ] This PR introduces some other kind of change
    • Please explain the change below

2. Discussion

Packrat parsing [1] is a technique for enabling linear-time backtracking in top-down parsers. A common issue with recursive descent parsing is exponential blowup, whereby parse time scales exponentially with input length because the same input positions are re-parsed over and over during backtracking.
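To make the redundancy concrete, here is a small illustration using arcsecond's `choice` and `sequenceOf` combinators (the grammar itself is made up for illustration):

```typescript
import { choice, sequenceOf, str } from 'arcsecond';

// Both alternatives begin with the same prefix. Without memoization,
// when the first alternative fails at 'b', the parser backtracks and
// re-parses the prefix from scratch for the second alternative.
// Nesting such choices compounds the re-parsing into exponential work.
const prefix = str('aaaa');

const parser = choice([
  sequenceOf([prefix, str('b')]),
  sequenceOf([prefix, str('c')]),
]);

parser.run('aaaac'); // prefix is parsed twice without packrat caching
```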

Our implementation uses a memoization wrapper around the incremental state update function of each parser. The full, global cache is cleared on every run.

[1] Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking (PDF). Bryan Ford, 2002, MIT
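
A minimal sketch of the wrapper idea (all identifiers here are illustrative; the actual names in `./src/cache.ts` may differ):

```typescript
// Minimal stand-in for arcsecond's internal parser state.
interface ParserState {
  target: string;
  index: number;
  result: unknown;
  isError: boolean;
}

type StateTransform = (state: ParserState) => ParserState;

// The full, global cache, shared by every parser in the program.
const globalCache = new Map<string, ParserState>();

const memoizeTransform =
  (parserId: number, transform: StateTransform): StateTransform =>
  (state) => {
    // Keyed on parser identity and input position only, i.e. on state.
    const key = `${parserId}:${state.index}`;
    const cached = globalCache.get(key);
    if (cached !== undefined) return cached;
    const result = transform(state);
    globalCache.set(key, result);
    return result;
  };

// Called at the start of every run/fork so that entries never leak
// between parses of different targets.
const clearCache = (): void => globalCache.clear();
```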

2.1. Race Condition Risk

If two instances of `run` are executing in parallel, the shared global cache could be poisoned, producing incorrect parse results. Since the parser does not operate on the basis of Promises, it is unclear to me whether this is a realistic risk.

2.2. Incremental Parsing

This implementation resets the global cache on every `run` and `fork` call. This is necessary to avoid errors in parsing: because the cache is parameterized only on state, and not on target, it can become poisoned unless it is refreshed on every individual run, as shown below.
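To see why (reusing the illustrative state-only key scheme from the sketch above, and arcsecond's `digits` parser):

```typescript
import { digits } from 'arcsecond';

// Suppose digits is assigned parserId 7. Keyed only on
// (parserId, index), both runs below look up the same key "7:0".
digits.run('123abc'); // populates "7:0" with a successful "123" match
digits.run('xyz');    // without clearing, "7:0" would yield the stale
                      // "123" result instead of the correct error
```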

On our fork, I think we will attempt to parameterize the state update cache with target, so that the cache may persist across multiple runs, thereby implementing efficient incremental parsing. A configurable LRU cache would be used to prevent unbounded memory usage.

I have avoided implementing this more advanced form in this pull request, as it is far more experimental, opinionated, and risky. As yet, I am not convinced it is even possible to do easily: if target always corresponds to the full input, then parameterizing the cache on target won't implement incremental parsing.
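
For concreteness, a rough sketch of the target-parameterized LRU idea (entirely hypothetical, and subject to the caveat above):

```typescript
// Hypothetical LRU cache keyed on (target, parserId, index), which
// would let entries persist safely across runs on the same input.
class LRUCache<V> {
  private entries = new Map<string, V>();
  constructor(private maxEntries: number) {}

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value !== undefined) {
      // Re-insert to mark the entry as most recently used.
      this.entries.delete(key);
      this.entries.set(key, value);
    }
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    if (this.entries.size >= this.maxEntries) {
      // Evict the least recently used entry (oldest insertion order).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }
}

const incrementalKey = (target: string, parserId: number, index: number) =>
  `${target}:${parserId}:${index}`;
```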

3. Suggested release notes

Versioning is left up to maintainer discretion, but for convenience here is my suggested changelog.

## 4.1.0

- Implement packrat parsing to improve worst-case parsing performance
  - All incremental state update functions are cached
  - The entire cache is cleared before `run` or `fork` begins parsing

4. Questions to resolve

  • Do we want tests for `cache.ts`? I did not want to bloat the pull request unnecessarily, but I can write some if we do.
  • Do we want to wait for incremental linear parsing after all?
  • Should changing the asymptotic complexity of a function count as a breaking change?
  • Is any additional documentation necessary about this internal design?

poteat avatar Aug 10 '21 18:08 poteat

hey @poteat - thanks for putting this together so quickly! I'm on a little family retreat for a few days, so I'll take a proper look when I get back next week 👍

francisrstokes avatar Aug 13 '21 12:08 francisrstokes

@francisrstokes I'd love to pick this back up at some point!

poteat avatar May 27 '22 00:05 poteat

@francisrstokes and @poteat, this pull request has been pending for a while. Can it be merged with the base? With the kind of performance improvements mentioned in the parent issue, I think this will be a great win for all arcsecond users.

rishavs avatar Jul 15 '22 11:07 rishavs

My apologies for not getting back to this PR in a more timely manner (to put it lightly!). I would also love to see the performance benefits provided by this approach make their way into the repo. I do have concerns about the cache though - exactly what you have mentioned @poteat - that it essentially disallows concurrent usage of the library.

To solve this, we need a mechanism that allows multiple, contextual caches, or at a minimum, the ability to opt in and out of caching.

Instead of caching at construction time, we might want to have this process take place at `parser.run` time, using a user-provided cache object. What do you both think?
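Roughly this shape (an entirely hypothetical sketch; nothing here exists in arcsecond today):

```typescript
// Hypothetical cache object owned by the caller rather than the library.
class ParserCache {
  private entries = new Map<string, unknown>();
  get(key: string) { return this.entries.get(key); }
  set(key: string, value: unknown) { this.entries.set(key, value); }
  clear() { this.entries.clear(); }
}

// Envisioned call shape: caching is wired up at run time rather than
// at parser construction time, so each caller owns its own cache and
// concurrent runs cannot poison one another:
//
//   parser.run(input, { cache: new ParserCache() });
//   parser.run(input); // no cache supplied -> no memoization
```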

francisrstokes avatar Jul 15 '22 11:07 francisrstokes

I fear I do not understand the idea well enough to have an opinion on a user-provided cache object. As for the caching opt-in, I think a simple caching on/off flag would be a great first step. Based on further user feedback, you can then decide the best way forward. For now, any way to stop combinatorial explosions would be welcome.

rishavs avatar Jul 15 '22 19:07 rishavs