Implement packrat parsing
Packrat Parsing Implementation
As discussed in #74, this pull request implements packrat parsing using an internally specified set of memoization functions in `./src/cache.ts`.
- 1. Checklist
- 2. Discussion
  - 2.1. Race Condition Risk
  - 2.2. Incremental Parsing
- 3. Suggested release notes
- 4. Questions to resolve
1. Checklist
- [ ] This PR only introduces changes to documentation
  - Please include a summary of changes and an explanation
- [X] This PR adds new functionality
  - [X] Is there a related issue?
    - Please note the related issue below, and how this PR relates to the issue if appropriate
  - [X] Does the code style reasonably match the existing code?
  - [ ] Are the changes tested (using the existing format, as far as is possible)?
  - [ ] Are the changes documented in the readme with a suitable example?
  - [ ] Is the table of contents updated?
- [ ] This PR introduces some other kind of change
  - Please explain the change below
2. Discussion
Packrat parsing [1] is a technique for enabling linear-time backtracking in top-down parsers. A common issue with recursive descent parsing is exponential blowup: with enough backtracking, the time it takes to parse a string can scale exponentially with respect to input length. Packrat parsing avoids this by memoizing each parser's result at each input position, so that no parser is ever re-applied to a position it has already seen.
Our implementation here uses a memoization wrapper around the incremental state update function of each parser. The full, global cache state is cleared on every run.
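To make the mechanism concrete, here is a minimal sketch of the idea (the names below are illustrative, and do not necessarily match what `./src/cache.ts` actually exports):

```typescript
// Illustrative sketch only - names here do not necessarily match ./src/cache.ts.
// Every parser is driven by an incremental state update function that maps an
// input ParserState to an output ParserState.
type ParserState = {
  target: string; // the full input being parsed
  index: number; // current position within the input
  result: unknown;
  isError: boolean;
};

type StateTransformer = (state: ParserState) => ParserState;

// Registry of all per-parser caches, so they can be wiped together.
const allCaches: Array<Map<number, ParserState>> = [];

// Called at the start of every run/fork. Because entries are keyed on
// `index` alone (not on `target`), stale entries from a previous input
// would otherwise poison the next parse.
export const clearCache = (): void => {
  for (const cache of allCaches) cache.clear();
};

// Wrap a state transformer so each input position is computed at most once
// per run - the essence of packrat parsing.
export const memoize = (fn: StateTransformer): StateTransformer => {
  const cache = new Map<number, ParserState>();
  allCaches.push(cache);
  return (state) => {
    // Propagate incoming error states without touching the cache.
    if (state.isError) return state;
    const cached = cache.get(state.index);
    if (cached !== undefined) return cached;
    const next = fn(state);
    cache.set(state.index, next);
    return next;
  };
};
```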
2.1. Race Condition Risk
If two instances of `run` are running in parallel, this could lead to a poisoned cache and incorrect parse results. Since the parser does not operate on the basis of Promises, it is unclear to me whether this is an important risk.
2.2. Incremental Parsing
This implementation resets the global cache on every `run` and `fork` call. This is necessary to avoid errors in parsing - because the cache is parameterized only on `state`, and not `target`, the cache can become poisoned unless it is refreshed on every individual run.
On our fork, I think we will attempt to parameterize the state update cache with `target`, so that the cache may persist across multiple runs, thereby implementing efficient incremental parsing. A configurable LRU cache will be used to prevent unbounded memory usage.

I have avoided implementing this more advanced form in this pull request, as it is a lot more experimental, opinionated, and risky. As yet, I am not convinced it's even possible to do easily - if `target` always corresponds to the full input, then parameterizing the cache on `target` won't implement incremental parsing.
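For reference, the incremental variant would amount to widening the cache key and bounding the cache, roughly as sketched below. This describes the planned fork behaviour only, with illustrative names; none of it is part of this PR:

```typescript
// Sketch of the proposed incremental variant (not part of this PR).
// ParserState as in the earlier sketch, reduced to the fields keyed on here.
type ParserState = { target: string; index: number };

// Key on both the input and the position, so entries can survive across
// runs rather than being wiped at the start of each one. The \u0000
// separator is a cheap way to avoid collisions between e.g. target "a"
// at index 12 and target "a1" at index 2.
const makeKey = (state: ParserState): string =>
  `${state.target}\u0000${state.index}`;

// A naive LRU bound built on Map's insertion order, so that memory use
// cannot grow without limit; the capacity would be configurable.
class LruCache<V> {
  private map = new Map<string, V>();
  constructor(private capacity: number) {}

  get(key: string): V | undefined {
    const value = this.map.get(key);
    if (value !== undefined) {
      // Re-insert to mark this entry as most recently used.
      this.map.delete(key);
      this.map.set(key, value);
    }
    return value;
  }

  set(key: string, value: V): void {
    if (this.map.has(key)) {
      this.map.delete(key);
    } else if (this.map.size >= this.capacity) {
      // Evict the least recently used entry (first in insertion order).
      const oldest = this.map.keys().next().value;
      if (oldest !== undefined) this.map.delete(oldest);
    }
    this.map.set(key, value);
  }
}

// A persistent, bounded cache shared across runs. Inside a memoized
// transformer, lookups would then use the composite key:
//   const hit = incrementalCache.get(makeKey(state));
const incrementalCache = new LruCache<ParserState>(10_000);
```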
3. Suggested release notes
Versioning is left up to maintainer discretion, but for convenience here is my suggested changelog.
## 4.1.0
- Implement Packrat Parsing to improve asymptotic performance
- All incremental state update functions are cached
- The entire cache is cleared before `run` or `fork` begins parsing
4. Questions to resolve
- Do we want tests for `cache.ts`? I did not want to unnecessarily bloat the pull request, but if we want them I can write some.
- Do we want to wait for incremental linear parsing after all?
- Should changing the time complexity of a function count as a breaking change?
- Is any additional documentation necessary about this internal design?
hey @poteat - thanks for putting this together so quickly! I'm on a little family retreat for a few days, so I'll take a proper look when I get back next week 👍
@francisrstokes I'd love to pick this back up at some point!
@francisrstokes and @poteat, this pull request has been pending for a while. Can this be merged with the base? Given the kind of performance improvements mentioned in the parent issue, I think this will be a great win for all arcsecond users.
My apologies for not getting back to this PR in a more timely manner (to put it lightly!). I would also love to see the performance benefits provided by this approach make their way into the repo. I do have concerns about the cache though - exactly what you have mentioned @poteat - that it essentially disallows concurrent usage of the library.
To solve this, we need a mechanism where we can have multiple, contextual caches, or at a minimum, the ability to opt in and out of caching.
Instead of caching at construction time, we might want to have this process take place at `parser.run` time, using a user-provided cache object.
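Purely as a hypothetical sketch of the shape I have in mind (none of these names exist in the library today):

```typescript
// Hypothetical API shape only - nothing here exists in arcsecond today.
type ParserState = { target: string; index: number };

interface PackratCache {
  get(parserId: number, index: number): ParserState | undefined;
  set(parserId: number, index: number, state: ParserState): void;
  clear(): void;
}

declare function createPackratCache(): PackratCache;
declare const parser: {
  run(target: string, options?: { cache?: PackratCache }): unknown;
};

// Opt-in, contextual caching: each run owns its cache, so two concurrent
// runs can never poison one another.
const cache = createPackratCache();
parser.run('some input', { cache });

// Existing call sites are untouched and get no caching.
parser.run('some input');
```

What do you both think?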
I fear I do not understand the idea well enough to have an opinion on a user-provided cache object. Regarding the caching opt-in, I think a simple caching on/off flag would be a great first step. Based on further user feedback, you can then decide the best way to go forward. For now, any way to stop combinatorial explosions would be welcome.