Optimize regular expressions
This library makes heavy use of regular expressions. While most of them should be fairly performant, there is almost certainly room to make them faster. Examples of improvements might include:
- Replacing non-regex parsing logic with regular expressions (if that's quicker)
- Replacing regex-based parsing with logic that doesn't use regular expressions (if that's quicker)
- Combining multiple regexes into one (if that's quicker)
- Fixing excessive backtracking in expressions
- Other improvements to existing expressions
- ???
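As a rough illustration of the backtracking item above, the sketch below times an ambiguous pattern against a possessive rewrite on a subject that almost matches. Both patterns and the `timeMatch()` helper are hypothetical and are not taken from this library; the timings depend on the PHP version and on whether PCRE JIT is enabled, so treat the output as indicative only.

```php
<?php
// Hypothetical patterns for illustration; neither is taken from the library.
$ambiguous  = '/^(?:a+)+$/';   // nested quantifiers: a run of "a" can be split in many ways
$possessive = '/^(?:a++)+$/';  // a++ is possessive and never gives characters back

/**
 * Runs preg_match() once and returns [result, elapsed milliseconds, PCRE error code].
 * A result of false with a non-zero error code means PCRE gave up (for example,
 * because pcre.backtrack_limit was reached).
 */
function timeMatch(string $pattern, string $subject): array
{
    $start     = hrtime(true);
    $result    = preg_match($pattern, $subject);
    $elapsedMs = (hrtime(true) - $start) / 1e6;

    return [$result, $elapsedMs, preg_last_error()];
}

// A near-miss subject: a run of "a" followed by a character that forces failure.
$subject = str_repeat('a', 20) . '!';

foreach (['ambiguous' => $ambiguous, 'possessive' => $possessive] as $name => $pattern) {
    [$result, $elapsedMs, $error] = timeMatch($pattern, $subject);
    printf("%-10s result=%s error=%d time=%.3f ms\n", $name, var_export($result, true), $error, $elapsedMs);
}
```

The possessive version rejects the subject without retrying every way of splitting the run of `a` characters. The same idea applies to atomic groups (`(?>...)`), but either change is only safe when the pattern never needs to give those characters back.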
Tools that could help here include:
- The debugger on https://regex101.com/, especially to check for excessive backtracking
- Our benchmark.php script
- A performance profiler like Blackfire
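Before reaching for the full benchmark or a profiler, a candidate change can be sanity-checked in isolation. The sketch below compares a regex-based check with a non-regex equivalent, first confirming that they agree and then timing each. The function names, the "blank line" check, and the sample inputs are all assumptions made for illustration; none of them come from this library.

```php
<?php
// Hypothetical helpers for illustration; they are not part of the library.

/** Regex version: the line contains only spaces and tabs (or is empty). */
function isBlankRegex(string $line): bool
{
    // \z anchors at the very end, so a trailing newline does not count as blank.
    return preg_match('/^[ \t]*\z/', $line) === 1;
}

/** Non-regex version: the leading run of spaces/tabs covers the whole string. */
function isBlankNoRegex(string $line): bool
{
    return strspn($line, " \t") === strlen($line);
}

/** Calls $fn on every input $iterations times and returns elapsed milliseconds. */
function benchmark(callable $fn, array $inputs, int $iterations): float
{
    $start = hrtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        foreach ($inputs as $input) {
            $fn($input);
        }
    }

    return (hrtime(true) - $start) / 1e6;
}

$inputs = ['', '   ', "\t \t", 'text', '  indented text', "  \n"];

// Check equivalence first: a faster implementation that changes behavior is a bug.
foreach ($inputs as $input) {
    if (isBlankRegex($input) !== isBlankNoRegex($input)) {
        throw new RuntimeException('Implementations disagree on: ' . json_encode($input));
    }
}

printf("regex:    %.2f ms\n", benchmark('isBlankRegex', $inputs, 20000));
printf("no regex: %.2f ms\n", benchmark('isBlankNoRegex', $inputs, 20000));
```

A micro-benchmark like this only hints at which direction is faster. Any real change should still be confirmed against the library's own benchmark on realistic documents.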
A partial list of areas where regex is used in this library includes:
- https://github.com/thephpleague/commonmark/blob/main/src/Util/RegexHelper.php
- https://github.com/thephpleague/commonmark/blob/main/src/Parser/Cursor.php
- Implementations of:
  - `BlockStartParserInterface::tryStart()`
  - `BlockContinueParserInterface::tryContinue()`
  - `InlineParserInterface::parse()`
- How `InlineParserMatch` builds regular expressions, which are then used by `InlineParser`
I will accept (almost) any PR that aims to improve performance, though I would ask that you keep the following in mind:
- The performance improvement should be measurable, using either our performance benchmark or some other means
- Improvements that don't break BC are preferred, though substantial improvements requiring a major version bump would be considered
- The rationale behind the improvements should either be obvious or have a description in the PR explaining what you did and why
I'm removing the v2.1 milestone as I've already tested a number of expressions and am fairly happy with the current state of things. However, I'll keep this open in case any regex experts want to dig deeper and maybe find something that I missed.
Regexes with lots of alternations, like the one linked here, could be optimized:
https://github.com/thephpleague/commonmark/blob/42781fde669f255b7e2ca12ffdcd7ac8d95ee64f/src/Util/RegexHelper.php#L44
Several alternations could be reduced by combining similar ones into optional atomic groups, but readability and maintainability go down the toilet. However, I cannot find where that specific regex is used.
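To make the trade-off concrete, here is a minimal sketch of factoring shared prefixes out of an alternation. The tag list is hypothetical and is not the expression linked in this comment. It uses ordinary non-capturing groups rather than atomic ones so that the two patterns stay strictly equivalent. Whether the factored form is actually faster would need to be measured, since PCRE already applies some optimizations of its own.

```php
<?php
// Hypothetical tag list for illustration; this is not the library's expression.
$flat     = '/^<(?:address|article|aside|blockquote|body|base|basefont)\b/i';

// Same alternatives with common prefixes factored out. Shorter for the engine
// to walk, but noticeably harder for a human to read and extend.
$factored = '/^<(?:a(?:ddress|rticle|side)|b(?:lockquote|ody|ase(?:font)?))\b/i';

$samples = [
    '<address>', '<article class="x">', '<aside>', '<blockquote>', '<body>',
    '<base href="/">', '<basefont>', '<ASIDE>',           // expected to match
    '<basefoo>', '<bodyguard>', '<a>', '<b>', 'address',  // expected not to match
];

// Both patterns must agree on every sample before any timing is worth doing.
foreach ($samples as $sample) {
    $a = preg_match($flat, $sample);
    $b = preg_match($factored, $sample);
    if ($a !== $b) {
        throw new RuntimeException('Patterns disagree on: ' . $sample);
    }
    printf("%-22s %s\n", $sample, $a === 1 ? 'match' : 'no match');
}
```

Even at this small size the factored pattern is harder to audit, which is the readability cost described above.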