
Optimize regular expressions

Open colinodell opened this issue 3 years ago • 2 comments

This library makes heavy use of regular expressions. While most of them should be fairly performant, there is certainly room to make the library faster. Examples of improvements might include:

  1. Replacing non-regex parsing logic with regular expressions (if that's quicker)
  2. Replacing regex-based parsing with logic that doesn't use regular expressions (if that's quicker)
  3. Combining multiple regexes into one (if that's quicker)
  4. Fixing excessive backtracking in expressions
  5. Other improvements to existing expressions
  6. ???
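
To illustrate item 4, here is a minimal sketch of how nested quantifiers can be flattened with a possessive quantifier. Both patterns are hypothetical and are not taken from this library; they only show the shape of the rewrite and a quick check that the rewritten pattern still accepts and rejects the same inputs.

```php
<?php
// Hypothetical patterns for illustration only (not from the library).
$nested = '/^(?:a+)+$/'; // nested quantifiers: a run of "a" can be split many ways
$flat   = '/^a++$/';     // possessive quantifier: the run is consumed once, never re-split

$subjects = ['a', 'aaaa', '', 'aab', 'baa'];
$results  = [];

foreach ($subjects as $subject) {
    // Record the result of each pattern side by side (1 = match, 0 = no match).
    $results[$subject] = [
        preg_match($nested, $subject),
        preg_match($flat, $subject),
    ];
}

foreach ($results as $subject => [$a, $b]) {
    printf("%-6s nested=%d flat=%d\n", var_export((string) $subject, true), $a, $b);
}
```

On long subjects that almost match, a pattern shaped like the nested one can do far more work than the flat one. If `preg_match()` returns `false` in such a case, `preg_last_error()` can be compared against `PREG_BACKTRACK_LIMIT_ERROR` to see whether the backtracking limit was the cause.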

Tools that could help here include:

  • The debugger on https://regex101.com/, especially to check for excessive backtracking
  • Our benchmark.php script
  • A performance profiler like Blackfire

A partial list of areas where regex is used in this library includes:

  • https://github.com/thephpleague/commonmark/blob/main/src/Util/RegexHelper.php
  • https://github.com/thephpleague/commonmark/blob/main/src/Parser/Cursor.php
  • Implementations of:
    • BlockStartParserInterface::tryStart()
    • BlockContinueParserInterface::tryContinue()
    • InlineParserInterface::parse()
  • How InlineParserMatch builds regular expressions, which are then used by InlineParser

I will accept (almost) any PR that aims to improve performance, though I would ask that you keep the following in mind:

  • The performance improvement should be measurable, using either our performance benchmark or some other means
  • Improvements that don't break BC are preferred, though substantial improvements requiring a major version bump would be considered
  • The rationale behind the improvements should either be obvious or have a description in the PR explaining what you did and why
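
As a rough sketch of what "measurable" could look like for a single expression, the snippet below times two equivalent patterns with `hrtime()`. The helper name `averageMatchTimeNs()`, both patterns, and the sample subject are assumptions made for illustration; this does not replace the benchmark script or a profiler, and it makes no claim about which pattern is faster on your machine.

```php
<?php
/**
 * Hypothetical helper: average nanoseconds per preg_match() call.
 */
function averageMatchTimeNs(string $pattern, string $subject, int $iterations = 10000): float
{
    // Warm-up call so that one-time pattern compilation is not included in the timing.
    preg_match($pattern, $subject);

    $start = hrtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_match($pattern, $subject);
    }

    return (hrtime(true) - $start) / $iterations;
}

// Illustrative patterns only: a lazy match and an "unrolled" possessive equivalent.
$lazy     = '/<!--.*?-->/s';
$unrolled = '/<!--[^-]*+(?:-(?!->)[^-]*+)*+-->/';

$subject = str_repeat('lorem ipsum ', 50) . '<!-- comment -->';

$before = averageMatchTimeNs($lazy, $subject);
$after  = averageMatchTimeNs($unrolled, $subject);

printf("lazy: %.0f ns/match, unrolled: %.0f ns/match\n", $before, $after);
```

Numbers from a micro-benchmark like this vary between runs and PHP builds (for example, with or without PCRE JIT), so they are best used alongside a whole-document benchmark rather than instead of one.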

colinodell, Jun 19 '21 14:06

I'm removing the v2.1 milestone as I've already tested a number of expressions and am fairly happy with the current state of things. However, I'll keep this open in case any regex experts want to dig deeper and maybe find something that I missed.

colinodell, Nov 07 '21 17:11

Regexes with lots of alternations, like the one linked below, could be optimized:

https://github.com/thephpleague/commonmark/blob/42781fde669f255b7e2ca12ffdcd7ac8d95ee64f/src/Util/RegexHelper.php#L44

Several alternations could be reduced by combining similar ones into optional atomic groups, but readability and maintainability go down the toilet. However, I cannot find where that specific regex is used.
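
For anyone unfamiliar with the technique, here is a small sketch of the idea using a made-up list of tag names. It is not the regex at the linked line, only an illustration of factoring a shared prefix into atomic groups and checking that both forms agree.

```php
<?php
// Hypothetical alternation list for illustration (not the linked regex).
$plain    = '/^(?:h1|h2|h3|h4|h5|h6|head|header|hr|html)$/';

// Same set of names with the shared "h" prefix factored out and atomic groups
// so the engine does not retry alternatives it has already committed to.
$factored = '/^h(?>[1-6]|ead(?>er)?|r|tml)$/';

$subjects = [
    'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'head', 'header', 'hr', 'html', // expected to match
    'h', 'h7', 'heade', 'htm', 'headers', 'div',                          // expected not to match
];

$disagreements = [];
foreach ($subjects as $subject) {
    if (preg_match($plain, $subject) !== preg_match($factored, $subject)) {
        $disagreements[] = $subject;
    }
}

echo $disagreements === []
    ? "Both patterns agree on all sample subjects\n"
    : 'Patterns disagree on: ' . implode(', ', $disagreements) . "\n";
```

The factored form shows the readability cost mentioned above: the list of names is no longer visible at a glance, so a check like this (or a unit test) would be needed to keep the two in sync.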

live627, Mar 22 '23 12:03