Provide interface for iterative matching, to avoid catastrophic backtracking

Open danon opened this issue 3 years ago • 2 comments

Currently, users of PHP regexp have only two choices:

  • preg_match_all(), which performs all available matches right away
  • preg_match(), which performs only a single match

The problem with preg_match_all() is that sometimes users need only the first 2 or 3 matches, while the 4th match would cause catastrophic backtracking. Currently, this can force users to call preg_match() repeatedly, using substr() or the $offset parameter to find the subsequent matches, because preg_match_all() doesn't suffice.
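The workaround described above could be sketched roughly as follows. `firstMatches()` is a hypothetical helper, not part of PHP or T-Regx, and stepping one byte past an empty match is a simplifying assumption that may differ from what preg_match_all() does for patterns that can match the empty string.

```php
<?php
// Hypothetical helper (not part of PHP or T-Regx): returns at most $limit
// matches by calling preg_match() repeatedly with an advancing $offset.
function firstMatches(string $pattern, string $subject, int $limit): array {
    $matches = [];
    $offset = 0;
    while (count($matches) < $limit && $offset <= strlen($subject)) {
        if (preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $offset) !== 1) {
            break; // no further match, or an error such as the backtrack limit
        }
        [$text, $start] = $m[0];
        $matches[] = $text;
        // Assumption: continue right after the match; step one byte past an
        // empty match so that the loop always makes progress.
        $offset = $start + max(strlen($text), 1);
    }
    return $matches;
}

var_dump(firstMatches('/\d+/', '12, 34, 56, 78', 2)); // ['12', '34']
```

Because the loop stops as soon as $limit matches are collected, the 3rd and 4th matches are never attempted in this example.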

With replacing, the number of operations can be controlled: preg_replace() and preg_replace_callback(), for example, provide a $limit parameter, which controls precisely how many replacements are performed. With matching, that is not possible.
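For reference, a minimal illustration of the $limit parameter of preg_replace(); the pattern and subject are made up for the example.

```php
<?php
// $limit = 2: only the first two matches are replaced.
// $count receives the number of replacements actually performed.
$result = preg_replace('/\d+/', '#', '12, 34, 56, 78', 2, $count);

var_dump($result); // '#, #, 56, 78'
var_dump($count);  // 2
```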

danon, Dec 01 '21 23:12

This may not be possible.

It is only possible if the following assumption is correct: the next match should be attempted at the offset equal to the offset of the previous match plus the length of the whole match (not of its capturing groups).

Is this assumption always true?

// Offset at which the search for the next match should start.
function nextOffset(string $match, int $offset): int {
  return $offset + \strlen($match);
}

Even if you include \K resets, that will still hold, and look-arounds with groups shouldn't change anything, but is there anything else missing here?

PS: Anchoring with the A modifier also works fine.
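One way to probe the assumption empirically is to compare the offsets found one preg_match() call at a time against those reported by preg_match_all(). The sketch below does that for a \K reset, a look-behind and a look-ahead. `iterativeOffsets()` and `allOffsets()` are hypothetical helpers, the patterns and subject are arbitrary, and empty matches are deliberately left out. Agreement on a few samples does not prove the assumption in general.

```php
<?php
function nextOffset(string $match, int $offset): int {
    return $offset + \strlen($match);
}

// Hypothetical helper: offsets of all matches, found one preg_match() at a time.
function iterativeOffsets(string $pattern, string $subject): array {
    $offsets = [];
    $offset = 0;
    while (preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $offset) === 1) {
        [$text, $start] = $m[0];
        if ($text === '') {
            break; // empty matches are out of scope for this check
        }
        $offsets[] = $start;
        $offset = nextOffset($text, $start);
    }
    return $offsets;
}

// Hypothetical helper: offsets reported by preg_match_all(), for comparison.
function allOffsets(string $pattern, string $subject): array {
    preg_match_all($pattern, $subject, $m, PREG_OFFSET_CAPTURE);
    return array_column($m[0], 1);
}

$subject = 'foobar,abab,foobar,';
foreach (['/foo\Kbar/', '/(?<=a)b/', '/\w+(?=,)/'] as $pattern) {
    // true when both approaches report the same match offsets
    var_dump(iterativeOffsets($pattern, $subject) === allOffsets($pattern, $subject));
}
```

Passing $offset to preg_match(), rather than cutting the subject with substr(), keeps the text before the offset visible to look-behinds, which is why the `(?<=a)b` case still finds its second match.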

danon, Aug 02 '22 21:08

Perhaps use the $limit parameter of preg_replace()?
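A rough sketch of that idea, using preg_replace_callback() so the matches can be collected in the callback. `limitedMatches()` is a hypothetical helper, not an existing PHP or T-Regx function. The sketch only shows that the callback runs at most $limit times; whether PCRE internally attempts one more match after the limit is reached is not verified here.

```php
<?php
// Hypothetical helper: collects at most $limit matches through the $limit
// parameter of preg_replace_callback(); the replaced string is discarded.
function limitedMatches(string $pattern, string $subject, int $limit): array {
    $matches = [];
    preg_replace_callback(
        $pattern,
        function (array $m) use (&$matches): string {
            $matches[] = $m; // $m[0] is the whole match, $m[1..] the groups
            return $m[0];    // leave the subject unchanged
        },
        $subject,
        $limit
    );
    return $matches;
}

print_r(limitedMatches('/(\w)(\d)/', 'a1 b2 c3 d4', 3));
```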

danon, Sep 07 '22 21:09