pomsky icon indicating copy to clipboard operation
pomsky copied to clipboard

Warn when a quantifier is probably wrong

Open Aloso opened this issue 3 years ago • 0 comments

Pomsky can show a warning when a quantifier is implicitly greedy when it should probably be lazy. This is the case when an overly generic expression (such as . or C) is repeated, which often causes the match to be larger than desired.


Pomsky can also show a warning when a quantifier is implicitly lazy even though it most likely should be greedy. For example:

enable lazy;
'a'* 'b'*

Note that this is much harder to implement.

Why is this bad? When part of a regex is repeated lazily and not followed by anything else, it will repeat as few times as possible (in this case, never, so all matches are empty). So the above is equivalent to 'a'*, or even ''. This is probably not the intention of the person who wrote this expression. What they meant is 'a'* greedy 'b'* greedy, matching any number of as, followed by any number of bs.

Another example:

enable lazy;
'a'* 'b' ('c'* | 'd'?)

The first repetition doesn't need to be greedy. Since it must be followed by a b, the regex engine has to repeat a until a b is encountered. However, both alternatives in the group ('c'* | 'd'?) are lazy, so the regex engine can stop before even trying to find a c or d.

The rule I'd like to enforce is this: If a repetition is possibly last, it should be greedy. An expression is possibly last if it is followed only by optional expressions:

  • the empty string is optional
  • any repetition is optional, if it is allowed to repeat 0 times
  • a group is optional if every expression inside it is optional
  • an alternation is optional if it contains at least one optional branch
  • lookarounds are optional if everything they contain is optional
  • in addition, lookbehinds are optional if they are guaranteed to match after the previous non-optional expression (checking this algorithmically is very hard)
  • backreferences are optional if the group they reference would be optional at that position
  • recursion is optional if the entire expression is optional
  • character classes and boundaries are never optional

Pomsky can iterate from back to front and check for each expression, recursively, if it is optional. If it encounters a non-greedy, non-constant repetition, it should issue a warning that it should be greedy. When it encounters a non-optional expression, stop iterating at that level after issuing the aforementioned warning, if it applies.

Aloso avatar Feb 27 '22 03:02 Aloso