
Missing boundaries to separate keywords from other strings

Open farhan443 opened this issue 2 years ago • 0 comments

Many regex patterns for many languages are missing boundaries that separate keywords from the surrounding text, which means they can match even when the keyword is inside another word.

Example:

The Python regex that matches the class keyword:

/class\s*\w+(\(\s*\w+\s*\))?\s*:/

It can match:

  • def upper-class name(param):
  • subclass name(param):
  • classroom1(3):
  • classmate__(_):
  • classic(a):

They're not class declarations, but they still get matched because the regex only looks at whether they contain "class" and doesn't check whether it is surrounded by other letters.
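To make this concrete, here is a minimal Python sketch that runs the pattern above against a few of the listed lines. It assumes the pattern is applied as a search anywhere in the line (`re.search`); the variable names are only for illustration.

```python
import re

# Pattern quoted in the issue, with no boundary before or after "class".
CLASS_PATTERN = re.compile(r"class\s*\w+(\(\s*\w+\s*\))?\s*:")

# Lines from the list above; none of them is a class declaration.
samples = [
    "subclass name(param):",
    "classroom1(3):",
    "classmate__(_):",
    "classic(a):",
]

# A search scans the whole line, so "class" inside a longer word still counts.
false_positives = [s for s in samples if CLASS_PATTERN.search(s)]
print(false_positives)
```

Every sample ends up in `false_positives`, because the `\w+` after `class` happily consumes the rest of the word.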

A simple solution would be to surround the keywords with \b. This prevents them from matching when they are next to other word characters ( [A-Za-z0-9_] ). However, they will still match if they're next to punctuation.
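A rough sketch of the \b variant, again in Python and again assuming search-style matching. The test strings with a hyphen and a semicolon are made up for illustration.

```python
import re

# Sketch: the pattern from the issue with \b added on both sides of "class".
BOUNDED = re.compile(r"\bclass\b\s*\w+(\(\s*\w+\s*\))?\s*:")

inside_word = ["subclass name(param):", "classroom1(3):", "classic(a):"]
next_to_punctuation = ["upper-class name(param):", ";class Foo:"]

# "class" glued to other word characters is now rejected...
rejected = [s for s in inside_word if BOUNDED.search(s) is None]
# ...but a hyphen or a semicolon still counts as a word boundary.
still_matched = [s for s in next_to_punctuation if BOUNDED.search(s)]

print(rejected)
print(still_matched)
```

Here `rejected` contains all of `inside_word`, while both punctuation cases still match.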

Whether this is a problem depends on the language and the punctuation. In JavaScript, any statement can be preceded by a semicolon, because semicolons are used to terminate statements. That might not be the case in other languages.

Another fairly common solution is to surround the keywords with \s. This ensures that they can only be surrounded by whitespace. It introduces another problem, though: \s has to consume a character, so the keywords can no longer be matched at the start or the end of a line.
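A small Python sketch of the \s approach and where it breaks. It assumes each line is matched on its own with the trailing newline already stripped; the `pass` pattern is a made-up example of a keyword that usually ends a line.

```python
import re

# Sketch: "class" must have whitespace on both sides.
DECL = re.compile(r"\sclass\s+\w+(\(\s*\w+\s*\))?\s*:")
# Hypothetical pattern for a keyword that typically sits at the end of a line.
TRAILING = re.compile(r"\spass\s")

indented = DECL.search("    class Foo(Bar):")             # preceded by spaces
at_line_start = DECL.search("class Foo(Bar):")            # nothing before "class"
after_hyphen = DECL.search("upper-class name(param):")    # "-" is not whitespace
at_line_end = TRAILING.search("    pass")                 # nothing after "pass"

print(indented is not None, at_line_start, after_hyphen, at_line_end)
```

Only the indented declaration matches. The punctuation case is rejected as intended, but so are the perfectly valid top-level `class` line and the `pass` at the end of its line.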

An optimal solution would be to use an alternation and a custom character set to define the possible separators manually, e.g. (^|[\s;,]). While this would be effective, it could be harder to implement, because you have to know precisely which positions and/or characters can validly surround the keywords.
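As a sketch of that idea in Python: the separator set below is just the (^|[\s;,]) example, not a vetted list for any particular language, and requiring `\s+` after the keyword is my own assumption so that words like "classroom" are not accepted at the start of a line.

```python
import re

# Sketch: "class" must be at the start of the line or follow whitespace, ";" or ",".
SEPARATED = re.compile(r"(^|[\s;,])class\s+\w+(\(\s*\w+\s*\))?\s*:")

should_match = [
    "class Foo(Bar):",       # start of line
    "    class Foo:",        # after indentation
    ";class Foo:",           # after a separator from the custom set
]
should_not_match = [
    "subclass name(param):",     # preceded by a word character
    "upper-class name(param):",  # "-" is not in the separator set
    "classroom1(3):",            # no whitespace after "class"
]

matched = [s for s in should_match if SEPARATED.search(s)]
leaked = [s for s in should_not_match if SEPARATED.search(s)]

print(matched)
print(leaked)
```

With this separator set, all of `should_match` is found and `leaked` stays empty. The cost is that the set has to be chosen per language.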

farhan443, Nov 08 '21 02:11