eslint-plugin-regexp icon indicating copy to clipboard operation
eslint-plugin-regexp copied to clipboard

New rule to find bad capturing groups

Open RunDevelopment opened this issue 1 year ago • 3 comments

Description Add a new rule to find bad (= obviously or likely incorrect) capturing groups.

Bad capturing groups are:

  • Capturing groups that do not behave as their pattern would suggest. E.g. /\d+(\d*)/ will always capture the empty string despite containing the pattern \d*. In general, we should verify that the language of all possible captured strings is equal to the language of the pattern of a capturing group.
  • Capturing groups that likely behave incorrectly. E.g. /^(a*).+/u likely behaves incorrectly for the input string a. In general, we should verify that a capturing group cannot exchange characters with the pattern after it. This check should be behind an option to turn it off. It will find likely bad capturing groups, so false positives are to be expected.

Reference: #451, https://github.com/ota-meshi/eslint-plugin-regexp/pull/452/files#r930616165

Examples

/* ✗ BAD */
var foo = /^(a*).+/u
var foo = /\d+(\d*)/u

RunDevelopment avatar Jul 27 '22 08:07 RunDevelopment

The analysis for this rule will be quite involved. I don't have enough time this week to implement this, so I'll try to do it next week. Just to let you know.

RunDevelopment avatar Jul 27 '22 09:07 RunDevelopment

EDIT: Please forget this comment. I misunderstood.


I think it is difficult to check whether /^(a*).+/u is unintended by the user.

For example, if we extract tabs and space instead of a like this, we can use it as a regex to extract a line and its indentation.

/^([\t ]*).+/gmu

Do you think we can distinguish between them?

so false positives are to be expected.

Or do you think it's an acceptable false positives?

const text = `
a
 a
  a
   a
`
const result = [...text.matchAll(/^([\t ]*).+/gmu)].map(([line, indent]) => ({
    indent,
    line,
}))
console.log(JSON.stringify(result, null, 2))
// [
//   {
//     "indent": "",
//     "line": "a"
//   },
//   {
//     "indent": " ",
//     "line": " a"
//   },
//   {
//     "indent": "  ",
//     "line": "  a"
//   },
//   {
//     "indent": "   ",
//     "line": "   a"
//   }
// ]

so I'll try to do it next week.

No problem :+1:

ota-meshi avatar Jul 27 '22 09:07 ota-meshi

Sorry, I noticed that the regular expression was wrong. That doesn't work. It should also be reported.

const text = `
\t
`
const result = [...text.matchAll(/^([\t ]*).+/gmu)].map(([line, indent]) => ({
    indent,
    line,
}))
console.log(JSON.stringify(result, null, 2))
// [
//   {
//     "indent": "",
//     "line": "\t"
//   }
// ]

ota-meshi avatar Jul 27 '22 09:07 ota-meshi

So I've been trying to implement this rule (I called it no-misleading-capturing-group), and I ran into a problem:

Capturing groups that do not behave as their pattern would suggest. E.g. /\d+(\d*)/

I generalized the simple case of "a quantifier preceded by another quantifier" to "a quantifier preceded by a set of quantifiers." This allows us to find more complex problems such as /(?:a\w*|b\d*)(\d*)/u. In general, we can now answer whether a quantifier A{n,m} can be simplified to A{n} based on the quantifiers before it without changing the behavior of the pattern. Simplifying the quantifier also guarantees that all capturing groups still behave the same.

This is interesting because this would allow us to move this fix into optimal-quantifier-concatenation (OQC). OQC needs to handle capturing groups with care because its current fixes cannot guarantee that they don't change the behavior of the changed capturing group. However, this method of simplifying quantifiers can guarantee that.

The problems found by this new method are both in the domain of OQC and no-misleading-capturing-group. Should both rules report these problems?

RunDevelopment avatar Nov 17 '22 21:11 RunDevelopment

Thank you for starting implementing the no-misleading-capturing-group rule!

The problems found by this new method are both in the domain of OQC and no-misleading-capturing-group. Should both rules report these problems?

In my opinion both rules should report by default. OQC can ignore the report with the capturingGroups: "ignore" option.

ota-meshi avatar Nov 18 '22 00:11 ota-meshi