eslint-plugin-regexp
eslint-plugin-regexp copied to clipboard
New rule to find bad capturing groups
Description Add a new rule to find bad (= obviously or likely incorrect) capturing groups.
Bad capturing groups are:
- Capturing groups that do not behave as their pattern would suggest. E.g.
/\d+(\d*)/
will always capture the empty string despite containing the pattern\d*
. In general, we should verify that the language of all possible captured strings is equal to the language of the pattern of a capturing group. - Capturing groups that likely behave incorrectly. E.g.
/^(a*).+/u
likely behaves incorrectly for the input stringa
. In general, we should verify that a capturing group cannot exchange characters with the pattern after it. This check should be behind an option to turn it off. It will find likely bad capturing groups, so false positives are to be expected.
Reference: #451, https://github.com/ota-meshi/eslint-plugin-regexp/pull/452/files#r930616165
Examples
/* ✗ BAD */
var foo = /^(a*).+/u
var foo = /\d+(\d*)/u
The analysis for this rule will be quite involved. I don't have enough time this week to implement this, so I'll try to do it next week. Just to let you know.
EDIT: Please forget this comment. I misunderstood.
I think it is difficult to check whether /^(a*).+/u
is unintended by the user.
For example, if we extract tabs and space instead of a
like this, we can use it as a regex to extract a line and its indentation.
/^([\t ]*).+/gmu
Do you think we can distinguish between them?
so false positives are to be expected.
Or do you think it's an acceptable false positives?
const text = `
a
a
a
a
`
const result = [...text.matchAll(/^([\t ]*).+/gmu)].map(([line, indent]) => ({
indent,
line,
}))
console.log(JSON.stringify(result, null, 2))
// [
// {
// "indent": "",
// "line": "a"
// },
// {
// "indent": " ",
// "line": " a"
// },
// {
// "indent": " ",
// "line": " a"
// },
// {
// "indent": " ",
// "line": " a"
// }
// ]
so I'll try to do it next week.
No problem :+1:
Sorry, I noticed that the regular expression was wrong. That doesn't work. It should also be reported.
const text = `
\t
`
const result = [...text.matchAll(/^([\t ]*).+/gmu)].map(([line, indent]) => ({
indent,
line,
}))
console.log(JSON.stringify(result, null, 2))
// [
// {
// "indent": "",
// "line": "\t"
// }
// ]
So I've been trying to implement this rule (I called it no-misleading-capturing-group
), and I ran into a problem:
Capturing groups that do not behave as their pattern would suggest. E.g.
/\d+(\d*)/
I generalized the simple case of "a quantifier preceded by another quantifier" to "a quantifier preceded by a set of quantifiers." This allows us to find more complex problems such as /(?:a\w*|b\d*)(\d*)/u
. In general, we can now answer whether a quantifier A{n,m}
can be simplified to A{n}
based on the quantifiers before it without changing the behavior of the pattern. Simplifying the quantifier also guarantees that all capturing groups still behave the same.
This is interesting because this would allow us to move this fix into optimal-quantifier-concatenation
(OQC). OQC needs to handle capturing groups with care because its current fixes cannot guarantee that they don't change the behavior of the changed capturing group. However, this method of simplifying quantifiers can guarantee that.
The problems found by this new method are both in the domain of OQC and no-misleading-capturing-group
. Should both rules report these problems?
Thank you for starting implementing the no-misleading-capturing-group
rule!
The problems found by this new method are both in the domain of OQC and no-misleading-capturing-group. Should both rules report these problems?
In my opinion both rules should report by default.
OQC can ignore the report with the capturingGroups: "ignore"
option.