rust-bindgen

Confusing behavior of regex parameters

Open agausmann opened this issue 5 years ago • 2 comments

Right now, the regex set implementation used by bindgen adds anchors to user-provided expressions so that they must match the entire input string. I can't find this anywhere in the documentation, and it has caused problems before (#1755). Inspired by this, I did a bit of research.

Of the ~600 crates that depend on bindgen, about half of them use builder functions that accept regexes, for example, the whitelist and blacklist functions. Of those crates, there are around 30 (5% of all dependents, 4% when weighted by download count) that currently add anchors to their regexes that aren't necessary.

Another 30 crates use alternation in their regexes, which is bug-prone because of how it interacts with the implicit anchors. Many of these uses aren't technically correct, even if they do work correctly within the scope of the crate.
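
To make the alternation pitfall concrete, here is a minimal sketch in Python. It assumes the anchoring amounts to wrapping the user's pattern as ^pattern$ with no grouping; the anchor helper and the sample names are hypothetical, and Python's re module stands in for the Rust regex crate, which should treat these simple patterns the same way.

```python
import re

def anchor(pattern: str) -> str:
    # Assumption: anchors are added around the raw pattern, with no grouping.
    return "^" + pattern + "$"

def matches(pattern: str, name: str) -> bool:
    return re.search(anchor(pattern), name) is not None

# Intended meaning: exactly "foo" or exactly "bar".
# Effective pattern: "^foo|bar$", i.e. "starts with foo" OR "ends with bar".
names = ["foo", "bar", "foo_extra", "other_bar", "baz"]
results = {name: matches("foo|bar", name) for name in names}

for name, hit in results.items():
    print(f"{name}: {hit}")
```

Under that assumption, foo_extra and other_bar are matched as well, which is why such a pattern can appear to work in a crate whose headers happen not to contain names like those.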

I'm creating this issue to start a discussion about whether this is a problem that needs to be fixed, and if so, how we could fix it without causing breakage for existing users.

agausmann, May 14 '20 19:05

I understand why the anchors are added: they make it easier to specify exact matches. But I'm not happy with the effect they have when regular expressions are being used. When I give a regular expression, I generally expect it to be evaluated as-is. For example, to specify a prefix, I'd expect to be able to write a raw regular expression like ^gl, but in bindgen this won't work because it becomes ^^gl$, which matches only the exact string gl. The right answer (currently) is gl.*, which then becomes ^gl.*$.
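
A small sketch of the prefix case, again in Python and again assuming the pattern is wrapped as ^pattern$. The anchor helper and the list of names are made up for illustration and are not taken from bindgen.

```python
import re

def anchor(pattern: str) -> str:
    # Assumption: the user's pattern is wrapped as ^pattern$.
    return "^" + pattern + "$"

# Hypothetical item names, for illustration only.
names = ["glClear", "glDrawArrays", "eglInit", "gl"]

# "^gl" becomes "^^gl$", which only matches the exact string "gl".
raw_prefix = [n for n in names if re.search(anchor("^gl"), n)]

# "gl.*" becomes "^gl.*$", which matches every name starting with "gl".
working_prefix = [n for n in names if re.search(anchor("gl.*"), n)]

print(raw_prefix)
print(working_prefix)
```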

On the other hand, it might be acceptable behavior. Considering that grep also has a flag to match the whole line, writing regular expressions that check for exact matches may not be as foreign as I originally thought:

       -x, --line-regexp
              Select only those matches that exactly match the whole line.
              For a regular expression pattern, this is like parenthesizing
              the pattern and then surrounding it with ^ and $.
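
The parenthesizing step in that description matters for alternation. As a rough sketch (Python, with a hypothetical anchor_grouped helper that mimics what the man page describes, not any particular implementation), grouping the pattern before adding the anchors makes foo|bar mean exactly foo or exactly bar:

```python
import re

def anchor_grouped(pattern: str) -> str:
    # Mimics the grep -x description: parenthesize, then surround with ^ and $.
    return "^(" + pattern + ")$"

# Hypothetical names, for illustration only.
names = ["foo", "bar", "foo_extra", "other_bar", "baz"]

exact = [n for n in names if re.search(anchor_grouped("foo|bar"), n)]
print(exact)
```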

agausmann, May 14 '20 19:05

Also #2195 may be relevant, if it shows that this problem is biting users.

kulp, Jun 02 '22 10:06

Solved via #2345

pvdrz, Nov 14 '22 17:11