Confusing behavior of regex parameters
Right now, the regex set implementation used by bindgen adds anchors to user-provided expressions so that they must match the entire input string. I can't find this anywhere in the documentation, and it has caused problems before (#1755). Inspired by this, I did a bit of research.
Of the ~600 crates that depend on bindgen, about half use builder functions that accept regexes, such as the whitelist and blacklist functions. Of those crates, around 30 (5% of all dependents, 4% when weighted by download count) currently add unnecessary anchors to their regexes.
Another 30 crates use alternation in their regexes, which is bug-prone because of how it interacts with the implicit anchors. Many of these uses aren't technically correct, even if they do work correctly within the scope of the crate.
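To illustrate the alternation pitfall, here is a minimal sketch. It assumes the anchors are added by plain concatenation (`^` + pattern + `$`) with no grouping, and it uses Python's `re` module only as a stand-in for the Rust regex engine. The `wrap` and `is_match` helpers are hypothetical and are not bindgen APIs.

```python
import re

def wrap(pattern: str, grouped: bool = False) -> str:
    # Hypothetical helper: mimics implicit anchoring of a user pattern.
    # grouped=False concatenates anchors directly; grouped=True
    # parenthesizes the pattern first, as grep -x is documented to do.
    if grouped:
        return "^(?:" + pattern + ")$"
    return "^" + pattern + "$"

def is_match(pattern: str, name: str, grouped: bool = False) -> bool:
    return re.search(wrap(pattern, grouped), name) is not None

# Without grouping, "foo|bar" becomes "^foo|bar$", which means
# "starts with foo" OR "ends with bar", not "exactly foo or bar".
print(is_match("foo|bar", "foo"))        # True
print(is_match("foo|bar", "foo_extra"))  # True (probably unintended)
print(is_match("foo|bar", "my_bar"))     # True (probably unintended)

# With grouping, only the exact names match.
print(is_match("foo|bar", "foo_extra", grouped=True))  # False
print(is_match("foo|bar", "my_bar", grouped=True))     # False
```

Under that assumption, a crate that writes `foo|bar` gets the right result only as long as no other item happens to start with `foo` or end with `bar`.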
I'm creating this issue to start a discussion about whether this is a problem that needs to be fixed, and if so, how we could fix it without causing breakage for existing users.
I understand why the anchors are added, because it makes it easier to specify exact matches, but I'm not happy with the effect it has when regular expressions are being used. When I give a regular expression, I generally expect it to be evaluated as-is. For example, to specify a prefix, I'd expect that I'd be able to write a raw regular expression like ^gl, but in bindgen this won't work because it becomes ^^gl$. The right answer (currently) is gl.*, which then becomes ^gl.*$.
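The prefix example can be checked with a short sketch. Again this uses Python's `re` as a stand-in for the Rust regex engine and assumes the anchors are simply concatenated around the pattern. `implicitly_anchored` is a hypothetical name, not something bindgen exposes.

```python
import re

def implicitly_anchored(pattern: str, name: str) -> bool:
    # Hypothetical helper: wrap the user pattern as ^pattern$ and test it.
    return re.search("^" + pattern + "$", name) is not None

# "^gl" becomes "^^gl$": it matches only the exact string "gl".
print(implicitly_anchored("^gl", "glClear"))   # False
print(implicitly_anchored("^gl", "gl"))        # True

# "gl.*" becomes "^gl.*$": it matches anything starting with "gl".
print(implicitly_anchored("gl.*", "glClear"))  # True
print(implicitly_anchored("gl.*", "eglInit"))  # False
```

So a user who writes `^gl` expecting a prefix match silently gets an exact match instead, with no error to hint at the difference.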
On the other hand, it might be acceptable behavior. grep also has a flag to match the whole line, so writing regular expressions that check for exact matches may not be as foreign as I originally thought:
    -x, --line-regexp
        Select only those matches that exactly match the whole line.
        For a regular expression pattern, this is like parenthesizing
        the pattern and then surrounding it with ^ and $.
Also #2195 may be relevant, if it shows that this problem is biting users.
Solved via #2345