Don't expand character match length with flag `i` (unless using a new flag)

Open slevithan opened this issue 1 year ago • 4 comments

Currently, Oniguruma sometimes applies Unicode's SpecialCasing.txt rules when using flag i, which can lengthen the match of a character, character class, or set (like \w or \S). For example, (?i)^ß$ matches 'ss', and (?i)^ss$ matches 'ß'.
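For readers unfamiliar with these mappings, the length-changing conversions can be seen outside of any regex engine. The sketch below uses JavaScript's String.prototype.toUpperCase, which applies Unicode's full (locale-independent) case mappings. It only illustrates the underlying Unicode data; it says nothing about how Oniguruma implements matching, and the variable names are made up for the example.

```javascript
// Characters whose full uppercase mapping is longer than one character,
// plus long s (U+017F) as a control whose mapping keeps its length.
const samples = ['\u00DF', '\uFB00', '\uFB02', '\u017F']; // ß, ﬀ, ﬂ, ſ

const mappings = samples.map((ch) => {
  const upper = ch.toUpperCase();
  return { ch, upper, lengthChanged: upper.length !== ch.length };
});

for (const { ch, upper, lengthChanged } of mappings) {
  console.log(`${ch} -> ${upper} (length changed: ${lengthChanged})`);
}
```

The first three entries grow from one character to two, which is the kind of expansion this issue is about. The last one maps to a single character.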

I don't think Oniguruma should do that, unless the behavior is applied behind a dedicated flag or option. And if such a flag were added, the behavior could be applied more consistently than it is now, since users would be opting in and the performance implications could be documented.

Following is my understanding of the reasons for and against the current behavior. Are there additional reasons I'm missing?

Reasons to continue expanding the length of a match

  • Changing it now would be a breaking change.
  • It follows Unicode recommendations and the Unicode org's ICU regex engine.
  • It might be a big or complex change in the code (a lot of work).
  • It sometimes solves a real issue with complex case differences.

Opinion: Even though it solves a real casing problem, the problem usually isn't relevant in the context of regular expressions. And when it is, there's usually an easy workaround, or the problem didn't need solving in the first place (for example, because the user wrote \w+ and so it would already match both ß and ss).

Reasons to stop expanding the length

  • It is currently applied inconsistently anyway, based on complicated and nonintuitive conditions that few users will understand (I will show examples below).
  • The fact that e.g. (?i)\w and (?i)[\w] are not equivalent makes it hard to reason about or refactor regexes (similar to #349).
  • It hurts performance (sometimes catastrophically), as discussed in #350.
  • It cannot be fully "fixed" (applied consistently) without hurting performance even more, for things like the dot (.).
  • It is not portable with other regex flavors that don't do this, including Perl, PCRE2, JavaScript, Java, .NET, and Rust.
  • Most of the time, the behavior is surprising (this would change if users had to opt in).
  • Most of the time, the behavior is undesired, since it is simply wrong in terms of following user intent. For example, if someone uses \S in a character class, in essentially 100% of cases they do not mean to match 'ss', 'ff', 'fl', etc. They mean "any single character that is not whitespace". And despite "ss" being the case conversion of a single character, it is not itself a single character in any context/language.

If you accept my statements above, unfortunately it means that, in exchange for ① the added complexity in the engine, ② the inconsistency/unpredictability for users, and ③ the resulting performance problems, users get behavior that in almost all cases they didn't expect or want.

Recent precedent from JavaScript

JavaScript is an interesting regex flavor to compare to, because in ES2024 it added flag v (unicodeSets), which allows character classes and Unicode properties to match more than one character at a time using a few specific "properties of strings" like \p{RGI_Emoji} (which can also be used in character classes) or the new syntax […\q{…|…}]. However, even though JavaScript character classes and Unicode properties can now match more than one character, and even though JavaScript flags u/v change flag i to use Unicode case folding, JavaScript nevertheless did not choose to apply Unicode's special casing rules that change match length (like ß → ss).
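A short sketch of that JavaScript behavior, using only the i and u flags so it runs on older engines too. The variable names are illustrative.

```javascript
// With flags i + u, JavaScript compares characters after simple case folding,
// so a match never changes length.
const cases = [
  [/^\u00DF$/iu, 'ss'],     // ß vs 'ss': length would have to change
  [/^ss$/iu, '\u00DF'],     // 'ss' vs ß: same in the other direction
  [/^\u00DF$/iu, '\u1E9E'], // ß vs ẞ: one character each
  [/^s$/iu, '\u017F'],      // s vs ſ: one character each
];

const outcomes = cases.map(([re, target]) => re.test(target));
console.log(outcomes); // [false, false, true, true]
```

The first two cases are the ones where Oniguruma currently reports a match.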

Current Oniguruma behavior

Following are the tests I ran to help me understand the current behavior. Each entry shows the regex and the target string. r is for raw strings (without backslash escaping).

✅ = match ❌ = no match 🤔 = inconsistent or questionable behavior 🤯 = very surprising

[
  // Single `s` doesn't map to small sharp s (German eszett, ß) or its case equivalents
  [r`(?i)^s$`, 'ß'], // ❌
  [r`(?i)^s$`, 'ss'], // ❌
  [r`(?i)^[s]$`, 'ß'], // ❌
  [r`(?i)^[s]$`, 'ss'], // ❌
  [r`(?i)^ß$`, 's'], // ❌
  [r`(?i)^[ß]$`, 's'], // ❌

  // Single `s` does map to its case equivalent small long s
  [r`(?i)^s$`, 'ſ'], // ✅

  // Single `ß` maps to `ss` and its case equivalents
  [r`(?i)^ß$`, 'ß'], // ✅
  [r`(?i)^ß$`, 'ss'], // ✅
  [r`(?i)^ß$`, 'SS'], // ✅
  [r`(?i)^ß$`, 'ſſ'], // ✅
  [r`(?i)^ß$`, 'sS'], // ✅
  [r`(?i)^ß$`, 'sſ'], // ✅
  [r`(?i)^ß$`, 'Ss'], // ✅
  [r`(?i)^ß$`, 'Sſ'], // ✅
  [r`(?i)^ß$`, 'ſs'], // ✅
  [r`(?i)^ß$`, 'ſS'], // ✅
  [r`(?i)^ß$`, 'ẞ'], // ✅ Uppercase `ẞ` in target
  [r`(?i)^ẞ$`, 'ß'], // ✅ Uppercase `ẞ` in pattern

  // The same, within a positive class
  [r`(?i)^[ß]$`, 'ß'], // ✅
  [r`(?i)^[ß]$`, 'ss'], // ✅
  [r`(?i)^[ß]$`, 'SS'], // ✅
  [r`(?i)^[ß]$`, 'ſſ'], // ✅
  [r`(?i)^[ß]$`, 'sS'], // ✅
  [r`(?i)^[ß]$`, 'sſ'], // ✅
  [r`(?i)^[ß]$`, 'Ss'], // ✅
  [r`(?i)^[ß]$`, 'Sſ'], // ✅
  [r`(?i)^[ß]$`, 'ſs'], // ✅
  [r`(?i)^[ß]$`, 'ſS'], // ✅
  [r`(?i)^[ß]$`, 'ẞ'], // ✅ Uppercase `ẞ` in target
  [r`(?i)^[ẞ]$`, 'ß'], // ✅ Uppercase `ẞ` in pattern

  // Negated class basics; nothing surprising here
  [r`(?i)^[^ß]$`, 'ß'], // ❌
  [r`(?i)^[^ß]$`, 'ss'], // ❌
  [r`(?i)^[^s]$`, 'ß'], // ✅
  [r`(?i)^[^s]$`, 'ss'], // ❌
  [r`(?i)^[^ſ]$`, 'ß'], // ✅
  [r`(?i)^[^ſ]$`, 'ss'], // ❌
  [r`(?i)^[^ẞ]$`, 'ß'], // ❌ Uppercase `ẞ` in pattern
  [r`(?i)^[^ẞ]$`, 'ss'], // ❌ Uppercase `ẞ` in pattern

  // Other representations of exactly `ß` are OK
  [r`(?i)^\x{DF}$`, 'ss'], // ✅

  // But not sets that include `ß` 🤔
  [r`(?i)^\w$`, 'ss'], // ❌
  [r`(?i)^\p{Word}$`, 'ss'], // ❌
  [r`(?i)^\D$`, 'ss'], // ❌
  [r`(?i)^.$`, 'ss'], // ❌
  [r`(?i)^\O$`, 'ss'], // ❌
  [r`(?i)^\p{Any}$`, 'ss'], // ❌

  // Within positive classes, other representations of `ß`, and sets/ranges that include `ß`, are OK
  [r`(?i)^[\x{DF}]$`, 'ss'], // ✅
  [r`(?i)^[\x{DE}-\x{E0}]$`, 'ss'], // ✅
  [r`(?i)^[\w]$`, 'ss'], // ✅
  [r`(?i)^[\p{Word}]$`, 'ss'], // ✅
  [r`(?i)^[[:word:]]$`, 'ss'], // ✅
  [r`(?i)^[\D]$`, 'ss'], // ✅
  [r`(?i)^[\P{M}]$`, 'ss'], // ✅
  [r`(?i)^[\p{Any}]$`, 'ss'], // ✅

  // But not within negated classes 🤔
  [r`(?i)^[^[^\x{DF}]]$`, 'ss'], // ❌
  [r`(?i)^[^\0]$`, 'ss'], // ❌
  [r`(?i)^[^\W]$`, 'ss'], // ❌
  [r`(?i)^[^\d]$`, 'ss'], // ❌
  [r`(?i)^[^\p{M}]$`, 'ss'], // ❌

  // The negation rule is about negation of the outermost class, only 🤔
  [r`(?i)^[^[\W]]$`, 'ss'], // ❌
  [r`(?i)^[[^\W]]$`, 'ss'], // ✅ 🤯
  [r`(?i)^[\w&&[^\W]]$`, 'ss'], // ✅ 🤯

  // Flags `W` and `P` exclude `ß` from `\w`
  [r`(?iP)^[\w]$`, 'ss'], // ❌
  [r`(?iW)^[\w]$`, 'ss'], // ❌
  [r`(?iW)^\w$`, 'ss'], // ❌
  [r`(?iW)^[ß]$`, 'ss'], // ✅
  [r`(?iW)^ß$`, 'ss'], // ✅
  [r`(?iW)^\x{DF}$`, 'ss'], // ✅

  // Quantifier basics; nothing surprising here
  [r`(?i)^ß{2}$`, 'ßß'], // ✅
  [r`(?i)^ß{2}$`, 'ss'], // ❌
  [r`(?i)^ß{2}$`, 'ssss'], // ✅
  [r`(?i)^[ß]{2}$`, 'ss'], // ❌
  [r`(?i)^[ß]{2}$`, 'ssss'], // ✅
  [r`(?i)^[^ß]{2}$`, 'ss'], // ✅
  [r`(?i)^[^ß]{2}$`, 'ssss'], // ❌

  // Character classes are affected by backtracking (bad news for performance!) 🤔
  [r`(?i)^[\w]{2}$`, 'ss'], // ✅
  [r`(?i)^[\w]{2}$`, 'sss'], // ✅
  [r`(?i)^[\w]{2}$`, 'ssss'], // ✅

  // In the reverse direction
  [r`(?i)^ss$`, 'ss'], // ✅
  [r`(?i)^ss$`, 'ß'], // ✅
  [r`(?i)^ſſ$`, 'ß'], // ✅
  [r`(?i)^ss$`, 'ẞ'], // ✅ Uppercase `ẞ`

  // In the reverse direction with quantifiers; nothing surprising here
  [r`(?i)^s{2}$`, 'ß'], // ❌
  [r`(?i)^ss{2}$`, 'ßß'], // ❌
  [r`(?i)^(?:ss){2}$`, 'ßß'], // ✅
]
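For anyone who wants to rerun a table like this, here is a minimal harness sketch. It assumes r is String.raw, and it takes the matcher as a parameter because the binding used for the tests above is not stated. The runCases and jsStandIn names are hypothetical. The stand-in matcher uses JavaScript's own RegExp (not Oniguruma) purely so the sketch runs, which means its results show JavaScript's behavior.

```javascript
// Raw-string tag: r`\w` keeps the backslash, as in the table above.
const r = String.raw;

// Runs [pattern, target] pairs through a caller-supplied matcher.
// matchFn is hypothetical: (pattern, target) => boolean.
function runCases(cases, matchFn) {
  return cases.map(([pattern, target]) => ({
    pattern,
    target,
    matched: matchFn(pattern, target),
  }));
}

// Stand-in matcher using JavaScript's RegExp, NOT Oniguruma.
// It strips a leading (?i) and applies flags i + u instead.
function jsStandIn(pattern, target) {
  const source = pattern.replace(/^\(\?i\)/, '');
  return new RegExp(source, 'iu').test(target);
}

const results = runCases(
  [
    [r`(?i)^ß$`, 'ss'],
    [r`(?i)^s$`, 'ſ'],
    [r`(?i)^\w$`, 'ss'],
  ],
  jsStandIn
);

for (const { pattern, target, matched } of results) {
  console.log(`${pattern} on '${target}': ${matched ? 'match' : 'no match'}`);
}
```

Swapping jsStandIn for a function backed by an Oniguruma binding would reproduce the table. With the stand-in, the first case reports no match, which is the difference under discussion.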

slevithan • Apr 18 '25 20:04

If you are open to the breaking change proposed above, I see at least three paths:

  1. Simply no longer provide this functionality. Although this feels like a loss, it would match the majority of other modern regex flavors (though there are exceptions like ICU and Swift).
  2. Preserve the functionality behind a new flag, whole-pattern modifier, or compile-time option.
  3. Provide the functionality via a new metacharacter. This could be similar to but different than \O and \X. For the sake of example, I'll call it \U. To narrow what it matches, you could do things like (?=\w)\U. And if you also adopted set subtraction like […--…] (from several regex flavors, including JavaScript), you could do [\U--\W] (which would be equivalent to Oniguruma's current (?i)[\w]).
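As a reference point for the […--…] syntax mentioned in option 3, this is how set subtraction looks in JavaScript with flag v. The sketch only shows the existing JavaScript syntax; \U is a hypothetical metacharacter and is not modeled here. The compileWithSetNotation helper is made up for the example and returns null on engines without flag v.

```javascript
// Compile with flag v if the engine supports it; otherwise return null.
function compileWithSetNotation(source) {
  try {
    return new RegExp(source, 'v');
  } catch {
    return null; // flag v (ES2024) is not available in this engine
  }
}

// Word characters minus digits, via set subtraction.
const wordButNotDigit = compileWithSetNotation(String.raw`^[\w--\d]$`);

if (wordButNotDigit === null) {
  console.log('Flag v is not supported here.');
} else {
  console.log(wordButNotDigit.test('a')); // true
  console.log(wordButNotDigit.test('5')); // false
}
```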

slevithan • Apr 19 '25 14:04

I would consider changing the spec so that flag i does not match a different number of characters.

kkos • Apr 20 '25 15:04

I'm guessing you already understand Unicode much better than me, but in case it's helpful, what JavaScript seems to do when both the i and u (or v) flag are set is that all symbols are implicitly case-folded using the simple mapping provided by the Unicode standard immediately before they are compared.
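A rough sketch of that idea: fold each code point with a one-to-one mapping, then compare. The table is a tiny hand-picked excerpt of the simple and common entries in Unicode's CaseFolding.txt plus ASCII, so this is an illustration under that assumption and not a complete implementation. The simpleFold and sameIgnoringCase names are made up.

```javascript
// Small excerpt of one-to-one case folding entries (illustrative only;
// a real engine would use the full CaseFolding.txt data).
const SIMPLE_FOLD_EXCERPT = new Map([
  [0x017f, 0x0073], // ſ (long s) -> s
  [0x212a, 0x006b], // Kelvin sign -> k
  [0x1e9e, 0x00df], // ẞ (capital sharp s) -> ß
]);

// Folds one code point to one code point, so length never changes.
function simpleFold(codePoint) {
  if (SIMPLE_FOLD_EXCERPT.has(codePoint)) {
    return SIMPLE_FOLD_EXCERPT.get(codePoint);
  }
  // ASCII A-Z -> a-z; everything else in this sketch folds to itself.
  if (codePoint >= 0x41 && codePoint <= 0x5a) {
    return codePoint + 0x20;
  }
  return codePoint;
}

// Compares the first code point of each string after folding.
function sameIgnoringCase(a, b) {
  return simpleFold(a.codePointAt(0)) === simpleFold(b.codePointAt(0));
}

console.log(sameIgnoringCase('ß', 'ẞ')); // true
console.log(sameIgnoringCase('s', 'ſ')); // true
console.log(sameIgnoringCase('ß', 's')); // false
```

Because ß has no one-to-one folding to s, it can only ever equal ß or ẞ under this scheme, and never the two-character string ss.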

JS also ensures that /\w/iu is the inverse of /\W/iu. This is relevant since JS's \w is ASCII-only [0-9A-Z_a-z], which means that \W matches U+017F and U+212A. But /\w/iu is equivalent to [0-9A-Z_a-z\u017F\u212A], so /\W/iu is the inverse of /\w/iu. That means /\W/iu is equivalent to /[^0-9A-Z_a-z\u017F\u212A]/u.
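The relationships described above can be checked directly in JavaScript. This is a small sketch with made-up variable names, using U+017F (long s) and U+212A (Kelvin sign) as the test characters.

```javascript
const longS = '\u017F';  // ſ
const kelvin = '\u212A'; // Kelvin sign

const checks = {
  // Without flag i, \W matches both characters (they are not ASCII word chars).
  nonWordU: [/\W/u.test(longS), /\W/u.test(kelvin)],
  // With flags i + u, \w matches them through case folding...
  wordIU: [/\w/iu.test(longS), /\w/iu.test(kelvin)],
  // ...and \W stays the exact inverse of \w.
  nonWordIU: [/\W/iu.test(longS), /\W/iu.test(kelvin)],
  // The explicit negated class behaves the same as \W with i + u.
  explicitNegated: [
    /[^0-9A-Z_a-z\u017F\u212A]/u.test(longS),
    /[^0-9A-Z_a-z\u017F\u212A]/u.test(kelvin),
  ],
};

console.log(checks);
```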

slevithan • Apr 21 '25 13:04

Fixed #351 in branch fix_351_349.

kkos • Apr 22 '25 14:04