super-expressive

Rewrite anythingButString in terms of a lookahead/consume any character

Open francisrstokes opened this issue 2 years ago • 0 comments

Right now anythingButString is implemented in a very non-ideal way (see #58). The plan is to replace the existing function, and potentially add one more.

anythingButString('aeiou') will produce output like:

// non-capturing group, containing a negative lookahead for the exact string, then matching any character inputString.length times
/(?:(?!aeiou).{5})/
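As a rough sketch of how that pattern could be generated from an input string (the helper names `escapeRegexChars` and `anythingButStringPattern` are made up for illustration and are not part of super-expressive):

```javascript
// Hypothetical helpers, not part of super-expressive's API.

// Escape characters that have special meaning inside a regex.
const escapeRegexChars = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Negative lookahead for the exact string, followed by `str.length`
// arbitrary characters, all inside a non-capturing group.
const anythingButStringPattern = (str) =>
  new RegExp(`(?:(?!${escapeRegexChars(str)}).{${str.length}})`);

const asciiPattern = anythingButStringPattern('aeiou');

console.log(asciiPattern.source);        // (?:(?!aeiou).{5})
console.log(asciiPattern.test('hello')); // true: five characters that are not "aeiou"
console.log(asciiPattern.test('aeiou')); // false: the lookahead rejects the exact string
```

Note that the pattern is unanchored, so a longer input such as 'xaeiou' still matches (on 'xaeio'); whether that is desirable depends on how the function is meant to be composed.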

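This implementation will only work predictably for ASCII-type strings, because length counts UTF-16 code units rather than code points or user-perceived characters. On top of that, the same visible character can be represented by more than one sequence of code points (precomposed vs. combining forms, for example), since strings are not normalised automatically.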
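A few plain JavaScript checks illustrate the mismatch; nothing here depends on super-expressive, and the variable names are only for illustration:

```javascript
// U+1F600 lies outside the Basic Multilingual Plane, so it is stored
// as a surrogate pair (two UTF-16 code units).
const emoji = '\u{1F600}';
const codeUnitCount = emoji.length;        // 2
const codePointCount = [...emoji].length;  // 1

// The same visible character, encoded two different ways.
const precomposed = '\u00e9';  // "é" as a single code point
const decomposed = 'e\u0301';  // "e" followed by a combining acute accent
const equalRaw = precomposed === decomposed;                   // false
const equalNfc = precomposed === decomposed.normalize('NFC');  // true

// Without the `u` flag, `.` consumes one code unit, so `.{1}` cannot
// cover the whole emoji; with the flag it consumes one code point.
const matchesWithoutU = /^.{1}$/.test(emoji);  // false
const matchesWithU = /^.{1}$/u.test(emoji);    // true
```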

To provide an API that is also able to deal with unicode strings, something like anythingButStringUnicode(inputString, numCharactersToMatch) could be added. In this case, the user would be expected to provide the actual number of characters that should be matched after the lookahead. This is kind of fraught in itself due to normalisation, and the fact that whatever string you'd want to match in place may not match the number of code points anyway.
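One possible shape for such a function, purely as a sketch: the name and parameters come from the proposal above, but the body is an assumption. It relies on the `u` flag so that `.` consumes a whole code point, and it does nothing about normalisation. `escapeForUnicodeRegex` is a hypothetical helper.

```javascript
// Hypothetical helper: escape regex syntax characters. Every character
// escaped here is a syntax character, so the result is valid under `u`.
const escapeForUnicodeRegex = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Sketch only: the caller supplies how many characters to consume
// after the negative lookahead.
const anythingButStringUnicode = (inputString, numCharactersToMatch) =>
  new RegExp(
    `(?:(?!${escapeForUnicodeRegex(inputString)}).{${numCharactersToMatch}})`,
    'u'
  );

// Two emoji are four code units but two code points.
const twoEmoji = '\u{1F600}\u{1F600}';
const unicodePattern = anythingButStringUnicode(twoEmoji, 2);

const rejectsExact = unicodePattern.test(twoEmoji);        // false
const acceptsOther = unicodePattern.test('\u{1F600}a');    // true
```

The burden of choosing numCharactersToMatch stays with the caller, which is exactly the part of the design the following paragraph questions.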

I imagine that this API would still cause confusion with users, both those looking explicitly to match unicode strings, and those who assume they should use this version of the function because why wouldn't you use unicode? In that case, it may be better to skip it altogether, and allow the user to use the group/assertAhead/anyChar/exactly APIs to build the equivalent manually. Though in that case, it still might be worth adding an anyDataUnit as a low-level API for unicode matching.

francisrstokes, Nov 11 '22 08:11