More UTF-16 support; simple Regexp transpiler

Open hmdne opened this issue 9 months ago • 2 comments

A little-known fact about JavaScript is that its strings are encoded as UTF-16. To take full advantage of that, however, one must use the right functions; otherwise characters outside the BMP, such as the now-ubiquitous emoji, are not handled correctly.
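To illustrate the point, here is a minimal sketch in plain JavaScript (not code from this patch) of how code-unit-based and code-point-based operations disagree on a string that contains an emoji:

```javascript
// A string with one character outside the Basic Multilingual Plane (BMP).
const s = "a😀";

// length counts UTF-16 code units, so the emoji counts twice.
const codeUnits = s.length;

// Iterating a string (here via the spread operator) walks code points.
const codePoints = [...s].length;

// charCodeAt returns a lone surrogate; codePointAt returns the full code point.
const highSurrogate = s.charCodeAt(1).toString(16);
const fullCodePoint = s.codePointAt(1).toString(16);

// Without the u flag "." matches one code unit; with it, one code point.
const dotWithoutU = /^.$/.test("😀");
const dotWithU = /^.$/u.test("😀");

console.log({ codeUnits, codePoints, highSurrogate, fullCodePoint, dotWithoutU, dotWithU });
```

The same mismatch affects any operation that indexes, slices or reverses strings by code unit.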

This commit also makes most regexps use Unicode mode. Because Unicode-mode regexps are stricter, we now really need a half-decent transpiler, so this commit adds one. While at it, we also add support for POSIX character classes, which are quite often used in Ruby but don't exist in JS, so we simulate them with Unicode character classes.
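As a rough sketch of the idea (the `POSIX_CLASSES` table and the `translatePosixClasses` helper below are hypothetical, and the mapping actually used by the patch may differ), POSIX classes can be rewritten into Unicode property escapes before the source is handed to `RegExp`:

```javascript
// Hypothetical, approximate mapping from POSIX class names to Unicode
// property escapes. Not taken from the patch.
const POSIX_CLASSES = new Map([
  ["alpha", "\\p{Alphabetic}"],
  ["digit", "\\p{Decimal_Number}"],
  ["upper", "\\p{Uppercase}"],
  ["lower", "\\p{Lowercase}"],
  ["space", "\\p{White_Space}"],
]);

// Replace every [:name:] with its property escape; unknown names are kept.
function translatePosixClasses(source) {
  return source.replace(/\[:(\w+):\]/g, (match, name) =>
    POSIX_CLASSES.has(name) ? POSIX_CLASSES.get(name) : match
  );
}

// Report whether a source compiles with the given flags.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false;
  }
}

const rubySource = "[[:alpha:]]+";
const jsSource = translatePosixClasses(rubySource);
const re = new RegExp("^" + jsSource + "$", "u");

console.log(jsSource, re.test("héllo"), compiles(rubySource, "u"));
```

The untranslated source still compiles without the `u` flag, where it silently means something else, but it is rejected in Unicode mode. That is the kind of strictness that makes a transpilation step necessary.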

As a side effect, this made us support value omission for hashes when compiling with Opal in JS (e.g. when using eval). Since all the MSpec tests do this, we pass the tests now.

We also add proper support for multiline regular expressions. The semantics of multiline matching differ greatly between Ruby and JS; those are basically two different features. This commit aims to reconcile them in the most straightforward way, and introduces reasonably proper handling of all of "\A", "\z", "$" and "^". We assume that a regexp will contain only one set of those, in which case things will work correctly. If not, we emit a warning.
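For context: in Ruby, "^" and "$" always match at line boundaries, "\A" and "\z" match at the string boundaries, and the /m option makes "." match a newline. In JS, "^" and "$" match at the string boundaries unless the `m` flag is set, "\A" and "\z" do not exist, and "." matches a newline only with the `s` flag. The sketch below shows one way to reconcile the two for the simple case. The `translateAnchors` helper is hypothetical, ignores escapes and character classes, and is not the implementation from this patch:

```javascript
// Hypothetical helper: turn a Ruby regexp source into a JS RegExp, assuming
// the pattern uses only one anchor family (\A and \z, or ^ and $).
function translateAnchors(rubySource, rubyMultiline = false) {
  const usesStringAnchors = /\\A|\\z/.test(rubySource);
  const usesLineAnchors = /[\^$]/.test(rubySource);
  if (usesStringAnchors && usesLineAnchors) {
    console.warn("mixed anchor families; translation may be inexact");
  }

  let flags = "u";
  // Ruby's /m makes "." match "\n", which is the JS s (dotAll) flag.
  if (rubyMultiline) flags += "s";
  // Ruby's ^ and $ are line anchors, which is the JS m flag.
  if (usesLineAnchors && !usesStringAnchors) flags += "m";

  // With the m flag off, JS ^ and $ behave like \A and \z.
  const source = rubySource
    .replace(/\\A/g, () => "^")
    .replace(/\\z/g, () => "$");
  return new RegExp(source, flags);
}

const wholeString = translateAnchors("\\Ab\\z");
const perLine = translateAnchors("^b$");
const dotAll = translateAnchors("a.b", true);

console.log(wholeString.flags, perLine.flags, dotAll.flags);
```

A pattern that mixes both families cannot be expressed with a single flag choice in this scheme, which is why a warning is the fallback.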

Regexps are now annotated if needed. This means that if a regexp has been transpiled and the transpilation result differs from the original, a copy of the original Regexp is preserved, so that further manipulations of that Regexp, for instance Regexp.union, work on the original Regexp.
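A minimal sketch of what such an annotation could look like follows. The `ORIGINAL_SOURCE` symbol and the `annotate`, `originalSourceOf` and `unionSource` helpers are hypothetical names chosen for illustration, and the way the patch actually stores the original is not shown here:

```javascript
// Hypothetical key under which the pre-transpilation source is kept.
const ORIGINAL_SOURCE = Symbol("originalSource");

// Attach the original source only when transpilation changed something.
function annotate(compiled, originalSource) {
  if (compiled.source !== originalSource) {
    compiled[ORIGINAL_SOURCE] = originalSource;
  }
  return compiled;
}

// Prefer the preserved original, falling back to the compiled source.
function originalSourceOf(re) {
  return re[ORIGINAL_SOURCE] ?? re.source;
}

// Sketch of a union: combine the originals, so the result can be transpiled
// once as a whole instead of gluing already-transpiled pieces together.
function unionSource(...regexps) {
  return regexps.map((re) => `(?:${originalSourceOf(re)})`).join("|");
}

const a = annotate(new RegExp("^[\\p{Alphabetic}]+$", "u"), "\\A[[:alpha:]]+\\z");
const b = annotate(/\d+/u, "\\d+");

console.log(unionSource(a, b));
```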

This PR has been sponsored by Ribose Inc.

hmdne avatar Nov 09 '23 08:11 hmdne

The performance impact must be investigated.

hmdne avatar Nov 26 '23 12:11 hmdne

The third iteration of this patch fixes a problem where a regexp like [^a] would be treated as containing a ^ assertion.
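The distinction can be sketched with a small scanner that skips escaped characters and anything inside a character class. The `hasCaretAssertion` function is a hypothetical illustration that does not track nested classes, not the code from the patch:

```javascript
// Hypothetical scanner: true if the source uses "^" as an assertion, as
// opposed to an escaped "\^" or a "^" inside a [...] character class.
function hasCaretAssertion(source) {
  let inClass = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\") {
      i++; // skip the escaped character
    } else if (inClass) {
      if (ch === "]") inClass = false;
    } else if (ch === "[") {
      inClass = true;
    } else if (ch === "^") {
      return true;
    }
  }
  return false;
}

console.log(hasCaretAssertion("[^a]"), hasCaretAssertion("^a"));
```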

hmdne avatar Nov 29 '23 14:11 hmdne