opal
More UTF-16 support; simple Regexp transpiler
A little-known fact about JavaScript is that it uses UTF-16 encoding for its strings. But to take full advantage of UTF-16, one must use the correct functions; otherwise characters outside the BMP, such as the now-ubiquitous emoji, are not supported.
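To illustrate the difference between code-unit-based and code-point-based functions, here is a small plain-JavaScript sketch. It uses only standard built-ins and is not taken from the patch itself; the variable names are ours.

```javascript
// "😀" (U+1F600) lies outside the BMP, so UTF-16 stores it as a surrogate pair.
const s = "a😀";

const codeUnits = s.length;         // 3: length counts UTF-16 code units
const codePoints = [...s].length;   // 2: the string iterator walks code points

const unitAt1 = s.charCodeAt(1);    // 0xD83D: only the high surrogate
const pointAt1 = s.codePointAt(1);  // 0x1F600: the full code point

// Without the u flag "." matches one code unit; with it, one code point.
const dotLegacy = /^.$/.test("😀");    // false
const dotUnicode = /^.$/u.test("😀");  // true

console.log({ codeUnits, codePoints, unitAt1, pointAt1, dotLegacy, dotUnicode });
```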
This commit also makes most regexps use Unicode mode. Because Unicode mode regexps are stricter, we now really need a half-decent transpiler, so this commit adds one. Taking advantage of that, we also add support for POSIX character classes, which are quite often used in Ruby but don't exist in JS, so we simulate them with Unicode character classes.
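As a rough idea of what simulating POSIX classes can look like, here is a minimal sketch. The mapping table and the function name `transpilePosixClasses` are assumptions made for illustration and are not necessarily what this patch uses; the Unicode property escapes themselves are standard JS and require the `u` flag.

```javascript
// Assumed, illustrative mapping; the real transpiler may choose different properties.
const POSIX_TO_UNICODE = {
  alpha: "Alphabetic",
  digit: "Nd",
  upper: "Uppercase",
  lower: "Lowercase",
  space: "White_Space",
  punct: "P",
};

// Hypothetical helper: rewrite [:name:] and [:^name:] into \p{...} and \P{...}.
// Simplification: it does not check that the class sits inside a bracket expression.
function transpilePosixClasses(source) {
  return source.replace(/\[:(\^?)([a-z]+):\]/g, (match, negated, name) => {
    const prop = POSIX_TO_UNICODE[name];
    if (!prop) return match; // unknown class: leave untouched
    return (negated ? "\\P" : "\\p") + "{" + prop + "}";
  });
}

const jsSource = transpilePosixClasses("^[[:alpha:][:digit:]_]+$");
const re = new RegExp(jsSource, "u");

console.log(jsSource);                    // ^[\p{Alphabetic}\p{Nd}_]+$
console.log(re.test("h\u00e9llo_42"));    // true
console.log(re.test("a b"));              // false
```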
As a side effect, this made us support value omission for hashes when compiling with Opal in JS (e.g. when using eval). Since all the MSpec tests do this, we pass the tests now.
We also add proper support for multiline regular expressions. The difference between how multiline works in Ruby and in JS is very big; they are basically two different features. This commit aims to reconcile the two in the most straightforward way and introduces fairly complete handling of "\A", "\z", "$" and "^". We assume that a regexp will contain only one set of those, in which case things will work correctly. If not, we emit a warning.
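The mismatch can be seen directly in JS. In Ruby, "^" and "$" always match at line boundaries, "\A" and "\z" match at string boundaries, and the /m option makes "." match newlines. The sketch below shows the corresponding JS behaviour; `translateStringAnchors` is a hypothetical helper written for this example, not the transpiler from this patch.

```javascript
const text = "first\nsecond";

// JS without the m flag: ^ and $ anchor to the whole string (like Ruby's \A and \z).
const jsNoFlag = /^second$/.test(text);    // false
// JS with the m flag: ^ and $ anchor to each line (like Ruby's ^ and $).
const jsMFlag = /^second$/m.test(text);    // true
// Ruby's /m (dot matches newline) corresponds to the JS s flag, not the JS m flag.
const dotNoFlag = /first.second/.test(text);   // false
const dotSFlag = /first.second/s.test(text);   // true

// Hypothetical helper: rewrite \A and \z as ^ and $ for use without the m flag.
// Escape pairs are consumed one at a time, so "\\A" (escaped backslash, then A) is left alone.
// Simplification: bracket expressions are not treated specially.
function translateStringAnchors(source) {
  return source.replace(/\\(.)/gs, (pair, ch) => {
    if (ch === "A") return "^";
    if (ch === "z") return "$";
    return pair;
  });
}

const wholeString = new RegExp(translateStringAnchors("\\Asecond\\z"), "u");
const perLine = new RegExp("^second$", "um");

console.log(wholeString.test(text), perLine.test(text)); // false true
```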
Regexps are now annotated if needed. This means that if a certain regexp has been transpiled and the transpilation result differs, a copy of the original Regexp will be preserved, so that further manipulations on that Regexp, for instance Regexp.union, will work on the original Regexp.
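The idea of annotation can be sketched roughly as follows. Everything here is hypothetical: the property name `rubySource` and the helpers `annotate`, `originalSource` and `unionSource` are invented for this example and do not describe Opal's actual internals.

```javascript
// Hypothetical: remember the Ruby source only when transpilation changed it.
function annotate(jsRegexp, rubySource) {
  if (jsRegexp.source !== rubySource) {
    jsRegexp.rubySource = rubySource; // made-up property name
  }
  return jsRegexp;
}

// Prefer the preserved Ruby source, fall back to the JS source.
function originalSource(jsRegexp) {
  return jsRegexp.rubySource ?? jsRegexp.source;
}

// A union built from the original sources can be transpiled again as a whole.
function unionSource(...regexps) {
  return regexps.map(originalSource).join("|");
}

const word = annotate(/[\p{Alphabetic}]+/u, "[[:alpha:]]+"); // differs: annotated
const digits = annotate(/\d+/u, "\\d+");                     // identical: not annotated

console.log(unionSource(word, digits)); // [[:alpha:]]+|\d+
```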
This PR has been sponsored by Ribose Inc.
The performance impact must be investigated.
The third iteration of this patch fixes a problem where a regexp like [^a] would be treated as containing a ^ assertion.
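One way to avoid this kind of false positive is to skip escapes and bracket expressions while scanning for assertions. The following is a minimal sketch under that assumption; `findLineAnchors` is a hypothetical name, and the scanner deliberately ignores nested bracket expressions and other corner cases.

```javascript
// Hypothetical helper: list the ^ and $ assertions that appear outside
// escapes and outside bracket expressions such as [^a].
function findLineAnchors(source) {
  const found = [];
  let inClass = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\") { i++; continue; }  // skip the escaped character too
    if (inClass) {
      if (ch === "]") inClass = false;   // end of the bracket expression
      continue;                          // ^ here means negation, not an assertion
    }
    if (ch === "[") { inClass = true; continue; }
    if (ch === "^" || ch === "$") found.push(ch);
  }
  return found;
}

console.log(findLineAnchors("[^a]"));     // []
console.log(findLineAnchors("^[^a]+$"));  // [ '^', '$' ]
console.log(findLineAnchors("\\^a"));     // []
```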