closure-compiler
closure-compiler copied to clipboard
feat: add full Unicode support for Javascript identifiers
Further resolves #3639, aims to prevent 99.999% of future issues that might occur from using non-English characters (except the few Greek and Latin chars manually included) in an identifier. In the current build, the following errors will occur (including a known issue with U+FF3F
):
ERROR - [JSC_PARSE_ERROR] Parse error. Character '〩' (U+3029) is not a valid identifier start char
1| var a〩 = 10;
^
ERROR - [JSC_PARSE_ERROR] Parse error. Character '〺' (U+303A) is not a valid identifier start char
1| var 〺b = 10;
^
The caret is marked in the wrong spot for the last example, but the msg makes it clear that it's U+FF3F
causing problems:
ERROR - [JSC_PARSE_ERROR] Parse error. Character '_' (U+FF3F) is not a valid identifier start char
1| var aesthetic_;
^
After:
var a〩 = 10;
var a\u3029=10;
var 〺b = 10;
var \u303ab=10;
var aesthetic_;
var \uff41\uff45\uff53\uff54\uff48\uff45\uff54\uff49\uff43\uff3f;
Existing logic is retained for isIdentifierStart
and isIdentifierPart
, but if the character exists outside of the hardcoded ranges it is checked against regexpu-compiled patterns provided by TC39.
I would not expect that regexpu approach is very performant, but all previous optimizations were migrated into the UnicodeMatch
class, so the only impact is if you use a previously unsupported character (which would have thrown an error and exited altogether prior to this patch). I will likely transform the matches into large (a <= ch & ch <= b)
blocks to further optimize, but I wanted to put this up for review and feedback (especially regarding short-circuiting and branching conditions, which I am not super familiar with).
Apologies if this is an unwanted PR, I just figured I'd take a crack at it.
Update
I figured manually dumping a chain of bitwise operators would be faster than regexpu, but the compiled regex is consistently faster than the raw bitwise chain unless short-circuiting is used: https://repl.it/@christiantjl/UnicodeMatch
Jupyter notebook for generating the bitwise chain at: https://github.com/christiantjl/unicode-category-ranges/blob/master/Unicode.ipynb
This isn't merged yet. I'm testing this internally and will update soon.
I'm seeing multiple internal targets failing to build with this error message - 'The import java.util.regex cannot be resolved'.
I'll take another stab at this today and repro+debug the failing tests.
Hey @rishipal, I saw your comment about errors, I did not see any when I submitted the PR. Apologies for the delay, I have not had much free time to review this.
The 'The import java.util.regex cannot be resolved' failures get fixed after switching to re2j regex library. However, I see multiple Scanner tests failing with that. I will update shortly.
We have dropped support for the JS (GWT version) of the Closure Compiler in favor of the Graal native image versions. Do you feel that this is still useful?
@concavelenz As I understand it, this was an issue regardless of build - it was just an issue of extended Unicode characters not accurately being counted as valid identifier parts, IIRC (this work is not super fresh in my mind, apologies).
Update: I tested this again over the weekend. The internal tests are passing now. I will commit this internally.
@rishipal Thank you for your help on this PR. Was this merged internally?
Sorry to bother!
Hi, this is not merged yet. A comment from John on the pending CL suggests that this may not be needed after we stopped support for GWT/J2CL version of the compiler. I will confirm the priority of this and respond by EOD tomorrow.
The Scanner test failures did not get fully resolved contrary to my https://github.com/google/closure-compiler/pull/3647#issuecomment-779989990. This PR is decided to be deprioritized and shelved.