closure-compiler feat: add full Unicode support for Javascript identifiers

Further resolves #3639, aims to prevent 99.999% of future issues that might occur from using non-English characters (except the few Greek and Latin chars manually included) in an identifier. In the current build, the following errors will occur (including a known issue with U+FF3F):

ERROR - [JSC_PARSE_ERROR] Parse error. Character '〩' (U+3029) is not a valid identifier start char
  1| var a〩 = 10;
          ^

ERROR - [JSC_PARSE_ERROR] Parse error. Character '〺' (U+303A) is not a valid identifier start char
  1| var 〺b = 10;
         ^

The caret is marked in the wrong spot for the last example, but the msg makes it clear that it's U+FF3F causing problems:

ERROR - [JSC_PARSE_ERROR] Parse error. Character '＿' (U+FF3F) is not a valid identifier start char
  1| var ａｅｓｔｈｅｔｉｃ＿;
                  ^

After:

var a〩 = 10;
var a\u3029=10;

var 〺b = 10;
var \u303ab=10;

var ａｅｓｔｈｅｔｉｃ＿;
var \uff41\uff45\uff53\uff54\uff48\uff45\uff54\uff49\uff43\uff3f;

Existing logic is retained for isIdentifierStart and isIdentifierPart, but if the character exists outside of the hardcoded ranges it is checked against regexpu-compiled patterns provided by TC39.

I would not expect that regexpu approach is very performant, but all previous optimizations were migrated into the UnicodeMatch class, so the only impact is if you use a previously unsupported character (which would have thrown an error and exited altogether prior to this patch). I will likely transform the matches into large (a <= ch & ch <= b) blocks to further optimize, but I wanted to put this up for review and feedback (especially regarding short-circuiting and branching conditions, which I am not super familiar with).

Apologies if this is an unwanted PR, I just figured I'd take a crack at it.

Update

I figured manually dumping a chain of bitwise operators would be faster than regexpu, but the compiled regex is consistently faster than the raw bitwise chain unless short-circuiting is used: https://repl.it/@christiantjl/UnicodeMatch

Jupyter notebook for generating the bitwise chain at: https://github.com/christiantjl/unicode-category-ranges/blob/master/Unicode.ipynb

Jul 21 '20 02:07 ctjlewis

This isn't merged yet. I'm testing this internally and will update soon.

Oct 29 '20 21:10 rishipal

I'm seeing multiple internal targets failing to build with this error message - 'The import java.util.regex cannot be resolved'.

Nov 04 '20 04:11 rishipal

I'll take another stab at this today and repro+debug the failing tests.

Dec 11 '20 17:12 rishipal

Hey @rishipal, I saw your comment about errors, I did not see any when I submitted the PR. Apologies for the delay, I have not had much free time to review this.

Dec 11 '20 17:12 ctjlewis

The 'The import java.util.regex cannot be resolved' failures get fixed after switching to re2j regex library. However, I see multiple Scanner tests failing with that. I will update shortly.

Dec 14 '20 17:12 rishipal

We have dropped support for the JS (GWT version) of the Closure Compiler in favor of the Graal native image versions. Do you feel that this is still useful?

Feb 11 '21 20:02 concavelenz

@concavelenz As I understand it, this was an issue regardless of build - it was just an issue of extended Unicode characters not accurately being counted as valid identifier parts, IIRC (this work is not super fresh in my mind, apologies).

Feb 14 '21 22:02 ctjlewis

Update: I tested this again over the weekend. The internal tests are passing now. I will commit this internally.

Feb 16 '21 17:02 rishipal

@rishipal Thank you for your help on this PR. Was this merged internally?

Sorry to bother!

Mar 05 '21 06:03 ctjlewis

Hi, this is not merged yet. A comment from John on the pending CL suggests that this may not be needed after we stopped support for GWT/J2CL version of the compiler. I will confirm the priority of this and respond by EOD tomorrow.

Mar 10 '21 01:03 rishipal

The Scanner test failures did not get fully resolved contrary to my https://github.com/google/closure-compiler/pull/3647#issuecomment-779989990. This PR is decided to be deprioritized and shelved.

Mar 10 '21 17:03 rishipal

closure-compiler closure-compiler copied to clipboard

feat: add full Unicode support for Javascript identifiers

Update

closure-compiler
closure-compiler copied to clipboard