closure-compiler
Piping Closure compiler stderr output to Python with Unicode characters on Windows problem
STR:
a.py
import subprocess
subprocess.run(['npx', 'google-closure-compiler','--charset=UTF8','--js','a.js','--js_output_file','o.js'], encoding='utf-8', stderr=subprocess.PIPE, shell=True)
a.js
if (4 == NaN) console.log('á');
generates an error
C:\emsdk\emscripten\main>python a.py
Traceback (most recent call last):
File "C:\emsdk\emscripten\main\a.py", line 2, in <module>
subprocess.run(['npx', 'google-closure-compiler','--charset=UTF8','--js','a.js','--js_output_file','o.js'], encoding='utf-8', stderr=subprocess.PIPE, shell=True)
File "C:\Python311\Lib\subprocess.py", line 550, in run
stdout, stderr = process.communicate(input, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Python311\Lib\subprocess.py", line 1197, in communicate
stderr = self.stderr.read()
^^^^^^^^^^^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 135: invalid continuation byte
My impression here is that Closure has emitted the ISO-8859-1 encoding value of á to stderr, which has the hex value of 0xe1. However, the encoding='utf-8' argument in Python expects the stderr to be printed out as UTF-8.
I could not find a command line directive in https://github.com/google/closure-compiler/wiki/Flags-and-Options to help control Closure stdout/stderr output encoding.
Which encoding does Closure use for stdout/stderr printing? Is it ISO-8859-1 by intent? Or should it have been UTF-8 and Closure accidentally printed out ISO-8859-1?
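The suspicion above can be checked in isolation, without running Closure. The following is a minimal sketch; the sample string is taken from a.js, and the variable names are only illustrative. It shows that a lone 0xE1 byte is what ISO-8859-1 produces for á, and that Python's UTF-8 decoder rejects it with the same kind of error as in the traceback.

```python
# Encode the statement from a.js the way an ISO-8859-1 writer would:
# 'á' becomes the single byte 0xE1.
latin1_bytes = "console.log('á');".encode("iso-8859-1")

# Decoding those bytes as UTF-8 fails, because 0xE1 announces a
# multi-byte sequence but the next byte (0x27, the quote) is not a
# continuation byte.
try:
    latin1_bytes.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Decoding with the encoding that was actually used round-trips cleanly.
latin1_text = latin1_bytes.decode("iso-8859-1")

print(utf8_error)
print(latin1_text)
```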
I cannot tell from the example a.js file in the description whether the á character is correctly encoded as UTF-8 in the file you're actually using when you see this error.
Can you confirm that the input file, a.js is actually correct utf-8?
Actually, could you just attach 2 files to this issue?
- The actual a.js file.
- The exact output from closure-compiler itself (i.e. the input that Python is seeing).
Here are the input files: a.zip
C3 A1 is 11000011 10100001, which is of the form 110xxxxx 10yyyyyy, i.e. a leading byte followed by a continuation byte. See e.g. Wikipedia on UTF-8 encoding. The Unicode code point in this case is xxxxxyyyyyy = 00011 100001 = 0xE1, i.e. https://www.compart.com/en/unicode/U+00E1.
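The bit arithmetic above can be reproduced as a short worked example. This is only a sketch of the two-byte case of UTF-8; the variable names are illustrative.

```python
# The two bytes found in a.js for the character in question.
data = bytes([0xC3, 0xA1])
lead, cont = data

# A two-byte UTF-8 sequence has the form 110xxxxx 10yyyyyy.
assert lead >> 5 == 0b110   # leading byte marker
assert cont >> 6 == 0b10    # continuation byte marker

# Strip the markers and concatenate the payload bits: xxxxx yyyyyy.
code_point = ((lead & 0b00011111) << 6) | (cont & 0b00111111)

print(hex(code_point))       # payload bits as a code point
print(chr(code_point))       # the corresponding character
print(data.decode("utf-8"))  # Python's own decoder, for comparison
```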
The exact output from closure-compiler itself. (i.e. the input that python is seeing)
The test case does not produce any JavaScript output from closure-compiler. Python attempts to capture the stderr error message from Closure process, but Python croaks internally since it cannot decode the stderr bytes that Closure is outputting, and so does not produce any output to the calling a.py file.
Executing the following python file instead
import subprocess
ret = subprocess.run(['npx', 'google-closure-compiler','--charset=UTF8','--js','a.js','--js_output_file','o.js'], encoding='iso-8859-1', stderr=subprocess.PIPE, shell=True)
print(ret.stderr)
does not throw an exception, and instead causes Python to print the stderr as expected:
a.js:1:4: WARNING - [JSC_SUSPICIOUS_NAN] Comparison against NaN is always false. Did you mean isNaN()?
1| if (4 == NaN) console.log('á');
^^^^^^^^
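An alternative to passing a fixed encoding is to capture stderr as raw bytes and decode afterwards, so that a decode failure cannot abort subprocess.run itself. The sketch below is an assumption and not something used in this thread: decode_stderr is a hypothetical helper, and a child Python process stands in for closure-compiler so that the example is self-contained.

```python
import subprocess
import sys


def decode_stderr(raw: bytes) -> str:
    """Hypothetical helper: try UTF-8 first, fall back to ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this never raises,
        # but it may mis-render text that was in some other encoding.
        return raw.decode("iso-8859-1")


# Stand-in for closure-compiler: a child process that writes 'á' to stderr
# as the single ISO-8859-1 byte 0xE1, surrounded by quotes.
child_code = "import sys; sys.stderr.buffer.write(bytes([0x27, 0xE1, 0x27]))"

# No encoding= argument, so ret.stderr is bytes and nothing is decoded yet.
ret = subprocess.run([sys.executable, "-c", child_code], stderr=subprocess.PIPE)

captured = decode_stderr(ret.stderr)
print(captured)
```

Because the ISO-8859-1 fallback accepts any byte sequence, this approach trades an exception for possibly mis-decoded text. It is a diagnostic aid rather than a fix for the underlying encoding mismatch.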
What I want to know is this:
Is closure-compiler actually generating an invalid character sequence to stderr, or is something else going on?
One thing that could be happening is that the stderr output from closure-compiler is getting mixed with output from either its own stdout or from some other process that happens to share the same output stream. Due to buffering, the 2-byte sequence for 'á' that closure-compiler sends to stderr could be interrupted by output from somewhere else.
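To make the interleaving scenario concrete, here is a small sketch of what a UTF-8 decoder would see in two cases: the 2-byte sequence merely split across two reads, and the sequence interrupted by a foreign byte. The foreign byte b"X" is an arbitrary stand-in chosen for illustration.

```python
import codecs

utf8_a_acute = "á".encode("utf-8")  # the two bytes C3 A1

# Case 1: the sequence is split across two reads but nothing is inserted.
# An incremental decoder buffers the leading byte and completes the
# character when the continuation byte arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
split_result = decoder.decode(utf8_a_acute[:1])
split_result += decoder.decode(utf8_a_acute[1:], final=True)

# Case 2: another writer's byte lands between the two halves.
interleaved = utf8_a_acute[:1] + b"X" + utf8_a_acute[1:]
try:
    interleaved.decode("utf-8")
    interleave_error = None
except UnicodeDecodeError as exc:
    interleave_error = exc

print(split_result)
print(interleave_error)
```

In this sketch the decoder complains about the leading byte 0xC3 in the interleaved case. The traceback in the original report names byte 0xe1 instead, which may be worth keeping in mind when weighing this explanation.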
Thanks for providing the a.js file and your command line. We can use that to find out what the actual stderr output from the latest closure-compiler build is for this case.
If this problem is in some way actually tied to Windows, we're unlikely to fix it ourselves as none of the core team uses Windows when working on closure-compiler.
Thank you for supplying the a.js file.
- I downloaded it.
- I checked out and built the latest version of closure-compiler as a Java jar file.
- I stored the path to that jar file in $ccjar.
- I ran the following commands to check the behavior.
First confirm that my terminal / OS is using UTF-8
$ echo $LANG
en_US.UTF-8
$ echo á |xxd
00000000: c3a1 0a
Yep. c3a1 is the correct byte pair for this UTF-8 character as stated in a previous comment.
Now confirm that the character is correct in a.js
$ xxd a.js
00000000: 6966 2028 3420 3d3d 204e 614e 2920 636f if (4 == NaN) co
00000010: 6e73 6f6c 652e 6c6f 6728 27c3 a127 293b nsole.log('..');
00000020: 0d0a ..
Yep.
Now run the compiler with the options as described in earlier comments and save its stderr output into err.out and use xxd to check the contents of that file.
$ java -jar $ccjar --charset=UTF8 --js a.js --js_output_file o.js 2> err.out
$ xxd err.out
00000000: 612e 6a73 3a31 3a34 3a20 5741 524e 494e a.js:1:4: WARNIN
00000010: 4720 2d20 5b4a 5343 5f53 5553 5049 4349 G - [JSC_SUSPICI
00000020: 4f55 535f 4e41 4e5d 2043 6f6d 7061 7269 OUS_NAN] Compari
00000030: 736f 6e20 6167 6169 6e73 7420 4e61 4e20 son against NaN
00000040: 6973 2061 6c77 6179 7320 6661 6c73 652e is always false.
00000050: 2044 6964 2079 6f75 206d 6561 6e20 6973 Did you mean is
00000060: 4e61 4e28 293f 0a20 2031 7c20 6966 2028 NaN()?. 1| if (
00000070: 3420 3d3d 204e 614e 2920 636f 6e73 6f6c 4 == NaN) consol
00000080: 652e 6c6f 6728 27c3 a127 293b 0d0a 2020 e.log('..');..
00000090: 2020 2020 2020 205e 5e5e 5e5e 5e5e 5e0a ^^^^^^^^.
000000a0: 0a30 2065 7272 6f72 2873 292c 2031 2077 .0 error(s), 1 w
000000b0: 6172 6e69 6e67 2873 290a arning(s).
Yep. We again see "c3" and "a1" used as the 2-byte encoding in bytes at positions 0x87 and 0x88.
The Java jar executing in Linux is definitely generating stderr using UTF-8 encoding.
Probably the closure-compiler you're running has been converted from a jar file to a native Windows binary using Graal, because I think that's what the google/closure-compiler-npm code that generates the NPM release tries to make the default.
I'm not sure whether the different behavior you see is the result of Windows behavior, of the behavior of Java on Windows (as emulated by Graal), or of something else.
One simplification/note to the bug test case is that the original a.py was
import subprocess
subprocess.run(['npx', 'google-closure-compiler','--charset=UTF8','--js','a.js','--js_output_file','o.js'], encoding='utf-8', stderr=subprocess.PIPE, shell=True)
However, this bug is unrelated to the --charset=UTF8 parameter: it also occurs with the shorter invocation
import subprocess
subprocess.run(['npx', 'google-closure-compiler','--js','a.js','--js_output_file','o.js'], encoding='utf-8', stderr=subprocess.PIPE, shell=True)
It is expected that the issue does not occur on Linux or macOS, since those OSes default to UTF-8 widely.
In my Windows shell I have changed my active codepage to UTF-8, i.e.
C:\emsdk\emscripten\main>chcp
Active code page: 65001
See chcp 65001.
However, this change does not affect the bug, so this is not a Windows terminal/console issue. The cause lies somewhere in the libraries involved, either in Closure or elsewhere, as observed above.
We successfully worked around this in Emscripten code by specifying a directive encoding='iso-8859-1' if WINDOWS else 'utf-8' when invoking Closure.
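A minimal sketch of that workaround follows. It is not the actual Emscripten code: the WINDOWS constant is defined here from sys.platform as an assumption, and closure_run_kwargs is a hypothetical helper that only builds the keyword arguments for subprocess.run.

```python
import subprocess
import sys

# Assumption: a platform flag similar in spirit to the WINDOWS constant
# mentioned above, derived here from sys.platform.
WINDOWS = sys.platform == "win32"


def closure_run_kwargs(windows: bool = WINDOWS) -> dict:
    """Hypothetical helper: keyword arguments for invoking Closure."""
    return {
        # Closure's stderr was observed to be ISO-8859-1 on Windows
        # and UTF-8 elsewhere.
        "encoding": "iso-8859-1" if windows else "utf-8",
        "stderr": subprocess.PIPE,
    }


print(closure_run_kwargs(windows=True)["encoding"])
print(closure_run_kwargs(windows=False)["encoding"])
```

The result would be passed along with the command from the reproduction steps, e.g. subprocess.run(cmd, shell=True, **closure_run_kwargs()), where cmd is the npx argument list shown earlier.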