fix: enable UTF-8 console output for jbang.{cmd,ps1,sh} on Windows
jbang --help on Windows did not display Unicode characters correctly. With this change it does.
Fixes https://github.com/jbangdev/jbang/issues/2350
Although this definitely works, the issue I have with it is that the encoding will persist after JBang exits. So we have effectively changed the console's encoding, which affects any commands run afterwards.
Now in itself that might not be bad, in many (most?) cases this might actually improve things for the user. But it would be somewhat weird to see (some) apps behaving differently before JBang was run vs after JBang was run.
Of course we could try resetting the code page back again before exiting. Or point users to documentation on how to change codepages on the system level. (Although I definitely like being able to show correct output regardless of the user's system settings)
100% agree with what you are saying. However, I think this PR is a better approach than asking users to enable UTF-8 globally, because:
- Most will never do this.
- If it is set globally, the risk of it breaking something is higher than when the setting is localized to the specific Windows shell that JBang is running in.
I think it is better to enable this setting as the PR proposes, than not to have it enabled at all. It might break an ancient Windows command-line program that requires the default Windows code page. I don't think many of those are still in use by JBang command-line users.
Alternative CMD implementation.
REM Step 1 - save current code page (English chcp output is "Active code page: 437", so the number is token 4)
for /f "tokens=4" %%c in ('chcp') do set "ORIGINAL_CP=%%c"
REM Step 2 - set code page to UTF-8
chcp 65001 > NUL
REM Step n - restore original code page
chcp %ORIGINAL_CP% > nul
Something similar could be done for PowerShell. But I think this might be overkill.
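For Git-Bash, the save step could be sketched roughly as follows. This is only an illustration: parse_code_page is a hypothetical helper, not part of JBang, and it assumes chcp prints the code page as the last number on the line, which sidesteps counting tokens.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of JBang): extract the code page number
# from chcp output such as "Active code page: 437".
parse_code_page() {
  # Keep the last run of digits so the wording before the number does not matter.
  printf '%s\n' "$1" | grep -oE '[0-9]+' | tail -n 1
}

# Assumed usage on Windows, where chcp.com is on the PATH:
#   original_cp=$(parse_code_page "$(chcp.com)")
#   chcp.com 65001 > /dev/null
#   ... run the command ...
#   chcp.com "$original_cp" > /dev/null
```

Taking the last number rather than a fixed token position is a design choice in this sketch, so that a differently worded chcp message would still parse.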
CMD script code to only set the code page to 65001 if required:
setlocal
for /F "tokens=4" %%c in ('chcp') do (
if "%%c" NEQ "65001" (
chcp 65001 > nul
)
)
endlocal
You might be right, I'll let @maxandersen decide
I do think we should clean up - it's not good behaviour to modify users environment. We do the env cleanup for other variables; especially on windows.
My main concerns are:
- does it affect the execution speed? I doubt it but windows can be surprising :)
- impact on users execution. Should we run the users app with this on or off? I'm leaning towards keeping it on as UTF-8 is just easier :) but could imagine we would need to offer flag to NOT apply it - so should we do it just for jbang exec part?
- should we do similar on other OS for consistent behaviour ?
does it affect the execution speed? I doubt it but windows can be surprising :)
Adds 7ms overhead on my 8-core dev machine
impact on users execution. Should we run the users app with this on or off?
It should be on, we want to have nice things on Windows too! :-)
set JBANG_APP_JAVA_OPTIONS=-Xmx1g
jbang run JvmRuntimeOpts.java
--- 🚀 JVM Runtime Options (VM Arguments) ---
Option 1: -Xmx1g
--- 📝 Application Arguments ---
No Application Arguments were passed to the main method.
--- ✅ Execution Complete ---
should we do similar on other OS for consistent behaviour ?
Not needed on Linux or MacOS
Fixed jbang.ps1 - it now restores the code page to its original value.
jbang.cmd recursively calls itself ... oh no.
if "!binaryPath!"=="" if "!jarPath!"=="" (
if not exist "%JBDIR%\bin\jbang.jar" (
powershell -NoProfile -ExecutionPolicy Bypass -NonInteractive -Command "%~dp0jbang.ps1 version" > nul
if !ERRORLEVEL! NEQ 0 ( exit /b %ERRORLEVEL% )
)
call "%JBDIR%\bin\jbang.cmd" %*
exit /b %ERRORLEVEL%
)
I'm going to rename jbang.cmd to _jbang.cmd and then create a new jbang.cmd with the following code:
@echo off
rem Save current code page
for /f "tokens=2 delims=:" %%a in ('chcp') do set "_OriginalCP=%%a"
set "_OriginalCP=%_OriginalCP: =%"
rem Enable UTF-8 code page
chcp 65001 > nul
call _jbang.cmd %*
rem Restore original code page
chcp %_OriginalCP% > nul
exit /b <n> to the rescue.
Only one test case fails on Windows:
@Test
public void shouldHandleSpecialCharacters() {
assertThat(shell("jbang echo.java \" ~!@#$%^&*()-+\\:;\'`<>?/,.{}[]\"")).outIsExactly(
"0: ~!@#$%^&*()-+\\:;'`<>?/,.{}[]" + lineSeparator());
}
To get the special character test to pass, I reduced it to:
@Test
public void shouldHandleSpecialCharacters() {
assertThat(shell("jbang echo.java \" ~@#$&*()-+\\:;'`<>?/,.{}[]\"")).outIsExactly(
"0: ~@#$&*()-+\\:;'`<>?/,.{}[]" + lineSeparator());
}
Removed special CMD shell characters % and ! and ^
I did additional testing with the standard CMD escapes for these three characters.
- ^! is translated to !
- %% is translated to %
- ^^ should be translated to ^
The first two actually work fine in an input string to jbang.cmd. However, ^^ does not translate to ^ but to ^^^^.
why is unicode failing the escape code? are the special characters affected based on codepage? i know windows can be weird but that seems really weird. /cc @quintesse as he had the "most fun" with escape character handling.
It is fortunately not Unicode related. The string value of %* in call jbangx.cmd %* is modified by CMD.exe when it is passed from jbang.cmd to jbangx.cmd. I confirmed this with several echo debug statements in the CMD scripts.
Another approach would be to revert to the previous implementation and conditionally set the code page to 65001 using a new environment variable, JBANG_CMD_NO_UTF8. The code page will only be set to UTF-8 if JBANG_CMD_NO_UTF8 is not set to "1".
if not "%JBANG_CMD_NO_UTF8%" == "1" (
chcp 65001 > nul
)
Adding line
chcp.com 65001 > /dev/null # note the .COM extension
to ~/.bashrc for Git-Bash on Windows resolves the issue for Git-Bash as well.
@maxandersen, are you happy with the addition of the CMD-specific environment variable JBANG_CMD_NO_UTF8?
It is fortunately not Unicode related. The string value of %* in call jbangx.cmd %* is modified when passed to jbangx.cmd from jbang.cmd by CMD.exe. I confirmed this with several echo debug statements in the CMD scripts.
Not following - what is jbangx.cmd in that sentence? And how is that triggered in this PR? Or are you talking about something only happening in your own testing?
Also, googling around - is this only an issue in cmd.exe? Windows Terminal and other modern terminals do not have this issue, correct?
if not "%JBANG_CMD_NO_UTF8%" == "1"
we used "true" in other places so lets stay consistent.
and shouldn't it be JBANG_WIN_NO_UTF8 ?
I'm done with coding and testing. It all works rather well with CMD, PowerShell and Git-Bash (without having to modify .bashrc).
It is fortunately not Unicode related. The string value of %* in call jbangx.cmd %* is modified when passed to jbangx.cmd from jbang.cmd by CMD.exe. I confirmed this with several echo debug statements in the CMD scripts.
not following - what is jbangx.cmd in that sentence? and how is that triggered in this PR ? or are you talking about something only happening in your own testing?
@wfouche still confused what this was about?
I tried to add functionality to jbang.cmd to restore the code page at the end of the script. This was rather challenging because jbang.cmd calls itself somewhere in the code, which complicates the restore logic given how weird the CMD script language is. So I renamed jbang.cmd to jbangx.cmd and created a wrapper jbang.cmd script that saves the code page, changes it to 65001, calls jbangx.cmd and, when it returns, restores the code page. This all worked perfectly fine except for special characters, which CMD Shell processed automatically in the call from jbang.cmd to jbangx.cmd. I could not find a satisfactory fix for that after trying really hard. All of those changes were reverted and the current implementation (three lines) was created for CMD Shell.
It is best to apply the KISS principle to the three implementations - Keep It Super Simple.
The PR now has a small snippet of documentation as well.
This is starting to feel like a Don Quixote kind of effort.
jbang.sh uses exec under some circumstances, which replaces the shell process, so the script never reaches its end to restore the code page.
if [ $err -eq 255 ]; then
eval "exec $output"
# should not reach here
err=$?
elif [ -n "$output" ]; then
echo "$output"
fi
return $err
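One way around the exec problem, shown only as a rough sketch, would be to run the command as a child process so that control returns to the script afterwards. run_then_restore and restore_cp are hypothetical names; restore_cp stands in for something like chcp.com "$original_cp" and just prints a message here.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for restoring the saved code page,
# e.g. chcp.com "$original_cp" > /dev/null on Windows.
restore_cp() {
  echo "code page restored"
}

# Run the given command as a child process instead of exec, so the
# script gets control back, can restore the code page, and can still
# return the command's exit status.
run_then_restore() {
  err=0
  "$@" || err=$?
  restore_cp
  return "$err"
}
```

The trade-off is that the shell stays alive as the parent of the launched command for its whole lifetime, which is presumably what the exec in jbang.sh is there to avoid.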
This is starting to feel like a Don Quixote kind of effort.
Yeah, which is why I suggested to just forget about restoring the code page. But of course that could result in users getting confused if things start to work differently after they've run jbang.
Now, they can prevent that by setting the JBANG_WIN_NO_UTF8 variable, but then we're back again to the user having to do something to make things work. (And where do they set that variable permanently? Not all users know how to do that and perhaps do not want to go through that trouble)
Of course if we turn it around and have a JBANG_WIN_ENABLE_UTF8 then we're back to square one as well because who is going to set that instead of doing the "right" thing (= setting "Beta: Use Unicode UTF-8 for worldwide language support")?
So I'm conflicted. It would be nice to do this, but I fear that unless we can consistently reset the code page back to its original value, we might cause more issues than we solve. But @maxandersen is more of a risk taker than me; perhaps he feels differently and we simply forget about resetting the code page.
Of course if we turn it around and have a JBANG_WIN_ENABLE_UTF8
I've come to the same conclusion, and the latest round of changes support this way of looking at the UTF-8 world on Windows.
If you opt-in then you're happy for UTF-8 to remain enabled in the current shell, once JBang exits and is no longer running.
I've found that enabling the "Beta: Use Unicode UTF-8 for worldwide language support" setting is intrusive: it causes Jython to print warnings, so it might break other programs as well.
Who will actually set JBANG_WIN_ENABLE_UTF8 to true? If we draw attention to this setting in jbang --help or when just jbang is run, we could inform users of JBang on Windows that UTF8 support can be enabled.
On Windows using setx from the CMD, PowerShell or Bash command-line one can create a new environment variable and assign a value to it. The value is persistent across reboots.
setx JBANG_WIN_ENABLE_UTF8 true
OK, this is the best approach yet. :-)
JBang on Windows will now automatically enable UTF8 support if it is not enabled.
- If env-var JBANG_WIN_ENABLE_UTF8 is not defined:
  - Use setx to create the env-var in user space
  - Use set (CMD), export (Bash) or ?? (PowerShell) to create the env-var in process space.
- If env-var JBANG_WIN_ENABLE_UTF8 == "true":
  - chcp 65001
A user who does not want UTF8 support enabled can simply run setx to set the env-var to false and then restart the shell session:
- setx JBANG_WIN_ENABLE_UTF8 false
- Close the shell and open a new one.
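For the Bash case, the decision logic described above might look roughly like this. It is a sketch under assumptions rather than the code in the PR: enable_utf8_if_opted_in is a hypothetical function name, and setx and chcp.com are only called when they are found on the PATH, so the function does nothing console-related on other systems.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the opt-in logic for Git-Bash.
enable_utf8_if_opted_in() {
  if [ -z "${JBANG_WIN_ENABLE_UTF8+x}" ]; then
    # Variable not defined yet: default to "true" in this process ...
    export JBANG_WIN_ENABLE_UTF8=true
    # ... and persist it in user space for future shells (Windows only).
    if command -v setx > /dev/null 2>&1; then
      setx JBANG_WIN_ENABLE_UTF8 true > /dev/null
    fi
  fi
  if [ "$JBANG_WIN_ENABLE_UTF8" = "true" ]; then
    # Switch the console to UTF-8 only when the tool is available.
    if command -v chcp.com > /dev/null 2>&1; then
      chcp.com 65001 > /dev/null
    fi
  fi
}
```

With the variable set to false beforehand, the function leaves both the variable and the code page untouched, which matches the opt-out path described above.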
@quintesse, as a fellow Windows user, maybe you should be the first to have a look at this new approach. Support for CMD, PowerShell, and Git-Bash has been extensively tested.