gettext-iconv-windows

Help messages are not displayed correctly

Open maznobu opened this issue 7 months ago • 30 comments

Hello.

This is my first time using gettext.

I encountered what appears to be a bug in the Windows Terminal environment on Japanese Windows 11.

I'm using a version that doesn't exhibit the bug, so I'm not in a rush to report it here.

  • Execution environment

    • OS: Japanese Windows 11 Pro for Workstations
      • This is not a multilingual version of Windows
      • The default encoding for Japanese Windows is Shift_JIS (code page 932).
  • Version: 25H2 (build 26220.5770)

  • Execution terminal

    • Windows Terminal:
      • PowerShell 7.5.2
      • cmd.exe, Version 10.0.26220.5770
  • Command executed

    • msgfmt --help
    • The same applies to other commands with "--help".
  • Issue

    • The help message is garbled.
  • Versions that do not exhibit the issue:

    • gettext 0.21 + iconv 1.16 (released on August 12, 2020)
  • Versions that exhibit garbled characters:

    • gettext0.26-iconv1.17-static-64
    • gettext0.25.1-iconv1.17-static-64
    • gettext0.25-iconv1.17-static-64
    • gettext0.23-iconv1.17-static-64
    • gettext0.22.5a-iconv1.17-static-64
    • Packages other than those listed above have not been tested.

Other things I've tried:

  • In PowerShell, set $OutputEncoding to each encoding listed under *2 and display the help message.
  • In cmd.exe, use the chcp command to switch to each code page listed under *2 and display the help message.
    • *2 Encoding:
      • Shift_JIS (932) - Default for Japanese Windows
      • UTF-8 (65001)
      • EUC-JP (51932)
      • Latin-1 (850)
      • ASCII (20127)
  • Environment variables LANG=utf8.ja_JP, LC_ALL=utf8.ja_JP
  • Note that the file output by "gettext --help > help.txt" displays correctly in Shift-JIS.
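To see why the same bytes can look fine under one setting and garbled under another, the following sketch encodes a short Japanese sample as CP932 and decodes it under each code page from the list above. The sample string and the mapping from code page numbers to Python codec names are assumptions made for illustration; this only models how bytes are interpreted, not how Windows Terminal renders them.

```python
# Illustrative sketch: decode CP932 bytes under the code pages tried above.
# The sample text and the codec chosen for each code page are assumptions.
SAMPLE = "日本語"

# Code page number -> Python codec name (assumed equivalents).
CODECS = {
    932: "cp932",
    65001: "utf-8",
    51932: "euc_jp",
    850: "cp850",
    20127: "ascii",
}


def decode_under(data: bytes, codec: str) -> str:
    """Interpret raw bytes with the given codec, marking undecodable bytes."""
    return data.decode(codec, errors="replace")


if __name__ == "__main__":
    raw = SAMPLE.encode("cp932")
    for code_page, codec in CODECS.items():
        shown = decode_under(raw, codec)
        status = "OK" if shown == SAMPLE else "garbled"
        print(f"{code_page:>5} ({codec}): {shown!r} -> {status}")
```

Only the decoder that matches the encoder reproduces the sample; every other interpretation of the same bytes yields different characters.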

maznobu avatar Sep 12 '25 05:09 maznobu

I'm not the right person to ask for this kind of technical support: I simply compile gettext for Windows following the authors' instructions.

I think @bhaible would be the right person, or you can file a bug at https://savannah.gnu.org/bugs/?group=gettext or write to the bug-gettext mailing list.

mlocati avatar Sep 12 '25 07:09 mlocati

Let's continue to discuss it here, for simplicity.

To understand what's going on, please attach three things:

  • The file help.txt produced by gettext --help > help.txt,
  • A screenshot of running gettext --help in a cmd.exe window,
  • A screenshot of running gettext --help in a PowerShell window.

bhaible avatar Sep 12 '25 09:09 bhaible

Thanks for the reply bhaible. The requested result is as follows:

  • The file generated by "gettext --help > help.txt": help.txt
    • The above help.txt will be displayed correctly if opened with the 'shift_jis' encoding. Screenshot opened in Visual Studio Code: Image
  • A screenshot of running gettext --help in a cmd.exe window: Image

  • A screenshot of running gettext --help in a PowerShell window: Image

maznobu avatar Sep 13 '25 06:09 maznobu

Your cmd.exe screenshot looks like CP932 output, displayed in a console that assumes an 8-bit encoding consisting of ISO646-US (= US-ASCII) and Katakana.

Your PowerShell screenshot looks like CP932 output, displayed in a console that assumes the JISX0201-1976 encoding (consisting of ISO646-JP and Katakana).

These two encodings are not useful for Japanese, since they can display neither Hiragana nor Hanzi (Kanji) characters.

Therefore this is what you need to change: the encoding used by cmd.exe and the encoding used by PowerShell. Set both to CP932, and you're done.

bhaible avatar Sep 13 '25 06:09 bhaible

In other words, you need to set the code page to 932, not 201.

bhaible avatar Sep 13 '25 06:09 bhaible

As I mentioned at the beginning, I am not using a multilingual version of Windows, but the native Japanese version of Windows 11 distributed for Japan. In this edition, the default encoding is Japanese (shift_jis) and the code page is 932, but I have not changed it at all.

  • Continuing from the previous cmd.exe window, here's the result of checking the code page: Image

  • Below is a screenshot of the code page check and "gettext --help" in a PowerShell window: Image

  • Below is a different approach. I have an older version in a different folder, and it displays Japanese correctly. The code page has not been changed at all during this time. In other words, even though the code page is the same, the older version displays correctly, while the latest version displays garbled text. Image

maznobu avatar Sep 13 '25 07:09 maznobu

When I asked Google Gemini, it said, "Perhaps the libraries or something have been changed in the new version, causing it to fail to process Japanese."

I'm Japanese and have only checked the Japanese version, but China and Korea also use multi-byte code, so the same problem may be occurring in those countries. However, people in China and Korea take pride in being able to read and write English, so it's possible that they don't use a version of Windows in their native language. Japanese people don't have such preferences, so most Japanese people probably use the version of Windows in their native language.

If possible, I would like to be able to use gettext just by obtaining the binary code, without having to prepare the recompilation environment myself.

So for now I'm going to stick with an older version that works fine and hope someone will fix it eventually.

maznobu avatar Sep 13 '25 08:09 maznobu

I admit that your last round of screenshots is puzzling: One program's output (CP932) is displayed correctly, another program's output (also CP932) is displayed as if it were JISX0201.

@mlocati: Can you please look for differences between gettext.exe in one package and gettext.exe in the other package? Possible differences I can think of:

  • Imported libraries (dumpbin /imports or equivalent),
  • The meta information in the Properties window of the Windows file manager.

bhaible avatar Sep 13 '25 08:09 bhaible

I'd like to report on my current progress.

The source structure of each package (gettext-0.21 or gettext-0.22.5) is as follows:

Directory: contents
1. src/gettext-runtime/po: po/gmo resources for gettext
2. src/gettext-runtime/gnu-lib: libraries for gettext
3. src/gettext-tools/runtime/po: po/gmo resources for other commands such as msgfmt
4. src/gettext-tools/gnu-lib: libraries for other commands such as msgfmt

The multibyte handling in the printf() functions called by these commands is implemented separately for gettext and for the other commands such as msgfmt; that is, gettext and the other commands are independent.

Of the above, regarding the po files: while all other po files are written in utf-8, only the Japanese resource file, ja.po, is written in euc-jp. This may be the cause. However, when running "msgfmt --help", the display is correct in gettext-0.21-iconv-1.16, but garbled in gettext-0.22.5a-iconv-1.17. At this point, I can only say that these differences have nothing to do with the ja.po file. However, the content displayed is probably the content of these ja.po files. It may be that the older package only reads Japanese in euc-jp, or that this processing has been eliminated in the newer package.

That's all, and I'll report back to you.

maznobu avatar Sep 13 '25 14:09 maznobu

Ah, I forgot to add some additional information.

gettext-0.21-iconv-1.16, which displays the help message correctly, does not include the gettext.exe binary. This may be because it was simply not included in the build.

On the other hand, the package where the help message is garbled (gettext-0.22.5a-iconv-1.17 and later) does include a gettext.exe binary image.

These two packages (gettext-0.21-iconv-1.16 and gettext-0.22.5a-iconv-1.17) do not contain source code, but rather configuration information and patches required for building.

gettext-0.21-iconv-1.16 included numerous patches and .vbs and .sh files. On the other hand, gettext-0.22.5a-iconv-1.17 and later do not include any patches, and instead include .ps1 (PowerShell) scripts. Although I haven't examined them in detail, there are significant differences in the build environment to begin with.

Are these .vbs, .sh, and .ps1 tools created by mlocati ?

That's all, and I'll report back to you.

thank you.

maznobu avatar Sep 13 '25 14:09 maznobu

This "gettext-iconv-windows" repository doesn't contain the gettext source code: here we have "just" some scripts that I wrote to build gettext and iconv for Windows.

Those scripts are invoked by the build.yml GitHub Action - See here for the executions of that action.

mlocati avatar Sep 13 '25 16:09 mlocati

Is it possible to build on github using AWS? That's amazing! I didn't know that.

It's been eight years since I retired from the IT industry. The world has changed so fundamentally without me noticing! It's hard for an old man like me to understand.

maznobu avatar Sep 14 '25 00:09 maznobu

Thank you for your help. After some trial and error, this problem was solved.

I will explain what happened. First, navigate to the bin directory of the problematic package from pwsh.

cd $env:USERPROFILE\Downloads\gettext0.26-iconv1.17-static-64\bin

Next, run the following command:

.\gettext.exe --help | Out-File -FilePath help.txt

The output to help.txt is in utf-8.

Image

In other words, we can see that gettext --help is displaying the help message in utf-8.
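One way to check which encoding a captured file really uses is to try strict decodes in a fixed order. The sketch below is only a heuristic, and the names `guess_encoding` and `CANDIDATES` are hypothetical. Note also that in PowerShell 7, Out-File writes UTF-8 by default, so a file produced through a PowerShell pipeline may reflect PowerShell's re-encoding rather than the bytes the program emitted, whereas the cmd.exe redirection "gettext --help > help.txt" stores the program's bytes as they are.

```python
# Heuristic sketch: guess whether captured bytes are UTF-8 or CP932.
# guess_encoding and CANDIDATES are hypothetical names for illustration.
from typing import Optional

CANDIDATES = ("utf-8", "cp932")


def guess_encoding(data: bytes) -> Optional[str]:
    """Return the first candidate codec that decodes data without errors."""
    for codec in CANDIDATES:
        try:
            data.decode(codec)  # strict decode: raises on invalid sequences
        except UnicodeDecodeError:
            continue
        return codec
    return None


if __name__ == "__main__":
    text = "日本語"  # assumed sample text
    print(guess_encoding(text.encode("utf-8")))
    print(guess_encoding(text.encode("cp932")))
```

Pure ASCII input decodes under both candidates, so the helper reports the first one; the check is only informative for text that contains non-ASCII characters.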

Therefore, if utf-8 could be displayed on Windows, there would be no problem, but the default system settings for Japanese Windows only allow the shift_jis and euc-jp code pages, and do not support utf-8. This means that the following command is not available by default:

chcp 65001

Therefore, in order to make utf-8 usable in the Windows shell (powershell or cmd.exe), we will use a Windows beta feature.

First, go to "Time and Language," "Language and Region," and "Administrative Language Settings" in Windows Settings.

Image Image

Next, click the "Change system locale" button on the "Administrative" tab.

Image

"Japanese (Japan)" will be displayed under "Current system locale."

Image

Under that item, there is an option that says "Beta: Use Unicode UTF-8 for worldwide language support (U)", so check this option and restart the system.

With this setting, as long as UTF-8 is output to standard output, kanji will be displayed correctly without having to switch the chcp command.

Below is the result. Image

Thank you for all the advice above. Thank you everyone.

maznobu avatar Sep 14 '25 04:09 maznobu

Thanks @maznobu for your investigations. It's good to hear that for you, turning on the "Beta" UTF-8 mode of your Windows installation is a workaround. But I want to understand the cause and find a fix.

  1. I reproduce the issue, simply by switching my Windows 10 installation to Japanese (and rebooting, of course). To understand the dialogs, I let translate.google.com translate screenshots for me.

I'm focusing on cmd.exe (since it's less complex than PowerShell).

With the gettext0.21-iconv1.16-static-64 binaries, the output of msgfmt --help looks like this (OK): Image

With the gettext0.26-iconv1.17-static-64 binaries, the output of msgfmt --help looks like this (BUG): Image

Note in particular the "write error" message in the last line. This comes from the program. Therefore it proves that the cause lies in the program, not in the console.

  2. I compared the dumpbin /imports output of the two programs:

0.21-msgfmt-imports.txt.gz

0.26-msgfmt-imports.txt.gz

Both use msvcrt.dll. This proves that Microsoft's ucrt is not the cause.

  3. I compared the meta information of the two msgfmt.exe programs, in Windows explorer. Aside from a signing from "SignPath Foundation", I could not see a relevant difference.

  4. I asked ChatGPT: "A C program running on Windows (that uses printf) produces correct output when compiled with an older version of MSVCRT, but with a newer version of MSVCRT it somehow converts the output, with the effects that 1) the cmd.exe does not display it correctly, 2) the error indicator on the stdout stream gets set. What can be the cause?"

The answer sounds plausible, but — as usual when one asks ChatGPT a very special question — it's a hallucination: it doesn't hold up to factual verification.

  5. I compiled gettext-0.26/gettext-runtime in my usual Cygwin environment, once with mingw 5.0, once with MSVC 14, both with options --enable-relocatable --disable-shared. Then, in the cmd.exe console window, I ran set LANG=Japanese_Japan.932 followed by .\gettext.exe --help:
  • With the mingw 5.0 binary, I see the BUG.
  • With the MSVC binary, it's OK.

So, I'm now focusing on differences between the runtime libraries (mingw + msvcrt vs. UCRT). More to come...

bhaible avatar Sep 14 '25 22:09 bhaible

when running "msgfmt --help", the display is correct in gettext-0.21-iconv-1.16, but garbled in gettext-0.22.5a-iconv-1.17.

I confirm:

  • gettext-0.21-iconv-1.16 is OK,
  • gettext-0.22.5a-iconv-1.17 (r1) and subsequent releases all show the BUG.

@mlocati : What were the differences regarding the use of mingw between these two releases that you built?

  • version of mingw?
  • use of __USE_MINGW_ANSI_STDIO ?

bhaible avatar Sep 14 '25 22:09 bhaible

I'm very sorry. I haven't done any builds and I don't have a build environment.

What I did was:

  1. Download the binary packages for each version provided by mlocati.
  2. Extract the files to the Downloads folder in my Windows personal profile.
  3. Launch PowerShell by selecting "Open Windows Terminal" from the Explorer context menu.
  4. From the launched PowerShell, launch the command interpreter with "cmd".
  5. Navigate to the extracted bin directory.
  6. Run "msgfmt --help".
  7. Repeat steps 3-6 for each version and check the results.

maznobu avatar Sep 14 '25 23:09 maznobu

P.S. While searching for UTF-8 extensions in Windows, I analyzed the source code and found that the printf() function, called by gettext and msgfmt, calls functions defined in gnu-lib within the same project. (Since gettext and msgfmt are independent, they appear to have their own gnu-lib libraries.) Internally, the function uses the Windows API function _wsetlocale to obtain locale information through multiple call levels. If locale information cannot be obtained, the function obtains the setting of the LC_ALL or LANG environment variable, separates the character set and locale with a period, and obtains the result.
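The fallback described here, splitting a LANG or LC_ALL value at the period, can be sketched as follows. This is a hypothetical illustration and not the Gnulib code: the function name `split_locale` and its return shape are assumptions.

```python
# Hypothetical sketch of splitting a locale value at the period.
from typing import Optional, Tuple


def split_locale(value: str) -> Tuple[str, Optional[str]]:
    """Split 'Japanese_Japan.932' into ('Japanese_Japan', '932').

    Returns (value, None) when no codeset suffix is present.
    """
    name, sep, codeset = value.partition(".")
    if sep and codeset:
        return name, codeset
    return value, None


if __name__ == "__main__":
    for candidate in ("Japanese_Japan.932", "Japanese_Japan", "utf8.ja_JP"):
        print(candidate, "->", split_locale(candidate))
```

Under this split, a value written as utf8.ja_JP would yield ja_JP as the codeset, which suggests why the order of the two parts matters.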

I suspect that somewhere, if the character set does not exist or if the format returned by _wsetlocale is misinterpreted, it is internally fixed to "utf-8."

Is the relevant part somewhere around ctype_codeset()?

maznobu avatar Sep 15 '25 00:09 maznobu

@maznobu No need to be sorry. I am very grateful that you reported this issue, since it likely has a large impact. No one expects you to investigate the issue; that is what I (as the GNU gettext maintainer) and @mlocati need to do.

bhaible avatar Sep 15 '25 01:09 bhaible

No, I understand that very well.

The first answer was:

This means you need to set the code page to 932 instead of 201.

Despite initially explaining the situation in detail, this response made me feel slighted. So I decided to check the source code myself.

It's not the best analogy, but there are many cases, like Windows KB5063878, where many people are experiencing bugs but the manufacturer refuses to acknowledge them. In the case of KB5063878, the environment tested by the manufacturer is running the latest firmware, and the issue does not occur. However, the issues users face appear to occur on HDDs/SSDs that still have older, non-latest firmware written on them. It appears that manufacturers are not paying any attention to this issue.

So I'm truly grateful to everyone who has taken my concerns seriously in forums like this.


Returning to the topic at hand, once you've used the UTF-8 extensions option in Windows, it seems that turning the setting off doesn't completely revert the system to its original state.

Even with UTF-8 extensions turned off, the 65001 code page setting remains. Previously, using "chcp 65001" would result in an error, but it can now be switched to.

Currently, UTF-8 extensions are turned off, and the results are clearly different from when I started this report, as shown below.

  1. "msgfmt --help" after "chcp 932" in PowerShell: Japanese characters are displayed correctly, with no garbled characters. Previously, garbled characters would appear as in 3.
Image
  2. "msgfmt --help" after "chcp 65001" in PowerShell: Character garbling is different from before.
Image
  3. "msgfmt --help" after "chcp 932" in cmd.exe: The same garbled characters occur as before.
Image
  4. "msgfmt --help" after "chcp 65001" in cmd.exe: The English help message will be displayed.
Image

That concludes my report on the current situation. Thank you for your continued support.

Thank you.

maznobu avatar Sep 15 '25 02:09 maznobu

The English help message will be displayed.

The LANG (or LC_ALL) variable matters for whether the translation can be found. To be on the safe side:

  • Use set LANG=Japanese_Japan.932 in environments with code page 932,
  • Use set LANG=Japanese_Japan.65001 in environments with code page 65001.

bhaible avatar Sep 15 '25 05:09 bhaible

@mlocati : What were the differences regarding the use of mingw between these two releases that you built?

The Windows binaries for gettext up to 0.21 were built using a Docker image with:

(see the "setup" script and the "compile" script)

For the newer gettext versions I switched to the official cygwin approach, without specifying -D__USE_MINGW_ANSI_STDIO=0 because gettext should already do that (see commit 45500ab1765581d6a3b7d2e6a6c2595466de70af).

mlocati avatar Sep 15 '25 07:09 mlocati

For the newer gettext versions I switched to the official cygwin approach, without specifying -D__USE_MINGW_ANSI_STDIO=0 because gettext should already do that (see commit 45500ab1765581d6a3b7d2e6a6c2595466de70af).

This commit disables the mingw stdio functions only in the three libraries. The rest of the binaries are built with __USE_MINGW_ANSI_STDIO being 1, due to gnulib/m4/stdio_h.m4.

Let me see whether this flag is relevant for the issue...

bhaible avatar Sep 15 '25 08:09 bhaible

Here is a small reproducer, independent of Gnulib.

gettext-usage.c

I constructed this program by starting with gettext-0.26/gettext-runtime/src/gettext.c, replacing the gettext invocations with the string literals from the Japanese localization (in CP932 encoding), reducing the use of Gnulib step by step, and finally replacing two printf invocations with fputs invocations.

This program, when compiled with mingw 13 / msvcrt, and run in a Windows 10 system set to Japanese, in a cmd.exe window with chcp 932 and set LANG=Japanese_Japan.932, exhibits the following behaviour:

  • When compiled with -D__USE_MINGW_ANSI_STDIO=0, the output is correct: it uses double-width characters consistently.
  • When compiled with -D__USE_MINGW_ANSI_STDIO=1, the output of the fputs invocations (lines 1, 2, 6) is correct, whereas the output of the printf invocations is buggy (looks like interpreted in JISX0201 encoding):
Image

bhaible avatar Sep 15 '25 23:09 bhaible

Here's an explanation of the bug:

Windows consoles come with two encodings: GetACP() and GetOEMCP(). For Japanese, both have the same value (932). However, for English, German, French Windows installations, GetACP() = 1252 and GetOEMCP() = 850. For many years, output of non-ASCII characters to consoles was a PITA: While the program had to produce output in GetACP() encoding when writing to files, it had to produce output in GetOEMCP() encoding when writing to a console. The majority of programs did not do this: they produced output in GetACP() encoding always, and thus non-ASCII characters got garbled in consoles.

After many many years, Microsoft finally added a workaround in the C runtime library (msvcrt and ucrt). When a program writes a string to a console, the runtime library tests whether the output goes to a console, and if yes, it does a conversion from GetACP() encoding to GetOEMCP() encoding on the fly, in two steps: from GetACP() to UTF-16 via MultiByteToWideChar, then to GetOEMCP() via WideCharToMultiByte.
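The two-step conversion can be modeled in a few lines. In this sketch Python's str stands in for the UTF-16 intermediate, and the codec names cp1252 and cp850 stand in for the GetACP() and GetOEMCP() values of a Western European installation. The function name `acp_to_oemcp` is hypothetical, and this is a model of the idea, not the runtime library's code.

```python
# Model of the on-the-fly console conversion: ACP bytes -> text -> OEMCP bytes.
# acp_to_oemcp is a hypothetical name; the codec defaults are assumptions.
def acp_to_oemcp(data: bytes, acp: str = "cp1252", oemcp: str = "cp850") -> bytes:
    """Convert bytes from the ANSI code page to the OEM code page."""
    text = data.decode(acp)  # plays the role of MultiByteToWideChar
    return text.encode(oemcp, errors="replace")  # plays the role of WideCharToMultiByte


if __name__ == "__main__":
    ansi = "é".encode("cp1252")
    oem = acp_to_oemcp(ansi)
    print(ansi.hex(), "->", oem.hex())

    # When both code pages are 932, a complete string passes through unchanged.
    japanese = "日本語".encode("cp932")
    print(acp_to_oemcp(japanese, "cp932", "cp932") == japanese)
```

The first step only works when the bytes handed to it form complete characters, which is where writing a double-byte encoding one byte at a time becomes a problem.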

In the ucrt library, this conversion can be found in ucrt-10.0.22621.0/lowio/write.cpp. Look at the functions _write_nolock, write_requires_double_translation_nolock, and write_double_translated_ansi_nolock.

In the msvcrt library, a similar conversion takes place. This library is closed-source, but I spotted similar calls to MultiByteToWideChar and WideCharToMultiByte while single-stepping in the debugger. The BUG is here, in the msvcrt library, when the encoding is a double-byte encoding and the program produces output one byte at a time.

Now, all reasonable implementations of fputs, fprintf, etc. pass the output to the lower-level layers via a reasonably small number of fwrite calls. Only the mingw *printf functions don't do this. For instance, the __mingw_vfprintf function invokes __mingw_pformat, and this function calls fputc once for each byte. Aside from being inefficient (of course; why does fwrite exist?!), it triggers the aforementioned bug in msvcrt. And this hasn't changed between mingw 5.0 (released in 2016) and mingw 13.0 (released in 2025).
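A simplified model shows why per-byte writes break double-byte text: if each write call is converted on its own, a lead byte that arrives without its trail byte cannot be decoded as a character. The sketch below assumes a hypothetical `console_write` that converts each chunk independently with Python's cp932 codec. It models the effect described above, not the actual msvcrt code path.

```python
# Model: each write call is converted independently of the others.
# console_write and render are hypothetical names for illustration.
from typing import Iterable


def console_write(chunk: bytes, codepage: str = "cp932") -> str:
    """Convert the bytes of one write call to text, ignoring other calls."""
    return chunk.decode(codepage, errors="replace")


def render(chunks: Iterable[bytes]) -> str:
    """Concatenate what the modeled console shows for a sequence of writes."""
    return "".join(console_write(chunk) for chunk in chunks)


if __name__ == "__main__":
    data = "日本語".encode("cp932")
    whole = render([data])  # fwrite-style: one call with the full string
    per_byte = render(bytes([b]) for b in data)  # fputc-style: one call per byte
    print(repr(whole))
    print(repr(per_byte))
```

In this model, a lone byte in the range 0xA1 to 0xDF decodes as a half-width Katakana character, which is consistent with output that looks as if it were interpreted as JISX0201.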

The mingw *printf functions become active by defining __USE_MINGW_ANSI_STDIO to 1.

How do I arrive at this explanation?

Recent MSYS2 comes with a fully working gdb, that can display stack traces, and where step and stepi are working.

In such an MSYS2 environment, I built gettext-0.26/gettext-runtime with --enable-relocatable (so that the .mo files get found without filename hassles) and --disable-shared (to eliminate DLL hassles). I did so in three configurations:

  • A. mingw 13 / msvcrt
  • B. mingw 13 with __USE_MINGW_ANSI_STDIO=0 / msvcrt
  • C. mingw 13 / ucrt

The bug is visible in configuration A, and things are OK in configurations B and C.

In configuration A, I saw a call stack

main
usage
rpl_printf
rpl_vfprintf
__mingw_vfprintf
__mingw_pformat
__pformat_putc
fputc
putc

and, from there, the following functions get invoked:

 msvcrt!_flsbuf
 msvcrt!_isatty
 msvcrt!_write
 msvcrt!_setmode
 KERNELBASE!GetConsoleMode
 KERNELBASE!GetConsoleCP
 KERNELBASE!GetConsoleScreenBufferInfoEx
 WriteConsoleW
 msvcrt!isleadbyte
 msvcrt!_errno
 msvcrt!.doserrno
 strerror_s

In configuration B, I saw a call stack

main
usage
rpl_printf
rpl_vfprintf
vfprintf

and, from there, the following functions get invoked:

 ungetwc
 msvcrt!_isatty
 msvcrt!_isleadbyte_l
 strerror_s
 msvcrt!_flsbuf
 msvcrt!_write
 msvcrt!_setmode
 KERNELBASE!GetConsoleMode
 KERNELBASE!GetConsoleCP
 KERNELBASE!GetConsoleScreenBufferInfoEx
 WriteConsoleW
 msvcrt!mbtowc
 msvcrt!_mbtowc_l
 MultiByteToWideChar
 KERNELBASE!GetCPHashNode
 WideCharToMultiByte
 WriteFile

In configuration C, I saw a call stack

main
usage
rpl_printf
rpl_vfprintf
__mingw_vfprintf
__mingw_pformat
__pformat_putc
fputc
ucrtbase!fputc

and, from there, the following functions get invoked:

 ucrtbase!_get_wpgmptr
 ucrtbase!_fputc_nolock
 ucrtbase!_write
 ucrtbase!_wfsopen
 ucrtbase!_isatty
 ucrtbase!___lc_locale_name_func
 KERNELBASE!GetConsoleMode
 KERNELBASE!GetConsoleOutputCP

It's the WriteConsoleW function, when invoked on individual bytes, that produces the effect of JISX0201 characters (configuration A). In configuration B, WriteConsoleW is used as well, but on strings composed of entire characters.

bhaible avatar Sep 16 '25 00:09 bhaible

Wow, what an in-depth analysis, @bhaible!

mlocati avatar Sep 16 '25 07:09 mlocati

Wow, what an in-depth analysis

Well, before doing this change in Gnulib, where it affects all programs built with Gnulib, I figured I better be really sure of what I'm saying. And it was confusing to see that the same __mingw_vfprintf function works perfectly fine with ucrt, but not with msvcrt.

bhaible avatar Sep 16 '25 07:09 bhaible

I've added two commits to Gnulib: https://lists.gnu.org/archive/html/bug-gnulib/2025-09/msg00213.html https://lists.gnu.org/archive/html/bug-gnulib/2025-09/msg00214.html

and verified that with them, .\gettext --help in a Japanese Windows environment, in a cmd.exe with code page 932, and with set LANG=Japanese_Japan.932 shows correctly.

This is a smaller-impact fix.

Setting __USE_MINGW_ANSI_STDIO to 0 is also a fix, but it has a larger impact, as it affects also functions like sprintf and thus needs more workarounds in Gnulib's vasnprintf module.

bhaible avatar Sep 17 '25 09:09 bhaible

Great!

I've added two commits to Gnulib: https://lists.gnu.org/archive/html/bug-gnulib/2025-09/msg00213.html https://lists.gnu.org/archive/html/bug-gnulib/2025-09/msg00214.html

Is the second link right? Or should it be https://lists.gnu.org/archive/html/bug-gnulib/2025-09/msg00217.html instead?

mlocati avatar Sep 17 '25 10:09 mlocati

Is the second link right? Or should it be https://lists.gnu.org/archive/html/bug-gnulib/2025-09/msg00217.html instead?

I guessed the URL before the message went into the archive. Apparently I had no luck :)

bhaible avatar Sep 17 '25 10:09 bhaible