lib/locale.t: Fix #21697
This was caused by two failures; the two commits fix one each.
It does indeed fix #21697. LGTM.
Shouldn't the use of _wsetlocale() prevent the problem you describe in the second commit?
I'm lost in the macros, but I thought on Windows we always treated the locale name as UTF-8 (via S_wrap_wsetlocale()) and ignored the current code page.
I don't know if @tonycoz meant to imply that this is a bug, and the commit is just a work around for it. But it got me to thinking that this is the case.
So I used gdb to examine what's going on, and it appears to be in _wsetlocale() or perhaps our misunderstanding of it; the documentation may or may not be defective.
In locale.c, S_wrap_wsetlocale() is getting a UTF-8 encoded string containing the sequence \xC3\xBC. This is U+FC, an umlauted u, which replaces the ASCII u in "Turkiye". A function is called to turn that into a wchar_t string. That string looks like
T 00 u 00 r 00 k 00 i 00 s 00 h 00 _ 00 T 00 FC 00 r 00 k 00 i 00 y 00 e 00 . 00 1 00 2 00 5 00 4 00
_wsetlocale sets errno to 42 and returns NULL. The MS errno page says 42 means
EILSEQ Illegal sequence of bytes (for example, in an MBCS string).
I don't understand why.
I don't know if @tonycoz meant to imply that this is a bug, and the commit is just a work around for it. But it got me to thinking that this is the case.
That's what I was thinking, whether that's a Perl or _wsetlocale() bug I don't know.
In locale.c, S_wrap_wsetlocale() is getting a UTF-8 encoded string containing the sequence \xC3\xBC. This is U+FC, an umlauted u, which replaces the ASCII u in "Turkiye". A function is called to turn that into a wchar_t string. That string looks like
T 00 u 00 r 00 k 00 i 00 s 00 h 00 _ 00 T 00 FC 00 r 00 k 00 i 00 y 00 e 00 . 00 1 00 2 00 5 00 4 00
_wsetlocale sets errno to 42 and returns NULL. The MS errno page says 42 means
EILSEQ Illegal sequence of bytes (for example, in an MBCS string).
I don't understand why.
setlocale()/_wsetlocale() are documented to not set errno when the locale name string is invalid, so I suspect the EILSEQ is noise, since my own testing below showed errno=0 for failures that appear to be on the locale name string.
I tried the following:
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
int main() {
    //wchar_t *p =_wsetlocale(LC_ALL, L"Turkish_T\xFCrkiye.1254");
    //wchar_t *p =_wsetlocale(LC_ALL, L".ACP");
    //wchar_t *p =_wsetlocale(LC_ALL, L"Turkish_Turkiye.1254");
    wchar_t *p =_wsetlocale(LC_ALL, L"T\xFCrkish_T\xFCrkiye.1254");
    if (!p) {
        perror("wsetlocale");
        exit(1);
    }
    _putws(p);
}
Except for the .ACP case, they all failed with "wsetlocale: No error" (tested with MSVC 2022, gcc-msvcrt 13.2.1).
I used godbolt.org to check whether the generated wide strings were what I expected (a UTF-8 encoded source file with a literal ü did not produce good results.)
I don't think that locale is supported on MSVC. Instead, I tested your program on MinGW, modified as follows:
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
int main() {
    wchar_t *q =_wsetlocale(LC_ALL, L"Turkish_T\xFCrkiye.1254");
    if (!q) {
        perror("wsetlocale");
        exit(1);
    }
    _putws(q);
    wchar_t *r =_wsetlocale(LC_ALL, L"Thai_Thailand.874");
    if (!r) {
        perror("wsetlocale");
        exit(1);
    }
    _putws(r);
    wchar_t *p =_wsetlocale(LC_ALL, L"Turkish_T\xFCrkiye.1254");
    if (!p) {
        perror("wsetlocale");
        exit(1);
    }
    _putws(p);
}
And got the following:
Turkish_Türkiye.1254
Thai_Thailand.874
wsetlocale: Illegal byte sequence
So the bug is in _wsetlocale, and the question becomes what we are going to do about it. The bug is there, and all the remaining patch does is work around it in our test file. Some Perl program may come along, innocently switch from Thai to Turkish, and hit this bug. Note that this is a MinGW not built with UCRT. This configuration should be going away soon.
Does the proposed removal of the %setlocale_failed part of the condition weaken the tests on the non-problematic systems?
If so, the %setlocale_failed could instead be replaced with (%setlocale_failed && $Config{libc} ne '-lmsvcrt').
Similarly, I take it that the newly inserted setlocale(&POSIX::LC_ALL, "C"); call really only needs to be made if $Config{libc} eq '-lmsvcrt'.
I don't know if there's anything to be gained by introducing any of that extra clutter.
I ran @khwilliamson's C script on various 32-bit and 64-bit mingw-w64 compilers, and also on 32-bit and 64-bit VS 2022.
The only time I saw the wsetlocale: Illegal byte sequence complaint was on the MSVCRT mingw-w64 compilers.
Otherwise, it performed correctly.
Frustratingly I get different results:
C:\Users\Tony\dev\perl\git>type wsetlocale2.c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
int main() {
    wchar_t *q =_wsetlocale(LC_ALL, L"Turkish_T\xFCrkiye.1254");
    if (!q) {
        perror("wsetlocale");
        exit(1);
    }
    _putws(q);
    wchar_t *r =_wsetlocale(LC_ALL, L"Thai_Thailand.874");
    if (!r) {
        perror("wsetlocale");
        exit(1);
    }
    _putws(r);
    wchar_t *p =_wsetlocale(LC_ALL, L"Turkish_T\xFCrkiye.1254");
    if (!p) {
        perror("wsetlocale");
        exit(1);
    }
    _putws(p);
}
C:\Users\Tony\dev\perl\git>gcc --version | find "gcc"
gcc (MinGW-W64 x86_64-msvcrt-posix-seh, built by Brecht Sanders) 13.1.0
C:\Users\Tony\dev\perl\git>gcc -owsetlocale2.exe wsetlocale2.c
C:\Users\Tony\dev\perl\git>.\wsetlocale2
wsetlocale: No error
C:\Users\Tony\dev\perl\git>ver
Microsoft Windows [Version 10.0.19045.3693]
dumpbin /imports shows the executable is linked to msvcrt.dll.
I was hoping we could setlocale(LC_ALL, "C"); to get consistent encoding handling there, but it's not working for me by default.
Frustratingly I get different results
For me, the Windows version is Microsoft Windows [Version 10.0.22621.2861], and that's the only difference I can see between our 2 environments.
I inserted printf("%s\n%x\n", __MINGW64_VERSION_STR, _WIN32_WINNT); into the script, in case either of those might be of some relevance.
It showed that _WIN32_WINNT is set to 0x0601, and the mingw runtime is 11.0.0.
@khwilliamson @sisyphus @tonycoz : Can we get an update on the status of this ticket?
https://github.com/Perl/perl5/issues/21697 -- which is referred to in this ticket's Subject line and commit message -- was merged Dec 7 2023, but the discussion in this ticket carried onward. If this ticket is not yet closable, we should at least provide a better subject line for it. Thanks.
Having heard no reason to keep this ticket open, I'm closing it now. Please open a new ticket if you have further problems with this test file.
Thank you very much.