Perl-Dist-Strawberry
Set SP's Active Code Page to UTF-8
It's currently impossible to pass ♠ as an argument to Perl on Windows. Similarly, it's currently impossible to perform file operations on arbitrarily-named files using builtins (e.g. open, stat, etc.). This can be fixed for people with the May 2019 update of Windows.
First some background. Every Windows system call that deals with strings comes in two varieties: An "A"NSI version that uses the Active Code Page (aka ANSI Code Page), and a "W"ide version that uses UTF-16le.[1] Perl uses the A version of all system calls. That includes the call to get the command line, and all the builtin functions such as open and stat.
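To make the limitation concrete, here is a minimal sketch (using the core Encode module, and assuming an ACP of 1252 as on my system) of why ♠ cannot survive a trip through an A call: the character has no byte representation in cp1252, while it does in UTF-8. The helper name bytes_in is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+2660 BLACK SPADE SUIT, the character from the example above.
my $spade = "\x{2660}";

# Hypothetical helper: returns the encoded bytes as an uppercase hex
# string, or undef if the string cannot be represented in $encoding.
sub bytes_in {
    my ($encoding, $string) = @_;
    my $bytes = eval {
        # FB_CROAK makes encode() die on unmappable characters;
        # LEAVE_SRC keeps it from modifying $string in place.
        encode($encoding, $string,
               Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $bytes ? uc(unpack('H*', $bytes)) : undef;
}

for my $encoding ('cp1252', 'UTF-8') {
    my $hex = bytes_in($encoding, $spade);
    printf "%-6s %s\n", $encoding,
        defined $hex ? $hex : '(not representable)';
}
```

With an ACP of 1252 there is simply no byte sequence the A calls could hand to Perl for this character, which is why the argument and the file name get mangled before Perl ever sees them.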
The ACP is hard-coded. (Or maybe Windows asks for the system language during setup and bases it on that? I can't remember.) For example, it's 1252 on my system, and there's nothing I can do to change that. Notably, chcp has no effect on the ACP.
At least, that was the case until recently. The May 2019 update to Windows added the ability to change the ACP on a per-application basis via its manifest.
I would like to suggest setting the ACP of Strawberry Perl to 65001 (UTF-8) via its manifest as described in the link above.
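For reference, a rough sketch of what such a manifest could look like, written as a small Perl script that prints it. The activeCodePage element and its namespace are taken from Microsoft's documentation on UTF-8 code pages; the assemblyIdentity name and version are placeholders I made up, and perl.exe already carries a manifest of its own, so a real change would presumably have to be merged into that one rather than replace it.

```perl
use strict;
use warnings;

# Hypothetical helper: returns manifest XML that asks Windows to use
# UTF-8 as the process's active code page. $name is a placeholder
# identity, not the one perl.exe actually ships with.
sub utf8_manifest {
    my ($name) = @_;
    return <<"XML";
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <assemblyIdentity type="win32" name="$name" version="1.0.0.0"/>
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
XML
}

# Print the manifest so it can be redirected to a file.
print utf8_manifest('Example.Perl');
```

The output would then be embedded into the executable with mt.exe from the Windows SDK. Versions of Windows older than the May 2019 update should simply ignore the activeCodePage setting.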
This change is mostly backwards compatible. Win32::GetACP() and other means of obtaining the code page will continue to return the correct code. Where it previously returned 1252, 1251, or whatever, it will now return 65001. Only code that assumed a system used a specific code page will break.
SetFileApisToOEM can be used to change the encoding used by some A system calls to the OEM CP.
Of note,
- This change would have no effect on STDIN, STDOUT and STDERR (although it would allow use utf8::all; to be used).
- This change would have no effect on programs that only deal with ASCII.
- This change would have no effect on programs that correctly use Win32::GetACP() to determine the encoding of @ARGV and file names passed to builtins.
But,
- This change will have an effect on programs that assume the ACP is a specific value other than 65001.
- This change will have an effect on programs that create or read files encoded using the system ACP.
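To illustrate what "correctly use Win32::GetACP()" might look like, here is a minimal sketch that decodes @ARGV according to whatever the ACP happens to be. The helper names are made up for this example, and the mapping assumes Encode knows the ANSI code page as "cpNNNN" (true for the common 125x pages), with 65001 special-cased to UTF-8. Code written this way should keep working whether the ACP is 1252 or 65001.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: map a Windows code page number to a name that
# Encode understands. 65001 is the UTF-8 code page; other ANSI code
# pages are assumed to be available in Encode as "cpNNNN".
sub acp_to_encoding {
    my ($acp) = @_;
    return $acp == 65001 ? 'UTF-8' : "cp$acp";
}

# Hypothetical helper: Win32::GetACP() only exists on Windows, so
# fall back to UTF-8 when the Win32 module cannot be loaded.
sub argv_encoding {
    my $acp = eval { require Win32; Win32::GetACP() };
    return defined $acp ? acp_to_encoding($acp) : 'UTF-8';
}

my $encoding = argv_encoding();
my @args = map { decode($encoding, $_) } @ARGV;
printf "decoded %d argument(s) as %s\n", scalar @args, $encoding;
```

By contrast, a script that hard-codes decode('cp1252', ...) is the kind that would break once the ACP becomes 65001.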
@steve-m-hay do you think it would be possible to add an option of setting ACP via perl.exe's manifest into perl core (5.34?) or do you see any drawbacks?
@ikegami You say that there's nothing you can do about your ACP being 1252 and that chcp doesn't affect it. Did you realize that the mode command can change it? E.g. On my system the ACP is 1252 and the OEMCP is 850 (both English (UK)), but I can change them for the current Command Prompt like this:
D:\Temp>chcp
Active code page: 850
D:\Temp>mode con cp select=65001
Status for device CON:
----------------------
Lines: 9001
Columns: 80
Keyboard rate: 31
Keyboard delay: 1
Code page: 65001
D:\Temp>chcp
Active code page: 65001
I have made use of this in Perl scripts in the past in order to get a listing of arbitrarily-named files in a directory. With Win32API::File it is also possible to open such files, e.g. try running the attached listunicodefiles.pl in a directory with a file called "I ♥ perl.txt" alongside it. (Attachment: listunicodefiles.pl.txt)
However, this is all a terrible pain, and doesn't help with your initial remark that it's not possible to pass ♥ as an argument. (Or at least, I haven't found a way yet!)
It would be cool if all these things just worked without all this hassle. I haven't tried it yet, but the manifest solution sounds very promising.
@kmx Assuming the idea works as intended then I don't see any harm in making this a build option. The default behaviour would be unchanged, of course, so the only danger of breakage would be for people who have explicitly made use of the option.
but I can change them for the current Command Prompt like this:
You had me excited. Unfortunately, you are mistaken.
>mode con cp select=65001
Status for device CON:
----------------------
Lines: 9000
Columns: 212
Keyboard rate: 31
Keyboard delay: 0
Code page: 65001
>perl -MWin32 -E"say Win32::GetACP()"
1252
>dir /b
♠.txt
>perl -E"$_=qq{\x{2660}.txt}; utf8::encode($_); open(my $fh, '<', $_) or die $!; say 'ok'"
No such file or directory at -e line 1.
>perl -MWin32::LongPath -E"$_=qq{\x{2660}.txt}; openL(\my $fh, '<', $_) or die $!; say 'ok'"
ok
It doesn't change the ACP. It just changes the console's code page.
>perl -MWin32 -E"say Win32::GetConsoleOutputCP()"
65001
With Win32API::File it is also possible to open such files
I'm aware of a multitude of workarounds (reaching the "W" system calls by a variety of means such as Win32::LongPath, Win32::Unicode::File, Win32API::File, FFI::Platypus and Win32::API, XS), but it would be nice if we didn't have to use one!
Changing the ACP would break the scripts that rely on the system 8-bit codepage. If that weren't the case we would've switched perl to unicode APIs a long time ago.
Also, I don't think it's strawberry perl's job to deal with this issue. The perl.exe manifest is maintained upstream, at the perl repository.
Changing the ACP would break the scripts that rely on the system 8-bit codepage.
It would not. It would continue to return file names encoded using the ACP as it currently does. It would break scripts that assume the code page is any specific code page (e.g. 1250 or 1252), but that was already mentioned.
If that weren't the case we would've switched perl to unicode APIs a long time ago.
It only became possible to do this with the May 2019 update to Windows.
Also, I don't think it's strawberry perl's job to deal with this issue. The perl.exe manifest is maintained upstream, at the perl repository.
I see, but I imagine SP may want to provide a binary using the current manifest and one with an updated manifest. Should that not be the case, pushing the change upstream would indeed be in order.
...but I'll open a ticket with Perl too.
It only became possible to do this in May 2019 update to Windows.
The ActiveCodePage manifest property is indeed new, but switching to -W APIs would've accomplished almost the same thing. Almost, because it wouldn't have changed the return value of GetACP().
BTW, ANSI APIs have some downsides, even when ACP is set to 65001. According to the documentation they don't support long paths (although in practice they sometimes work) and some newer APIs are unicode-only.
It only became possible to do this in May 2019 update to Windows.
The ActiveCodePage manifest property is indeed new, but switching to -W APIs would've accomplished almost the same thing.
Not at all.
- This, unlike what I suggest, would break proper code.
- Even if you convert the encoding to UTF-8.
- Even if you return the file name as a decoded string.
Almost, because it wouldn't have changed the return value of GetACP().
And that's key to not breaking things.
BTW, ANSI APIs have some downsides, even when ACP is set to 65001. According to the documentation they don't support long paths
Indeed. I mentioned that in the post that led you to this thread. But it's still a million times better than what we have right now.
(We could use the W versions of syscalls and then convert to the ACP to get said benefits, but that's a later step that strongly benefits from this change.)
(although in practice they sometimes work)
I didn't know that, or what that means.
From the corner where I sit, this is a breaking change.
I maintain a module Win32::SqlServer to permit Perl scripts to talk to SQL Server through OLE DB. I'm about to release a new version, and I have been doing some work to ensure that the module works correctly when the ANSI code page is 65001. However, for this to happen, the following must be true:
- You are connecting to SQL 2019 or later.
- You are using version 18.5 or later of the MSOLEDBSQL provider.
And these restrictions are not due to my module, but they are limitations in the software that I use.
And I suspect that this is not the only XS extension that could run into trouble if StrawberryPerl would be running with UTF-8 as its ANSI code page.
I'd also like to point out that for older versions of Windows where this manifest thing is not understood, the behaviour would be different. That is, a Perl script would behave differently on different OS versions.
If this is to be done, it must be an optional thing. Either by a separate install like there is for USE_64_BIT_INT, or something you can configure.
All this said, I have sympathy for the root problem. I wanted to write a Perl script for doing something with my music collection which includes Latin, Cyrillic, Japanese and Korean characters. I ended up writing C# programs instead.
If changing the code page of perl.exe to UTF-8 risks breaking existing code, here's an alternative suggestion. Would you consider adding a new Perl executable to the distribution that has a manifest which sets the code page to UTF-8? This extra file (perhaps named uperl.exe) would be analogous to wperl.exe and would not affect existing code at all.
It would be a huge benefit for those of us who are distributing apps that run on Windows/Mac/Linux to have a Perl that behaves identically with respect to I/O across those three platforms. And including it with SP would mean that we would not have to find a Windows box with the SDK installed and then run mt.exe manually to add the manifest to perl.exe for every SP update, which is a real hassle for those of us who package our code on an OS other than Windows.