perl5
perl5 copied to clipboard
When embedding Perl in C, the locale is switched to C/ASCII
Description
When embedding Perl in a C program in order to run Perl scripts in it, the locale is set to C/ANSI_X3.4-1968 (ASCII). This causes issues with using non-ascii characters.
This is a regression introduced in commit 7af2d2037375d58e700f9e1b217efb2c4db66133. Before this commit, the chosen locale was kept.
Steps to Reproduce
This script is one of the examples of embedding Perl taken from https://perldoc.perl.org/perlembed. The only changes are that it sets the locale first, and prints the current charset before and after loading Perl.
#include <EXTERN.h>
#include <locale.h>
#include <langinfo.h>
#include <perl.h>
static PerlInterpreter *my_perl;
int main(int argc, char **argv, char **env) {
char *current_charset = NULL;
setlocale(LC_ALL, "");
current_charset = strdup (nl_langinfo (CODESET));
printf("charset before loading perl: %s\n", current_charset);
PERL_SYS_INIT3(&argc, &argv, &env);
my_perl = perl_alloc();
perl_construct(my_perl);
PL_exit_flags |= PERL_EXIT_DESTRUCT_END;
perl_parse(my_perl, NULL, argc, argv, (char **)NULL);
perl_run(my_perl);
current_charset = strdup (nl_langinfo (CODESET));
printf("charset after loading perl: %s\n", current_charset);
perl_destruct(my_perl);
perl_free(my_perl);
PERL_SYS_TERM();
exit(EXIT_SUCCESS);
}
When ran it prints:
charset before loading perl: UTF-8
charset after loading perl: ANSI_X3.4-1968
If run with Perl before commit 7af2d2037375d58e700f9e1b217efb2c4db66133, it prints UTF-8 in both lines.
After running PERL_SYS_TERM() the locale is back to UTF-8 again, but the documentation says that it should only be called once after freeing the last interpreter. My use case (Perl scripts for extending functionality in WeeChat) is having long running Perl scripts that often are kept running for the whole lifetime of the application, so this doesn't help.
Expected behavior
That the locale is preserved after loading Perl and running Perl code.
Perl configuration
Summary of my perl5 (revision 5 version 38 subversion 0) configuration:
Platform:
osname=linux
osvers=5.12.15-arch1-1
archname=x86_64-linux-thread-multi
uname='archlinux'
config_args='-des -Dusethreads -Duseshrplib -Doptimize=-march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -g -ffile-prefix-map=/build/perl/src=/usr/src/debug/perl -flto=auto -Dprefix=/usr -Dvendorprefix=/usr -Dprivlib=/usr/share/perl5/core_perl -Darchlib=/usr/lib/perl5/5.38/core_perl -Dsitelib=/usr/share/perl5/site_perl -Dsitearch=/usr/lib/perl5/5.38/site_perl -Dvendorlib=/usr/share/perl5/vendor_perl -Dvendorarch=/usr/lib/perl5/5.38/vendor_perl -Dscriptdir=/usr/bin/core_perl -Dsitescript=/usr/bin/site_perl -Dvendorscript=/usr/bin/vendor_perl -Dinc_version_list=none -Dman1ext=1perl -Dman3ext=3perl -Dlddlflags=-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -flto=auto -Dldflags=-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -flto=auto'
hint=recommended
useposix=true
d_sigaction=define
useithreads=define
usemultiplicity=define
use64bitint=define
use64bitall=define
uselongdouble=undef
usemymalloc=n
default_inc_excludes_dot=define
Compiler:
cc='cc'
ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
optimize='-march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -g -ffile-prefix-map=/build/perl/src=/usr/src/debug/perl -flto=auto'
cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
ccversion=''
gccversion='13.1.1 20230714'
gccosandvers=''
intsize=4
longsize=8
ptrsize=8
doublesize=8
byteorder=12345678
doublekind=3
d_longlong=define
longlongsize=8
d_longdbl=define
longdblsize=16
longdblkind=3
ivtype='long'
ivsize=8
nvtype='double'
nvsize=8
Off_t='off_t'
lseeksize=8
alignbytes=8
prototype=define
Linker and Libraries:
ld='cc'
ldflags ='-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -flto=auto -fstack-protector-strong -L/usr/local/lib'
libpth=/usr/local/lib /usr/lib
libs=-lpthread -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
perllibs=-lpthread -ldl -lm -lcrypt -lutil -lc
libc=/lib/../lib/libc.so.6
so=so
useshrplib=true
libperl=libperl.so
gnulibc_version='2.37'
Dynamic Linking:
dlsrc=dl_dlopen.xs
dlext=so
d_dlsymun=undef
ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.38/core_perl/CORE'
cccdlflags='-fPIC'
lddlflags='-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -flto=auto -L/usr/local/lib -fstack-protector-strong'
Characteristics of this binary (from libperl):
Compile-time options:
HAS_LONG_DOUBLE
HAS_STRTOLD
HAS_TIMES
MULTIPLICITY
PERLIO_LAYERS
PERL_COPY_ON_WRITE
PERL_DONT_CREATE_GVSV
PERL_HASH_FUNC_SIPHASH13
PERL_HASH_USE_SBOX32
PERL_MALLOC_WRAP
PERL_OP_PARENT
PERL_PRESERVE_IVUV
PERL_USE_SAFE_PUTENV
USE_64_BIT_ALL
USE_64_BIT_INT
USE_ITHREADS
USE_LARGE_FILES
USE_LOCALE
USE_LOCALE_COLLATE
USE_LOCALE_CTYPE
USE_LOCALE_NUMERIC
USE_LOCALE_TIME
USE_PERLIO
USE_PERL_ATOF
USE_REENTRANT_API
USE_THREAD_SAFE_LOCALE
Built under linux
Compiled at Jul 24 2023 22:17:30
@INC:
/usr/lib/perl5/5.38/site_perl
/usr/share/perl5/site_perl
/usr/lib/perl5/5.38/vendor_perl
/usr/share/perl5/vendor_perl
/usr/lib/perl5/5.38/core_perl
/usr/share/perl5/core_perl
There have been significant fixes to the locale initialization code recently. I was hoping that they would fix this issue, and indeed in trying it out on blead, I get:
charset before loading perl: UTF-8 charset after loading perl: UTF-8
Could you check if it is now fixed for you?
@trygveaa could you try blead on this problem to verify that it has been fixed or not?
Thanks! Yes, I checked now, and the issue is indeed resolved on blead.
Will there be a patch release on 5.38 for this, or will it only be fixed in the next standard release?
Thanks! Yes, I checked now, and the issue is indeed resolved on blead.
Will there be a patch release on 5.38 for this, or will it only be fixed in the next standard release?
In order for the corrections to appear in maintenance release perl-5.38.1, we have to be able to identify the commit(s) made since perl-5.38.0 that corrected the problem. In other words, what commit(s) undo the harmful effects of https://github.com/Perl/perl5/commit/7af2d2037375d58e700f9e1b217efb2c4db66133 in the last production cycle without undoing that commit's benefits. A reverse bisection, in effect.
Given that the defect shows up, not in just a regular Perl program, but when you embed Perl in a C program, this is non-trivial. (It's above my own pay grade.) @trygveaa, would it be possible for you to identify the monthly development release at which this problem cleared up for you? (E.g., v5.39.1 on July 20 or v5.39.2 on August 20).
@khwilliamson and @steve-m-hay, advice sought. Thanks.
@jkeenan: I bisected it and found that the commit that fixes the issue is bf38d1cf744fcc49b715b9d633761aa67436c002.
@jkeenan: I bisected it and found that the commit that fixes the issue is bf38d1c.
In the Perl 5 repository there is a branch called maint-votes in which committers propose commits for back-porting to maintenance releases. (The next such maint release would be perl-5.38.1.) In commit bfaeb305cb28e3bb1fa3b299fc2f7bfa42c84a05 in that branch, I have requested that bf38d1c be included in that maintenance release.
@steve-m-hay and @tonycoz, please double-check that commit, as I have not often touched the maint-votes branch.
Thanks!
please double-check that commit, as I have not often touched the maint-votes branch.
The maint-votes commit looks fine.
I believe this issue is currently breaking (the whole of, including C parts) irssi on fedora 39, first reported in https://github.com/irssi/scripts.irssi.org/issues/857 (basically the same use case as @trygveaa )
after discussion with @khwilliamson reverting https://github.com/Perl/perl5/commit/7af2d2037375d58e700f9e1b217efb2c4db66133 on top of 5.38.0 fixes this for me
after discussion with @khwilliamson reverting 7af2d20 on top of 5.38.0 fixes this for me
FYI we did that revert in Debian, but it caused another regression as seen in #22195 .
@khwilliamson a better maint-5.38 fix for this would be very much appreciated while you are considering things for 5.38.3 :)
I believe I have a solution to this problem that can be tried immediately, at least for glibc. I restored 7af2d20. I then compiled and ran the reproducing program that @trygveaa furnished (thank you very much). It failed, as expected. I then added the following to my normal Configure options:
-Accflags=-DUSE_NL_LOCALE_NAME
I then recompiled perl, and the test program, using the new perl library. And this time it passed.
This means this bug can be fixed by restoring a deleted patch and reconfiguring perl with a different option.
A root cause of this problem is a defect in the POSIX standard. The thread-safe locale functionality added in 2008 is inordinately clumsy and, importantly, is lacking the ability to know what the current locale is. Any linguist would tell you that that would not fly. My guess is that because the locale names originally in C were supposed to be opaque and mostly didn't up so, that they would just enforce that by not giving them names. But I don't really know if that's the reason. But I do know that they eventually realized it is a defect and the next POSIX standard will specify a function to find out this information. But in the many years in the meantime, implementers were forced to come up with something or other. FreeBSD implemented querlylocale() (but it's been buggy and not well thought out). glibc had to have for internal use a way to get the name. Again, it's my speculation, but I guess that they decided that since the Standard didn't have this, they shouldn't expose it, so they didn't document it, and hid it away in the nl_langinfo() function. At some point I discovered it, but wasn't confident to make using it the default, so that's what that Configure option does is to say use this function. It #ifdefs out the kludgy code I put in to handle the cases where there really is no way to get the locale name. That code is what is broken here. And instead uses the undocumented glibc feature. We enabled that feature by default early in the process of developing 5.40, and got no reported bugs in it, so 5.40 was shipped with it enabled.
Now, this may still be a problem for non-glibc platforms that don't have some querylocale-type functionality.
I apologize for not realizing this earlier. It looked like it would require a patch, so I waited until we actually were working on a 5.38 maintenance release
Restoring 7af2d20 likely means that #22195 automatically gets fixed.
My initial tests seem to confirm that adding -DUSE_NL_LOCALE_NAME is the better fix we were looking for, and that together with restoring 7af2d20 both known issues go away. @khwilliamson many thanks!
I'm not sure yet how wide testing I can manage inside Debian, as we have things just about ready for updating to 5.40 soon. We haven't made a stable release of Debian with Perl 5.38, so after the switch to 5.40 this issue will not affect our users anymore. I do have some regrets about how I handled this bug on the Debian side, so if there's still time to update and test the 5.38 Perl package inside Debian I'll try to make use of it and report back here.
However, as noted in #22195 Ubuntu did make a stable release with Perl 5.38 and the "bad fix" (=simple revert of 7af2d20) and their users will stay affected. I don't work on Ubuntu myself and I don't have a say about what they fix in their stable release. They just pull packages from Debian during their development cycle.
Unfortunately it turns out that -DUSE_NL_LOCALE_NAME changes the binary interface.
loadable library and perl binaries are mismatched (got first handshake key 0xf380080, needed 0xed00080)
The PerlInterpreter structure shrank from 3896 to 3792, dropping at least the cur_LC_ALL member.
This means that we can't adopt this for the Debian Perl 5.38 packages as-is without rebuilding all the XS module packages the same way we do with major Perl version transitions. That's not going to happen at this point, we're moving to 5.40 instead. I also doubt Ubuntu can/will use this fix for their stable release.
(I didn't look at how much patching it would require to keep the structs stable.)
I will experiment with making this easier for 5.38.
Try this patch PL_curlocales.txt
Thanks for the PL_curlocales patch. I verified that it fixes the ABI compatibility issue.
Unfortunately I didn't manage to upload this to Debian before the time window for 5.40 opened. We've just started the 5.40 transition now. So I can't test the fix "for real" anymore. Sorry that didn't work out.
To summarize, to my limited tests indicated that adding -DUSE_NL_LOCALE_NAME to ccflags with the ABI compat patch is a valid solution for this issue.
Thanks again for your work.
@steve-m-hay How should I proceed with this for 5.38 maintenance? I'm thinking I should formalize the patch at https://github.com/user-attachments/files/17266836/PL_curlocales.txt and add a bit that changes the default in Configure to -DUSE_NL_LOCALE_NAME
Apologies for not replying to this sooner. I see that you formalized your patch in PR#22701, but that now appears to be superseded by PR#22792? The main commit in that PR (ecd3fa8) looks reasonable for backporting to 5.38, but is there a reason why jkeenan's original proposal to backport bf38d1c from blead wouldn't work? I had previously not been keen on that since I mistakenly thought that backporting that in Debian was the cause of GH#22195, but upon re-reading the thread above I now see that is not what happened: It was actually their reversal of 7af2d20 which caused that problem.
Just a note that I lightly tested perl-5.38.3-RC1 in a Ubuntu 24.10 chroot, and I can confirm that this issue looks fixed now without introducing binary incompatibilities or other grief. Many thanks @khwilliamson and @steve-m-hay for finding a clean solution!
This is fixed with the release of 5.38.3