perl5 icon indicating copy to clipboard operation
perl5 copied to clipboard

When embedding Perl in C, the locale is switched to C/ASCII

Open trygveaa opened this issue 2 years ago • 19 comments
trafficstars

Description

When embedding Perl in a C program in order to run Perl scripts in it, the locale is set to C/ANSI_X3.4-1968 (ASCII). This causes issues with using non-ascii characters.

This is a regression introduced in commit 7af2d2037375d58e700f9e1b217efb2c4db66133. Before this commit, the chosen locale was kept.

Steps to Reproduce

This script is one of the examples of embedding Perl taken from https://perldoc.perl.org/perlembed. The only changes are that it sets the locale first, and prints the current charset before and after loading Perl.

#include <EXTERN.h>
#include <locale.h>
#include <langinfo.h>
#include <perl.h>

static PerlInterpreter *my_perl;

int main(int argc, char **argv, char **env) {
  char *current_charset = NULL;
  setlocale(LC_ALL, "");

  current_charset = strdup (nl_langinfo (CODESET));
  printf("charset before loading perl: %s\n", current_charset);

  PERL_SYS_INIT3(&argc, &argv, &env);
  my_perl = perl_alloc();
  perl_construct(my_perl);
  PL_exit_flags |= PERL_EXIT_DESTRUCT_END;
  perl_parse(my_perl, NULL, argc, argv, (char **)NULL);
  perl_run(my_perl);

  current_charset = strdup (nl_langinfo (CODESET));
  printf("charset after loading perl: %s\n", current_charset);

  perl_destruct(my_perl);
  perl_free(my_perl);
  PERL_SYS_TERM();
  exit(EXIT_SUCCESS);
}

When ran it prints:

charset before loading perl: UTF-8
charset after loading perl: ANSI_X3.4-1968

If run with Perl before commit 7af2d2037375d58e700f9e1b217efb2c4db66133, it prints UTF-8 in both lines.

After running PERL_SYS_TERM() the locale is back to UTF-8 again, but the documentation says that it should only be called once after freeing the last interpreter. My use case (Perl scripts for extending functionality in WeeChat) is having long running Perl scripts that often are kept running for the whole lifetime of the application, so this doesn't help.

Expected behavior

That the locale is preserved after loading Perl and running Perl code.

Perl configuration

Summary of my perl5 (revision 5 version 38 subversion 0) configuration:
   
  Platform:
    osname=linux
    osvers=5.12.15-arch1-1
    archname=x86_64-linux-thread-multi
    uname='archlinux'
    config_args='-des -Dusethreads -Duseshrplib -Doptimize=-march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions         -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security         -fstack-clash-protection -fcf-protection -g -ffile-prefix-map=/build/perl/src=/usr/src/debug/perl -flto=auto -Dprefix=/usr -Dvendorprefix=/usr -Dprivlib=/usr/share/perl5/core_perl -Darchlib=/usr/lib/perl5/5.38/core_perl -Dsitelib=/usr/share/perl5/site_perl -Dsitearch=/usr/lib/perl5/5.38/site_perl -Dvendorlib=/usr/share/perl5/vendor_perl -Dvendorarch=/usr/lib/perl5/5.38/vendor_perl -Dscriptdir=/usr/bin/core_perl -Dsitescript=/usr/bin/site_perl -Dvendorscript=/usr/bin/vendor_perl -Dinc_version_list=none -Dman1ext=1perl -Dman3ext=3perl -Dlddlflags=-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -flto=auto -Dldflags=-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -flto=auto'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
  Compiler:
    cc='cc'
    ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    optimize='-march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -g -ffile-prefix-map=/build/perl/src=/usr/src/debug/perl -flto=auto'
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='13.1.1 20230714'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='cc'
    ldflags ='-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -flto=auto -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib
    libs=-lpthread -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -ldl -lm -lcrypt -lutil -lc
    libc=/lib/../lib/libc.so.6
    so=so
    useshrplib=true
    libperl=libperl.so
    gnulibc_version='2.37'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.38/core_perl/CORE'
    cccdlflags='-fPIC'
    lddlflags='-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -flto=auto -L/usr/local/lib -fstack-protector-strong'


Characteristics of this binary (from libperl): 
  Compile-time options:
    HAS_LONG_DOUBLE
    HAS_STRTOLD
    HAS_TIMES
    MULTIPLICITY
    PERLIO_LAYERS
    PERL_COPY_ON_WRITE
    PERL_DONT_CREATE_GVSV
    PERL_HASH_FUNC_SIPHASH13
    PERL_HASH_USE_SBOX32
    PERL_MALLOC_WRAP
    PERL_OP_PARENT
    PERL_PRESERVE_IVUV
    PERL_USE_SAFE_PUTENV
    USE_64_BIT_ALL
    USE_64_BIT_INT
    USE_ITHREADS
    USE_LARGE_FILES
    USE_LOCALE
    USE_LOCALE_COLLATE
    USE_LOCALE_CTYPE
    USE_LOCALE_NUMERIC
    USE_LOCALE_TIME
    USE_PERLIO
    USE_PERL_ATOF
    USE_REENTRANT_API
    USE_THREAD_SAFE_LOCALE
  Built under linux
  Compiled at Jul 24 2023 22:17:30
  @INC:
    /usr/lib/perl5/5.38/site_perl
    /usr/share/perl5/site_perl
    /usr/lib/perl5/5.38/vendor_perl
    /usr/share/perl5/vendor_perl
    /usr/lib/perl5/5.38/core_perl
    /usr/share/perl5/core_perl

trygveaa avatar Aug 11 '23 22:08 trygveaa

There have been significant fixes to the locale initialization code recently. I was hoping that they would fix this issue, and indeed in trying it out on blead, I get:

charset before loading perl: UTF-8 charset after loading perl: UTF-8

Could you check if it is now fixed for you?

khwilliamson avatar Sep 06 '23 00:09 khwilliamson

@trygveaa could you try blead on this problem to verify that it has been fixed or not?

khwilliamson avatar Sep 09 '23 17:09 khwilliamson

Thanks! Yes, I checked now, and the issue is indeed resolved on blead.

Will there be a patch release on 5.38 for this, or will it only be fixed in the next standard release?

trygveaa avatar Sep 10 '23 22:09 trygveaa

Thanks! Yes, I checked now, and the issue is indeed resolved on blead.

Will there be a patch release on 5.38 for this, or will it only be fixed in the next standard release?

In order for the corrections to appear in maintenance release perl-5.38.1, we have to be able to identify the commit(s) made since perl-5.38.0 that corrected the problem. In other words, what commit(s) undo the harmful effects of https://github.com/Perl/perl5/commit/7af2d2037375d58e700f9e1b217efb2c4db66133 in the last production cycle without undoing that commit's benefits. A reverse bisection, in effect.

Given that the defect shows up, not in just a regular Perl program, but when you embed Perl in a C program, this is non-trivial. (It's above my own pay grade.) @trygveaa, would it be possible for you to identify the monthly development release at which this problem cleared up for you? (E.g., v5.39.1 on July 20 or v5.39.2 on August 20).

@khwilliamson and @steve-m-hay, advice sought. Thanks.

jkeenan avatar Sep 15 '23 17:09 jkeenan

@jkeenan: I bisected it and found that the commit that fixes the issue is bf38d1cf744fcc49b715b9d633761aa67436c002.

trygveaa avatar Sep 16 '23 09:09 trygveaa

@jkeenan: I bisected it and found that the commit that fixes the issue is bf38d1c.

In the Perl 5 repository there is a branch called maint-votes in which committers propose commits for back-porting to maintenance releases. (The next such maint release would be perl-5.38.1.) In commit bfaeb305cb28e3bb1fa3b299fc2f7bfa42c84a05 in that branch, I have requested that bf38d1c be included in that maintenance release.

@steve-m-hay and @tonycoz, please double-check that commit, as I have not often touched the maint-votes branch.

jkeenan avatar Sep 16 '23 12:09 jkeenan

Thanks!

trygveaa avatar Sep 16 '23 13:09 trygveaa

please double-check that commit, as I have not often touched the maint-votes branch.

The maint-votes commit looks fine.

tonycoz avatar Sep 16 '23 23:09 tonycoz

I believe this issue is currently breaking (the whole of, including C parts) irssi on fedora 39, first reported in https://github.com/irssi/scripts.irssi.org/issues/857 (basically the same use case as @trygveaa )

ailin-nemui avatar Sep 21 '23 10:09 ailin-nemui

after discussion with @khwilliamson reverting https://github.com/Perl/perl5/commit/7af2d2037375d58e700f9e1b217efb2c4db66133 on top of 5.38.0 fixes this for me

ailin-nemui avatar Sep 24 '23 12:09 ailin-nemui

after discussion with @khwilliamson reverting 7af2d20 on top of 5.38.0 fixes this for me

FYI we did that revert in Debian, but it caused another regression as seen in #22195 .

@khwilliamson a better maint-5.38 fix for this would be very much appreciated while you are considering things for 5.38.3 :)

ntyni avatar May 20 '24 18:05 ntyni

I believe I have a solution to this problem that can be tried immediately, at least for glibc. I restored 7af2d20. I then compiled and ran the reproducing program that @trygveaa furnished (thank you very much). It failed, as expected. I then added the following to my normal Configure options:

-Accflags=-DUSE_NL_LOCALE_NAME

I then recompiled perl, and the test program, using the new perl library. And this time it passed.

This means this bug can be fixed by restoring a deleted patch and reconfiguring perl with a different option.

A root cause of this problem is a defect in the POSIX standard. The thread-safe locale functionality added in 2008 is inordinately clumsy and, importantly, is lacking the ability to know what the current locale is. Any linguist would tell you that that would not fly. My guess is that because the locale names originally in C were supposed to be opaque and mostly didn't up so, that they would just enforce that by not giving them names. But I don't really know if that's the reason. But I do know that they eventually realized it is a defect and the next POSIX standard will specify a function to find out this information. But in the many years in the meantime, implementers were forced to come up with something or other. FreeBSD implemented querlylocale() (but it's been buggy and not well thought out). glibc had to have for internal use a way to get the name. Again, it's my speculation, but I guess that they decided that since the Standard didn't have this, they shouldn't expose it, so they didn't document it, and hid it away in the nl_langinfo() function. At some point I discovered it, but wasn't confident to make using it the default, so that's what that Configure option does is to say use this function. It #ifdefs out the kludgy code I put in to handle the cases where there really is no way to get the locale name. That code is what is broken here. And instead uses the undocumented glibc feature. We enabled that feature by default early in the process of developing 5.40, and got no reported bugs in it, so 5.40 was shipped with it enabled.

Now, this may still be a problem for non-glibc platforms that don't have some querylocale-type functionality.

I apologize for not realizing this earlier. It looked like it would require a patch, so I waited until we actually were working on a 5.38 maintenance release

khwilliamson avatar Sep 30 '24 21:09 khwilliamson

Restoring 7af2d20 likely means that #22195 automatically gets fixed.

khwilliamson avatar Sep 30 '24 21:09 khwilliamson

My initial tests seem to confirm that adding -DUSE_NL_LOCALE_NAME is the better fix we were looking for, and that together with restoring 7af2d20 both known issues go away. @khwilliamson many thanks!

I'm not sure yet how wide testing I can manage inside Debian, as we have things just about ready for updating to 5.40 soon. We haven't made a stable release of Debian with Perl 5.38, so after the switch to 5.40 this issue will not affect our users anymore. I do have some regrets about how I handled this bug on the Debian side, so if there's still time to update and test the 5.38 Perl package inside Debian I'll try to make use of it and report back here.

However, as noted in #22195 Ubuntu did make a stable release with Perl 5.38 and the "bad fix" (=simple revert of 7af2d20) and their users will stay affected. I don't work on Ubuntu myself and I don't have a say about what they fix in their stable release. They just pull packages from Debian during their development cycle.

ntyni avatar Oct 01 '24 21:10 ntyni

Unfortunately it turns out that -DUSE_NL_LOCALE_NAME changes the binary interface.

loadable library and perl binaries are mismatched (got first handshake key 0xf380080, needed 0xed00080)

The PerlInterpreter structure shrank from 3896 to 3792, dropping at least the cur_LC_ALL member.

This means that we can't adopt this for the Debian Perl 5.38 packages as-is without rebuilding all the XS module packages the same way we do with major Perl version transitions. That's not going to happen at this point, we're moving to 5.40 instead. I also doubt Ubuntu can/will use this fix for their stable release.

(I didn't look at how much patching it would require to keep the structs stable.)

ntyni avatar Oct 04 '24 19:10 ntyni

I will experiment with making this easier for 5.38.

khwilliamson avatar Oct 04 '24 21:10 khwilliamson

Try this patch PL_curlocales.txt

khwilliamson avatar Oct 05 '24 16:10 khwilliamson

Thanks for the PL_curlocales patch. I verified that it fixes the ABI compatibility issue.

Unfortunately I didn't manage to upload this to Debian before the time window for 5.40 opened. We've just started the 5.40 transition now. So I can't test the fix "for real" anymore. Sorry that didn't work out.

To summarize, to my limited tests indicated that adding -DUSE_NL_LOCALE_NAME to ccflags with the ABI compat patch is a valid solution for this issue.

Thanks again for your work.

ntyni avatar Oct 14 '24 18:10 ntyni

@steve-m-hay How should I proceed with this for 5.38 maintenance? I'm thinking I should formalize the patch at https://github.com/user-attachments/files/17266836/PL_curlocales.txt and add a bit that changes the default in Configure to -DUSE_NL_LOCALE_NAME

khwilliamson avatar Oct 15 '24 14:10 khwilliamson

Apologies for not replying to this sooner. I see that you formalized your patch in PR#22701, but that now appears to be superseded by PR#22792? The main commit in that PR (ecd3fa8) looks reasonable for backporting to 5.38, but is there a reason why jkeenan's original proposal to backport bf38d1c from blead wouldn't work? I had previously not been keen on that since I mistakenly thought that backporting that in Debian was the cause of GH#22195, but upon re-reading the thread above I now see that is not what happened: It was actually their reversal of 7af2d20 which caused that problem.

steve-m-hay avatar Dec 30 '24 13:12 steve-m-hay

Just a note that I lightly tested perl-5.38.3-RC1 in a Ubuntu 24.10 chroot, and I can confirm that this issue looks fixed now without introducing binary incompatibilities or other grief. Many thanks @khwilliamson and @steve-m-hay for finding a clean solution!

ntyni avatar Jan 07 '25 21:01 ntyni

This is fixed with the release of 5.38.3

khwilliamson avatar Jan 19 '25 12:01 khwilliamson