perl5 icon indicating copy to clipboard operation
perl5 copied to clipboard

Perl SEGV

Open ancientwizard opened this issue 6 months ago • 15 comments

Module:

Description Perl SEGV's during SIGCHLD.

BTW I opened #23241 before but that issue has been fixed by a patch and pull request submitted the DBD::Oracle myself. This isn't an Oracle only issue. Apparently and dynamic library that creates threads could find their way into Perl land trying to service an interrupt.

Steps to Reproduce The steps to reproduce are reported within https://github.com/perl5-dbi/DBD-Oracle/issues/192 I have added a test that reproduces the SEGV t/92-segv-fork.t

Expected behavior Perl should not SEGV

Perl configuration

CFLAGS="-fsanitize=address -g -fno-omit-frame-pointer" LDFLAGS="-fsanitize=address" \
./Configure -des -Dusethreads -Duse64bitint -Duse64bitall -Dprefix=/your-install-path \
      -Alddlflags="-shared -g -O0 -L/usr/local/lib -fstack-protector-strong" \
      -Accflags="-g -O0"

ancientwizard avatar May 25 '25 19:05 ancientwizard

Possibly fixed by 85e97066288

tonycoz avatar May 25 '25 23:05 tonycoz

Possibly fixed by 85e9706

So you're suggesting that the Oracle instant client lib starts one or more additional non Perl threads that may leak into Perl land causing mayhem? That wouldn't seem like the typical M.O. of a dynamic library.

ancientwizard avatar May 26 '25 00:05 ancientwizard

It's only a problem when you mix library worker threads with signals.

Your backtrace:

#6  0x00007fffeeff1ec9 in kpucpincrtime () from /usr/lib/oracle/23/client64/lib/libclntsh.so.23.1
#7  0x00007ffff7bb51ca in start_thread (arg=<optimized out>) at pthread_create.c:479
#8  0x00007ffff6e5d8d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

shows this isn't a perl thread.

tonycoz avatar May 26 '25 00:05 tonycoz

It's only a problem when you mix library worker threads with signals.

Your backtrace:

#6  0x00007fffeeff1ec9 in kpucpincrtime () from /usr/lib/oracle/23/client64/lib/libclntsh.so.23.1
#7  0x00007ffff7bb51ca in start_thread (arg=<optimized out>) at pthread_create.c:479
#8  0x00007ffff6e5d8d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

shows this isn't a perl thread.

Thanks for the observation. I thought the GDB output seemed to suggest this wasn't a main thread; but I wasn't exactly ready to believe because I'm not creating any additional Perl threads. I have confirmed that once the Perl has connected and then disconnected from the DB the process does have an additional thread. I can easily see it by inspecting /proc/$$/status. Once created my app is extremely suspectable to SEGV's.

I'm in no position to coerce or convince the Perl maintainers to make change to Perl's SIG handling. ALSO: I have yet to find a way to tell the OCI to A) not create threads B) keep its threads on a leash!

I suppose there is a reason Oracle didn't have a built-in solution to funnel signals to the main thread; starting with they know a whole lot more than I do on the subject.

ancientwizard avatar May 26 '25 18:05 ancientwizard

So does this still happen on blead or not? It's not clear which Perl version you're using.

xenu avatar May 26 '25 19:05 xenu

Possibly fixed by 85e9706

I read somewhere online (gossip), pthreads wants to or did actually deploy to prod a change, that their pthread_t type is now a 16 byte struct and not a 8 byte size_t opaque pointer. I forgot the rational but either it removes 1 level of ptr derefs, or a pthread_t object has no destructor and no memory behind it anymore, its not ref counted, its not malloced, its pass by copy, with a benefit of infinite CPU core count scaling because there is no more user mode global lock for getter/setter methods for pthread metadata/TLS.

Upgrading to 16 byte structs did cause widespread generic common POSIX app breakage, because suddenly == C operator syntax errors.

Google brings up https://forums.oracle.com/ords/apexds/post/issue-with-system-call-with-libclntsh-so-19-1-0673 and SIGCHILD drama.

bulk88 avatar May 26 '25 19:05 bulk88

Thanks for the observation. I thought the GDB output seemed to suggest this wasn't a main thread; but I wasn't exactly ready to believe because I'm not creating any additional Perl threads. I have confirmed that once the Perl has connected and then disconnected from the DB the process does have an additional thread. I can easily see it by inspecting /proc/$$/status. Once created my app is extremely suspectable to SEGV's.

I'm in no position to coerce or convince the Perl maintainers to make change to Perl's SIG handling. ALSO: I have yet to find a way to tell the OCI to A) not create threads B) keep its threads on a leash!

I suppose there is a reason Oracle didn't have a built-in solution to funnel signals to the main thread; starting with they know a whole lot more than I do on the subject.

SP my API names before using them. dTHX;/Perl_get_context();/ENTER/SAVETEMPS/PUSHMARK/Perl_call_sv() executing on a random OS thread, from a OS/LibC thread pool, is always user error. And a very common recurrent mousetrap bug for all Perl CPAN XS code running on Windows. Without detaching a my_perl ptr from 1 OS thread, pause/sleeping/idling the old OS thread, and attach the my_perl ptr to a new OS thread with Perl_set_context(), any other behavior is illegal/invalid.

1 my_perl ptr can be bound to exactly 1 OS TID at any given time.

WinPerl has an my_perl ptr of last resort for signal handlers and a very small % of all possible permutations of thread pool accidents (most will just SEGV on WinPerl). See PL_curinterp specifically. IDK what Perl on other OSes do. Perl also has a "safe signals" API that packetizes/async (0.25-30 ms away) resends the unix signal event back to the root my_perl, to execute at a safe control flow location inside the run loop. its called PERL_ASYNC_CHECK().

interpreter * perl_alloc_using(IPerlMem **ipM, IPerlMem **ipMS, IPerlMem **ipMP, IPerlEnv **ipE, IPerlStdIO **ipStd, IPerlLIO **ipLIO, IPerlDir **ipD, IPerlSock **ipS, IPerlProc **ipP)
{
  interpreter *my_perl;
  _PerlIO **fd;
  interpreter *result;

  my_perl = (interpreter *)(*ipM)->pCalloc(ipM, 1ui64, 4816ui64);
  if ( PL_curinterp )
  {
    Perl_set_context(my_perl);
  }
  else
  {
    PL_curinterp = my_perl;
    if ( my_perl && !PL_veto_switch_non_tTHX_context )
      Perl_switch_locale_context(my_perl);
    PL_thr_key = TlsAlloc();
    if ( PL_thr_key == -1 )
    {
      fd = Perl_PerlIO_stderr(my_perl);
      PerlIO_printf(fd, "panic: TlsAlloc");
      exit(1);
    }
    Perl_set_context(my_perl);
    InitializeCriticalSection(&PL_op_mutex);
    InitializeCriticalSection(&PL_check_mutex);
    InitializeCriticalSection(&PL_keyword_plugin_mutex);
    InitializeCriticalSection(&PL_hints_mutex);
    InitializeCriticalSection(&PL_locale_mutex.lock);

bulk88 avatar May 26 '25 20:05 bulk88

@tonycoz, is this a perl problem or a DBD-Oracle problem? If the latter, then we should direct the OP to a more appropriate place to file this bug report. Thanks.

jkeenan avatar May 26 '25 21:05 jkeenan

@tonycoz, is this a perl problem or a DBD-Oracle problem? If the latter, then we should direct the OP to a more appropriate place to file this bug report. Thanks.

This is a Perl vs libraries that create non Perl worker threads. PERL can't cope with non Perl threads entering its domain.

ancientwizard avatar May 26 '25 21:05 ancientwizard

@tonycoz, is this a perl problem or a DBD-Oracle problem? If the latter, then we should direct the OP to a more appropriate place to file this bug report. Thanks.

It appears to be the same as #22487 which is fixed in blead.

We won't know if it fixes it for OP until they test it, though they might run into other DBD::Oracle issues along the way.

Edit: if you're using a vendor perl you might be able to convince them to backport https://github.com/Perl/perl5/commit/85e97066288fa0be7b08897c6306e33ea2ef45d7

tonycoz avatar May 26 '25 22:05 tonycoz

I read somewhere online (gossip), pthreads wants to or did actually deploy to prod a change, that their pthread_t type is now a 16 byte struct and not a 8 byte size_t opaque pointer

pthreads isn't a specific implementation, but a specification. That specification does allow for non-scalar pthread_t which is why the test code in https://github.com/Perl/perl5/commit/85e97066288fa0be7b08897c6306e33ea2ef45d7 uses pthread_equal().

Though I kind of doubt any implementation does use non-scalar pthread_t - it would break too much code.

I thought the GDB output seemed to suggest this wasn't a main thread; but I wasn't exactly ready to believe because I'm not creating any additional Perl threads.

The thread start function for the thread is in libclntsh.so.23.1 rather than in the perl binary, libperl, or threads.so (the loadable object that does perl threads)

tonycoz avatar May 26 '25 22:05 tonycoz

Though I kind of doubt any implementation does use non-scalar pthread_t - it would break too much code.

https://www.ibm.com/docs/en/open-xl-c-cpp-zos/2.1.0?topic=guide-binary-compatibility

Internal type went from char thd [8] to void * I guess.

https://github.com/nginx/unit/blob/master/src/nxt_thread_id.h#L171 good reference.

It seems from my opinion, to say, certain pthread impls, pack a ptr and a 0 based kernel tid or a certain readable or unreadable memory address that the kernel returns as a "thread obj ref", together in 1 void * or into 8 bytes. The bitfield space is gained by extreme alignment of the master pthread struct, at 4096, or 10s of KBs to free up space in the void * for the kernel TID. The platform/CPU/OS might be lacking a dedicated TLS register, and having to ask the kernel each time for an abstract TID, and somehow convert the abstract number back into their large pthread internal C struct.

bulk88 avatar May 28 '25 05:05 bulk88

I have tested the patch placed in "bleed" (mentioned above) that forces Perl "WOOL' on to the non-Perl thread and then allows it to continue to handle the interrupt. I've also tested an alternative (though a real Perl internals developer should make it conform to Perl standards); Both work well to eliminate the SEGV when a non-Perl worker thread is called upon to enter Perl land to handle an interrupt. For the time being I'll apply said patch to any Perl I build within my organization as I can't use bleed there.

The alternative is shown here: https://github.com/perl5-dbi/DBD-Oracle/issues/192

ancientwizard avatar May 31 '25 19:05 ancientwizard

The patch you show as applied to blead in perl5-dbi/DBD-Oracle#192 isn't what was applied to blead. blead (5.42 to be) currently uses pthread_kill() to forward the signal to the main thread.

tonycoz avatar Jun 01 '25 01:06 tonycoz

The example I used was something I made up (the alternative) simply to convey the intent not the final code. I'll have another look and if there is a bleed fix that passes the interrupt along that's a better solution and why I offered the example SU-DO code.

As for the the original approach that someone claimed to put in bleed is on them. I saw the code and the comment they made and I tested it. I don't really know anything about bleed. I just tried to connect the dots so others could follow If they wished.

ancientwizard avatar Jun 01 '25 03:06 ancientwizard

I think this is closable, it has a fix in blead and I've added that fix to the 5.40 and 5.38 votes files.

tonycoz avatar Jul 01 '25 00:07 tonycoz

OK. Possibly a stupid question. Is this in any way a potential security vulnerability (segfaults often are)? If so, can a CVE be registered. That will then trigger some backports to get this fix into production systems without jumping through ludicrous hoops... Thanks, Justin

justinschoeman avatar Jul 01 '25 13:07 justinschoeman

OK. Possibly a stupid question. Is this in any way a potential security vulnerability (segfaults often are)? If so, can a CVE be registered. That will then trigger some backports to get this fix into production systems without jumping through ludicrous hoops... Thanks, Justin

I don't think so. That would require a user to be able to trigger this behavior.

Leont avatar Jul 01 '25 17:07 Leont

It's a NULL pointer dereference, I don't see how it could be a security issue on modern systems.

tonycoz avatar Jul 01 '25 23:07 tonycoz

Another silly question. Is it possible that this patch could cause signals to be lost to libraries in forked threads? (Or is there a significant difference in Perl 5.32 which could make this patch behave differently?)

After patching, the segfaults are finally gone, but I am now seeing issues like:

20250707214819:297011:1:Child returned: 0
20250707214941:224648:1:SIGINT!
20250707214941:297011:1:Dump logs...
20250707214941:289682:1:Dump logs...

(First number is a timestamp, 2nd the PID, then log level and log message.) The first row is immediately before dispatching an SQL query to DBD::Oracle, there is then an 80s gap until an external watchdog notices the Perl script is frozen and sends SIGINT. Then two forked processes (both waiting on Oracle Client Library calls to complete) suddenly resume.

The SQL that was supposed to be execute by those queries was successfully executed (verified in DB).

So it seems the Oracle Client Library may be blocking on signals which are not arriving.

justinschoeman avatar Jul 08 '25 12:07 justinschoeman

Another silly question. Is it possible that this patch could cause signals to be lost to libraries in forked threads? (Or is there a significant difference in Perl 5.32 which could make this patch behave differently?)

Yeah, the current logic wouldn't work after a fork. Some pthread_at_fork logic could probably fix that.

Leont avatar Jul 08 '25 22:07 Leont

The change here has no effect on signals managed by the Oracle Client Library, only on signals which you've set a perl signal handler for.

You say "forked threads" - do you mean perl threads or new processes created with fork()?

There is a problem here but I don't think this change is losing signals OCL is managing unless you're hooking those signals from perl (in which case don't do that in any case.)

tonycoz avatar Jul 08 '25 23:07 tonycoz

On Linux at least, the main thread has the same thread id from pthread_self() in both the parent and child processes.

tonycoz avatar Jul 09 '25 01:07 tonycoz

The change here has no effect on signals managed by the Oracle Client Library, only on signals which you've set a perl signal handler for.

You say "forked threads" - do you mean perl threads or new processes created with fork()?

There is a problem here but I don't think this change is losing signals OCL is managing unless you're hooking those signals from perl (in which case don't do that in any case.)

This is fork() :

  my $i = fork();
  if(!defined $i) {
    LOG(0, "fork failed: $!");
    # probably leaked and out of resources - bail and restart...
    return 0;
  }
  if($i == 0) {
    LOG(10, "child started");
    # child must not close master dbh on exit
    $mdbh->{InactiveDestroy} = 1;
    # No signal handler for the child?
    $SIG{CHLD} = 'DEFAULT';
    $req->{'child'} = $$;
    # get child db handle
    $req->{'dbh'} = DBI->connect("dbi:Oracle:host=$conf_db_host;sid=$conf_db_sid;port=$conf_db_port", "$conf_db_username", "$conf_db_password", { RaiseError => 0, AutoCommit => 1});

I do hook SIGINT and SIGCHLD - but reset SIGCHLD soon after fork().

This issue is remarkably difficult to replicate. I am starting to wonder if it was not transient flakiness from AWS again...

justinschoeman avatar Jul 09 '25 13:07 justinschoeman

On Linux at least, the main thread has the same thread id from pthread_self() in both the parent and child processes.

What do you mean with the same? They're opaque data so comparing pointers is meaningless. And you can't really pthread_equal them across the fork.

Ultimately, underneath the hood it's all about the TID (at least on Linux), and that will absolutely be different.

Leont avatar Jul 09 '25 21:07 Leont

What do you mean with the same? They're opaque data so comparing pointers is meaningless. And you can't really pthread_equal them across the fork.

Without the atfork hook change from the commit github lists above, pthread_equal() returned true when I compared the saved main thread id from the original process and the pthread_self() in the forked child.

Ultimately, underneath the hood it's all about the TID (at least on Linux), and that will absolutely be different.

From pthread_self(3):

The thread ID returned by pthread_self() is not the same thing as the kernel thread ID returned by a call to gettid(2).

tonycoz avatar Jul 09 '25 21:07 tonycoz

To check I hadn't messed up, on Linux:

tony@venus:.../perl/git$ cc -o23226-fork 23226-fork.c -lpthread
tony@venus:.../perl/git$ ./23226-fork 
child self match Yes
tony@venus:.../perl/git$ cat 23226-fork.c
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdlib.h>

int main() {
  pthread_t slf = pthread_self();

  pid_t p = fork();
  if (p == -1) {
    perror("fork");
    return 1;
  }

  if (p == 0) {
    // child
    printf("child self match %s\n", pthread_equal(slf, pthread_self()) ? "Yes" : "No");
    exit(0);
  }
  int status;
  waitpid(p, &status, 0);
}

I get the same result on the freebsd I checked:

ony@freebsd13:~/play $ cc -o23226-fork 23226-fork.c -lpthread
tony@freebsd13:~/play $ ./23226-fork
child self match Yes
tony@freebsd13:~/play $ uname -a
FreeBSD freebsd13 13.0-CURRENT FreeBSD 13.0-CURRENT r345355 GENERIC  amd64

tonycoz avatar Jul 09 '25 22:07 tonycoz

cc -o23226-fork 23226-fork.c -lpthread

I got same results on Linux and FreeBSD-14, e.g.:

$ uname -mrs
FreeBSD 14.2-RELEASE-p1 amd64
$ ./23226-fork
child self match Yes

jkeenan avatar Jul 10 '25 02:07 jkeenan

I think you believe this test shows that the child is using the same thread as the parent. Not possible. No amount of looking at thread id compares will change that. I think you should read about forking, copy on write (as pertains to forking). Next I'd read about interrupt dispatching. I kinda recall that for some signals (may be mistaken) are propagated to the child but this could be a SHELL behavior. I.E. a signal to the shell is forwarded to the child. GL

ancientwizard avatar Jul 10 '25 10:07 ancientwizard

I think you believe this test shows that the child is using the same thread as the parent. Not possible. No amount of looking at thread id compares will change that. I think you should read about forking, copy on write (as pertains to forking). Next I'd read about interrupt dispatching. I kinda recall that for some signals (may be mistaken) are propagated to the child but this could be a SHELL behavior. I.E. a signal to the shell is forwarded to the child. GL

I don't believe that at all and I don't know how you reached that conclusion.

I do believe that:

  • the pthread_t is only required to be unique within a process, so the main thread (or any threads) in separate processes can have the same pthread_t value
  • I suspect the FreeBSD libc and glibc use a constant pthread_t value for the main thread in the process, which is permitted due to the above
  • the unchanging pthread_t isn't required by POSIX, and may vary from revision to revision of libc, so we still do need to update the PL_main_thread on fork, and the commits github link above do that.

This last change won't change behaviour as tested on Linux and FreeBSD, but it is the correct thing to do.

tonycoz avatar Jul 10 '25 11:07 tonycoz