perl5
perl5 copied to clipboard
Perl SEGV
Module:
Description Perl SEGV's during SIGCHLD.
BTW I opened #23241 before but that issue has been fixed by a patch and pull request submitted the DBD::Oracle myself. This isn't an Oracle only issue. Apparently and dynamic library that creates threads could find their way into Perl land trying to service an interrupt.
Steps to Reproduce The steps to reproduce are reported within https://github.com/perl5-dbi/DBD-Oracle/issues/192 I have added a test that reproduces the SEGV t/92-segv-fork.t
Expected behavior Perl should not SEGV
Perl configuration
CFLAGS="-fsanitize=address -g -fno-omit-frame-pointer" LDFLAGS="-fsanitize=address" \
./Configure -des -Dusethreads -Duse64bitint -Duse64bitall -Dprefix=/your-install-path \
-Alddlflags="-shared -g -O0 -L/usr/local/lib -fstack-protector-strong" \
-Accflags="-g -O0"
Possibly fixed by 85e97066288
Possibly fixed by 85e9706
So you're suggesting that the Oracle instant client lib starts one or more additional non Perl threads that may leak into Perl land causing mayhem? That wouldn't seem like the typical M.O. of a dynamic library.
It's only a problem when you mix library worker threads with signals.
Your backtrace:
#6 0x00007fffeeff1ec9 in kpucpincrtime () from /usr/lib/oracle/23/client64/lib/libclntsh.so.23.1
#7 0x00007ffff7bb51ca in start_thread (arg=<optimized out>) at pthread_create.c:479
#8 0x00007ffff6e5d8d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
shows this isn't a perl thread.
It's only a problem when you mix library worker threads with signals.
Your backtrace:
#6 0x00007fffeeff1ec9 in kpucpincrtime () from /usr/lib/oracle/23/client64/lib/libclntsh.so.23.1 #7 0x00007ffff7bb51ca in start_thread (arg=<optimized out>) at pthread_create.c:479 #8 0x00007ffff6e5d8d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95shows this isn't a perl thread.
Thanks for the observation. I thought the GDB output seemed to suggest this wasn't a main thread; but I wasn't exactly ready to believe because I'm not creating any additional Perl threads. I have confirmed that once the Perl has connected and then disconnected from the DB the process does have an additional thread. I can easily see it by inspecting /proc/$$/status. Once created my app is extremely suspectable to SEGV's.
I'm in no position to coerce or convince the Perl maintainers to make change to Perl's SIG handling. ALSO: I have yet to find a way to tell the OCI to A) not create threads B) keep its threads on a leash!
I suppose there is a reason Oracle didn't have a built-in solution to funnel signals to the main thread; starting with they know a whole lot more than I do on the subject.
So does this still happen on blead or not? It's not clear which Perl version you're using.
Possibly fixed by 85e9706
I read somewhere online (gossip), pthreads wants to or did actually deploy to prod a change, that their pthread_t type is now a 16 byte struct and not a 8 byte size_t opaque pointer. I forgot the rational but either it removes 1 level of ptr derefs, or a pthread_t object has no destructor and no memory behind it anymore, its not ref counted, its not malloced, its pass by copy, with a benefit of infinite CPU core count scaling because there is no more user mode global lock for getter/setter methods for pthread metadata/TLS.
Upgrading to 16 byte structs did cause widespread generic common POSIX app breakage, because suddenly == C operator syntax errors.
Google brings up https://forums.oracle.com/ords/apexds/post/issue-with-system-call-with-libclntsh-so-19-1-0673 and SIGCHILD drama.
Thanks for the observation. I thought the GDB output seemed to suggest this wasn't a main thread; but I wasn't exactly ready to believe because I'm not creating any additional Perl threads. I have confirmed that once the Perl has connected and then disconnected from the DB the process does have an additional thread. I can easily see it by inspecting /proc/$$/status. Once created my app is extremely suspectable to SEGV's.
I'm in no position to coerce or convince the Perl maintainers to make change to Perl's SIG handling. ALSO: I have yet to find a way to tell the OCI to A) not create threads B) keep its threads on a leash!
I suppose there is a reason Oracle didn't have a built-in solution to funnel signals to the main thread; starting with they know a whole lot more than I do on the subject.
SP my API names before using them. dTHX;/Perl_get_context();/ENTER/SAVETEMPS/PUSHMARK/Perl_call_sv() executing on a random OS thread, from a OS/LibC thread pool, is always user error. And a very common recurrent mousetrap bug for all Perl CPAN XS code running on Windows. Without detaching a my_perl ptr from 1 OS thread, pause/sleeping/idling the old OS thread, and attach the my_perl ptr to a new OS thread with Perl_set_context(), any other behavior is illegal/invalid.
1 my_perl ptr can be bound to exactly 1 OS TID at any given time.
WinPerl has an my_perl ptr of last resort for signal handlers and a very small % of all possible permutations of thread pool accidents (most will just SEGV on WinPerl). See PL_curinterp specifically. IDK what Perl on other OSes do. Perl also has a "safe signals" API that packetizes/async (0.25-30 ms away) resends the unix signal event back to the root my_perl, to execute at a safe control flow location inside the run loop. its called PERL_ASYNC_CHECK().
interpreter * perl_alloc_using(IPerlMem **ipM, IPerlMem **ipMS, IPerlMem **ipMP, IPerlEnv **ipE, IPerlStdIO **ipStd, IPerlLIO **ipLIO, IPerlDir **ipD, IPerlSock **ipS, IPerlProc **ipP)
{
interpreter *my_perl;
_PerlIO **fd;
interpreter *result;
my_perl = (interpreter *)(*ipM)->pCalloc(ipM, 1ui64, 4816ui64);
if ( PL_curinterp )
{
Perl_set_context(my_perl);
}
else
{
PL_curinterp = my_perl;
if ( my_perl && !PL_veto_switch_non_tTHX_context )
Perl_switch_locale_context(my_perl);
PL_thr_key = TlsAlloc();
if ( PL_thr_key == -1 )
{
fd = Perl_PerlIO_stderr(my_perl);
PerlIO_printf(fd, "panic: TlsAlloc");
exit(1);
}
Perl_set_context(my_perl);
InitializeCriticalSection(&PL_op_mutex);
InitializeCriticalSection(&PL_check_mutex);
InitializeCriticalSection(&PL_keyword_plugin_mutex);
InitializeCriticalSection(&PL_hints_mutex);
InitializeCriticalSection(&PL_locale_mutex.lock);
@tonycoz, is this a perl problem or a DBD-Oracle problem? If the latter, then we should direct the OP to a more appropriate place to file this bug report. Thanks.
@tonycoz, is this a perl problem or a DBD-Oracle problem? If the latter, then we should direct the OP to a more appropriate place to file this bug report. Thanks.
This is a Perl vs libraries that create non Perl worker threads. PERL can't cope with non Perl threads entering its domain.
@tonycoz, is this a perl problem or a DBD-Oracle problem? If the latter, then we should direct the OP to a more appropriate place to file this bug report. Thanks.
It appears to be the same as #22487 which is fixed in blead.
We won't know if it fixes it for OP until they test it, though they might run into other DBD::Oracle issues along the way.
Edit: if you're using a vendor perl you might be able to convince them to backport https://github.com/Perl/perl5/commit/85e97066288fa0be7b08897c6306e33ea2ef45d7
I read somewhere online (gossip), pthreads wants to or did actually deploy to prod a change, that their
pthread_ttype is now a 16 byte struct and not a 8 bytesize_topaque pointer
pthreads isn't a specific implementation, but a specification. That specification does allow for non-scalar pthread_t which is why the test code in https://github.com/Perl/perl5/commit/85e97066288fa0be7b08897c6306e33ea2ef45d7 uses pthread_equal().
Though I kind of doubt any implementation does use non-scalar pthread_t - it would break too much code.
I thought the GDB output seemed to suggest this wasn't a main thread; but I wasn't exactly ready to believe because I'm not creating any additional Perl threads.
The thread start function for the thread is in libclntsh.so.23.1 rather than in the perl binary, libperl, or threads.so (the loadable object that does perl threads)
Though I kind of doubt any implementation does use non-scalar
pthread_t- it would break too much code.
https://www.ibm.com/docs/en/open-xl-c-cpp-zos/2.1.0?topic=guide-binary-compatibility
Internal type went from char thd [8] to void * I guess.
https://github.com/nginx/unit/blob/master/src/nxt_thread_id.h#L171 good reference.
It seems from my opinion, to say, certain pthread impls, pack a ptr and a 0 based kernel tid or a certain readable or unreadable memory address that the kernel returns as a "thread obj ref", together in 1 void * or into 8 bytes. The bitfield space is gained by extreme alignment of the master pthread struct, at 4096, or 10s of KBs to free up space in the void * for the kernel TID. The platform/CPU/OS might be lacking a dedicated TLS register, and having to ask the kernel each time for an abstract TID, and somehow convert the abstract number back into their large pthread internal C struct.
I have tested the patch placed in "bleed" (mentioned above) that forces Perl "WOOL' on to the non-Perl thread and then allows it to continue to handle the interrupt. I've also tested an alternative (though a real Perl internals developer should make it conform to Perl standards); Both work well to eliminate the SEGV when a non-Perl worker thread is called upon to enter Perl land to handle an interrupt. For the time being I'll apply said patch to any Perl I build within my organization as I can't use bleed there.
The alternative is shown here: https://github.com/perl5-dbi/DBD-Oracle/issues/192
The patch you show as applied to blead in perl5-dbi/DBD-Oracle#192 isn't what was applied to blead. blead (5.42 to be) currently uses pthread_kill() to forward the signal to the main thread.
The example I used was something I made up (the alternative) simply to convey the intent not the final code. I'll have another look and if there is a bleed fix that passes the interrupt along that's a better solution and why I offered the example SU-DO code.
As for the the original approach that someone claimed to put in bleed is on them. I saw the code and the comment they made and I tested it. I don't really know anything about bleed. I just tried to connect the dots so others could follow If they wished.
I think this is closable, it has a fix in blead and I've added that fix to the 5.40 and 5.38 votes files.
OK. Possibly a stupid question. Is this in any way a potential security vulnerability (segfaults often are)? If so, can a CVE be registered. That will then trigger some backports to get this fix into production systems without jumping through ludicrous hoops... Thanks, Justin
OK. Possibly a stupid question. Is this in any way a potential security vulnerability (segfaults often are)? If so, can a CVE be registered. That will then trigger some backports to get this fix into production systems without jumping through ludicrous hoops... Thanks, Justin
I don't think so. That would require a user to be able to trigger this behavior.
It's a NULL pointer dereference, I don't see how it could be a security issue on modern systems.
Another silly question. Is it possible that this patch could cause signals to be lost to libraries in forked threads? (Or is there a significant difference in Perl 5.32 which could make this patch behave differently?)
After patching, the segfaults are finally gone, but I am now seeing issues like:
20250707214819:297011:1:Child returned: 0
20250707214941:224648:1:SIGINT!
20250707214941:297011:1:Dump logs...
20250707214941:289682:1:Dump logs...
(First number is a timestamp, 2nd the PID, then log level and log message.) The first row is immediately before dispatching an SQL query to DBD::Oracle, there is then an 80s gap until an external watchdog notices the Perl script is frozen and sends SIGINT. Then two forked processes (both waiting on Oracle Client Library calls to complete) suddenly resume.
The SQL that was supposed to be execute by those queries was successfully executed (verified in DB).
So it seems the Oracle Client Library may be blocking on signals which are not arriving.
Another silly question. Is it possible that this patch could cause signals to be lost to libraries in forked threads? (Or is there a significant difference in Perl 5.32 which could make this patch behave differently?)
Yeah, the current logic wouldn't work after a fork. Some pthread_at_fork logic could probably fix that.
The change here has no effect on signals managed by the Oracle Client Library, only on signals which you've set a perl signal handler for.
You say "forked threads" - do you mean perl threads or new processes created with fork()?
There is a problem here but I don't think this change is losing signals OCL is managing unless you're hooking those signals from perl (in which case don't do that in any case.)
On Linux at least, the main thread has the same thread id from pthread_self() in both the parent and child processes.
The change here has no effect on signals managed by the Oracle Client Library, only on signals which you've set a perl signal handler for.
You say "forked threads" - do you mean perl threads or new processes created with fork()?
There is a problem here but I don't think this change is losing signals OCL is managing unless you're hooking those signals from perl (in which case don't do that in any case.)
This is fork() :
my $i = fork();
if(!defined $i) {
LOG(0, "fork failed: $!");
# probably leaked and out of resources - bail and restart...
return 0;
}
if($i == 0) {
LOG(10, "child started");
# child must not close master dbh on exit
$mdbh->{InactiveDestroy} = 1;
# No signal handler for the child?
$SIG{CHLD} = 'DEFAULT';
$req->{'child'} = $$;
# get child db handle
$req->{'dbh'} = DBI->connect("dbi:Oracle:host=$conf_db_host;sid=$conf_db_sid;port=$conf_db_port", "$conf_db_username", "$conf_db_password", { RaiseError => 0, AutoCommit => 1});
I do hook SIGINT and SIGCHLD - but reset SIGCHLD soon after fork().
This issue is remarkably difficult to replicate. I am starting to wonder if it was not transient flakiness from AWS again...
On Linux at least, the main thread has the same thread id from pthread_self() in both the parent and child processes.
What do you mean with the same? They're opaque data so comparing pointers is meaningless. And you can't really pthread_equal them across the fork.
Ultimately, underneath the hood it's all about the TID (at least on Linux), and that will absolutely be different.
What do you mean with the same? They're opaque data so comparing pointers is meaningless. And you can't really pthread_equal them across the fork.
Without the atfork hook change from the commit github lists above, pthread_equal() returned true when I compared the saved main thread id from the original process and the pthread_self() in the forked child.
Ultimately, underneath the hood it's all about the TID (at least on Linux), and that will absolutely be different.
From pthread_self(3):
The thread ID returned by pthread_self() is not the same thing as the kernel thread ID returned by a call to gettid(2).
To check I hadn't messed up, on Linux:
tony@venus:.../perl/git$ cc -o23226-fork 23226-fork.c -lpthread
tony@venus:.../perl/git$ ./23226-fork
child self match Yes
tony@venus:.../perl/git$ cat 23226-fork.c
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdlib.h>
int main() {
pthread_t slf = pthread_self();
pid_t p = fork();
if (p == -1) {
perror("fork");
return 1;
}
if (p == 0) {
// child
printf("child self match %s\n", pthread_equal(slf, pthread_self()) ? "Yes" : "No");
exit(0);
}
int status;
waitpid(p, &status, 0);
}
I get the same result on the freebsd I checked:
ony@freebsd13:~/play $ cc -o23226-fork 23226-fork.c -lpthread
tony@freebsd13:~/play $ ./23226-fork
child self match Yes
tony@freebsd13:~/play $ uname -a
FreeBSD freebsd13 13.0-CURRENT FreeBSD 13.0-CURRENT r345355 GENERIC amd64
cc -o23226-fork 23226-fork.c -lpthread
I got same results on Linux and FreeBSD-14, e.g.:
$ uname -mrs
FreeBSD 14.2-RELEASE-p1 amd64
$ ./23226-fork
child self match Yes
I think you believe this test shows that the child is using the same thread as the parent. Not possible. No amount of looking at thread id compares will change that. I think you should read about forking, copy on write (as pertains to forking). Next I'd read about interrupt dispatching. I kinda recall that for some signals (may be mistaken) are propagated to the child but this could be a SHELL behavior. I.E. a signal to the shell is forwarded to the child. GL
I think you believe this test shows that the child is using the same thread as the parent. Not possible. No amount of looking at thread id compares will change that. I think you should read about forking, copy on write (as pertains to forking). Next I'd read about interrupt dispatching. I kinda recall that for some signals (may be mistaken) are propagated to the child but this could be a SHELL behavior. I.E. a signal to the shell is forwarded to the child. GL
I don't believe that at all and I don't know how you reached that conclusion.
I do believe that:
- the pthread_t is only required to be unique within a process, so the main thread (or any threads) in separate processes can have the same pthread_t value
- I suspect the FreeBSD libc and glibc use a constant pthread_t value for the main thread in the process, which is permitted due to the above
- the unchanging pthread_t isn't required by POSIX, and may vary from revision to revision of libc, so we still do need to update the PL_main_thread on fork, and the commits github link above do that.
This last change won't change behaviour as tested on Linux and FreeBSD, but it is the correct thing to do.