firebird icon indicating copy to clipboard operation
firebird copied to clipboard

pthread_rwlock_unlock() fails called in ~SharedMemoryBase() for a wrong thread

Open mariuszzy opened this issue 3 years ago • 18 comments

First of all please be gentle, I'm not a multithreaded software expert. This is my atemtp to build Firebird 4.0.1.2692-0 under Solaris 10 x86 with GCC 8.5.

There are two symptoms:

  1. In the log file an error is reported: Operating system call pthread_rwlock_unlock failed. Error code 1 rwlock end-operation error operating system directive endRead(): pthread_rwlock_unlock failed Not owner
  2. isql connection stalls.

An error is raised in RWLock::endRead(). What I 've figured out is that ~SharedMemoryBase() destructor tries to FileLock::unlock() but is not the thread for which FileLock::setlock() was done. A simple trace is as follows:

  • SharedMemoryBase::SharedMemoryBase(): thread: 2 file: fb_tpc_c1000000200000008c56000000000000_0
  • SharedMemoryBase::SharedMemoryBase(): thread: 2 file: fb_tpc_c1000000200000008c56000000000000_0
  • SharedMemoryBase::~SharedMemoryBase(): thread: 2 file: fb_tpc_c1000000200000008c56000000000000_0
  • SharedMemoryBase::~SharedMemoryBase(): thread: 4 file: fb_tpc_c1000000200000008c56000000000000_0

I have ServerMode = Classic in firebird.conf

mariuszzy avatar Aug 10 '22 12:08 mariuszzy

On 8/10/22 15:28, mariuszzy wrote:

First of all please be gentle, I'm not a multithreaded software expert. This is my atemtp to build Firebird 4.0.1.2692-0 under Solaris 10 x86 with GCC 8.5.

There are two symptoms:

  1. In the log file an error is reported: Operating system call pthread_rwlock_unlock failed. Error code 1 rwlock end-operation error operating system directive endRead(): pthread_rwlock_unlock failed Not owner
  2. isql connection stalls.

An error is raised in RWLock::endRead(). What I 've figured out is that ~SharedMemoryBase() destructor tries to FileLock::unlock() but is not the thread for which FileLock::setlock() was done. A simple trace is as follows:

  • SharedMemoryBase::SharedMemoryBase(): thread: 2 file: fb_tpc_c1000000200000008c56000000000000_0
  • SharedMemoryBase::SharedMemoryBase(): thread: 2 file: fb_tpc_c1000000200000008c56000000000000_0
  • SharedMemoryBase::~SharedMemoryBase(): thread: 2 file: fb_tpc_c1000000200000008c56000000000000_0
  • SharedMemoryBase::~SharedMemoryBase(): thread: 4 file: fb_tpc_c1000000200000008c56000000000000_0

Can you switch to debug build (provided you use release currently) and provide stack trace on error?

AlexPeshkoff avatar Aug 10 '22 14:08 AlexPeshkoff

I've already tried to build in dev mode but given up as there were tons of compilation errors. It took around a week to get Firebird 4 running any way. I'll give it a try.

mariuszzy avatar Aug 10 '22 19:08 mariuszzy

I've succeded building debug binaries but I don't have gdb here. In this build it crashed early during connection. Any way pthread_rwlock_unlock() failed. This is pstack output.

core '/tmp/core' of 10289: fbserver -d -p 3050 ----------------- lwp# 1 / thread# 1 -------------------- fffffd7ffeb4cb8a _lwp_kill () + a fffffd7ffeaf1dc9 raise () + 19 fffffd7ffead1340 abort () + 90 fffffd7ffdd1d3c5 _ZN8Firebird18system_call_failedC2EPKcS2_i.cold.18 () + 5 fffffd7ffdc1c8d0 _ZN8Firebird18system_call_failed5raiseEPKci () + 30 fffffd7ffdc22ea7 _ZN8Firebird8FileLock8rwUnlockEv () + 167 fffffd7ffdc23c36 _ZN8Firebird8FileLock6unlockEv () + 86 fffffd7ffdc2453c _ZN8Firebird8FileLockD1Ev () + 1c fffffd7ffdc25350 _ZN8Firebird16SharedMemoryBaseD1Ev () + 380 fffffd7ffd984142 _ZN3Jrd8TipCache15StatusBlockData5clearEPNS_9thread_dbE () + 32 fffffd7ffd984197 _ZN3Jrd8TipCache15StatusBlockDataD1Ev () + 27 fffffd7ffd984291 _ZN3Jrd8TipCache11finalizeTpcEPNS_9thread_dbE () + b1 fffffd7ffd91b930 _Z21JRD_shutdown_databasePN3Jrd8DatabaseEj () + 530 fffffd7ffd91f800 _ZL16purge_attachmentPN3Jrd9thread_dbEPNS_20StableAttachmentPartEj () + 360 fffffd7ffd920386 _ZN12_GLOBAL__N_119shutdownAttachmentsEPN3Jrd20AttachmentsRefHolderEl () + 1d6 fffffd7ffd92058e _ZL15shutdown_threadPv () + de fffffd7ffd920a37 _ZN3Jrd9JProvider8shutdownEPN8Firebird18CheckStatusWrapperEji () + d7 fffffd7ffdddbf4c ZN8Firebird17IProviderBaseImplIN3Jrd9JProviderENS_18CheckStatusWrapperENS_15IPluginBaseImplIS2_S3_NS_7InheritINS_21IReferenceCountedImplIS2_S3_NS5_INS_14IVersionedImplIS2_S3_NS5_INS_9IProviderEEEEEEEEEEEEEE23cloopshutdownDispatcherEPS8 () + 5c fffffd7ffef1b589 _ZN3Why10Dispatcher8shutdownEPN8Firebird18CheckStatusWrapperEji () + 509 fffffd7ffef1bff8 fb_shutdown () + 168 000000000057b3d1 main () + 5a1 0000000000458c9c _start () + 6c ----------------- lwp# 3 / thread# 3 -------------------- fffffd7ffeb472f7 __lwp_park () + 17 fffffd7ffeb3b59b sema_wait () + b fffffd7ffec16eb2 sem_wait () + 22 fffffd7ffefca972 _ZN8Firebird19SignalSafeSemaphore5enterEv () + 22 fffffd7ffef1c255 _ZN12_GLOBAL__N_114shutdownThreadEPv () + 35 fffffd7ffef7714a threadStart () + 6a fffffd7ffeb4704b _thr_setup () + 5b fffffd7ffeb47280 _lwp_start () ----------------- lwp# 5 / thread# 5 -------------------- fffffd7ffeb4cc5a ___lwp_cond_wait () + a fffffd7ffeb41059 cond_wait_kernel () + 59 fffffd7ffeb411f4 cond_wait_common () + 134 fffffd7ffeb41407 _cond_wait () + 47 fffffd7ffeb41446 cond_wait () + 26 fffffd7ffeb41489 pthread_cond_wait () + 9 fffffd7ffdc23403 _ZN8Firebird16SharedMemoryBase9eventWaitEPNS_7event_tEii () + 43 fffffd7ffdb6b02f _ZN3Jrd11LockManager22blocking_action_threadEv () + 1ff fffffd7ffdf33643 _ZN16ThreadFinishSyncIPN3Jrd11LockManagerEE11internalRunEPv () + 13 fffffd7ffdbfd48a threadStart () + 6a fffffd7ffeb4704b _thr_setup () + 5b fffffd7ffeb47280 _lwp_start () ----------------- lwp# 6 / thread# 6 -------------------- fffffd7ffeb4cc5a ___lwp_cond_wait () + a fffffd7ffeb41059 cond_wait_kernel () + 59 fffffd7ffeb411f4 cond_wait_common () + 134 fffffd7ffeb41407 _cond_wait () + 47 fffffd7ffeb41446 cond_wait () + 26 fffffd7ffeb41489 pthread_cond_wait () + 9 fffffd7ffdc23403 _ZN8Firebird16SharedMemoryBase9eventWaitEPNS_7event_tEii () + 43 fffffd7ffd8094bf ZN12_GLOBAL__N_110MappingIpc13clearDeliveryEPS0 () + 9f fffffd7ffd8032d3 _ZN16ThreadFinishSyncIPN12_GLOBAL__N_110MappingIpcEE11internalRunEPv () + 13 fffffd7ffdbfd48a threadStart () + 6a fffffd7ffeb4704b _thr_setup () + 5b fffffd7ffeb47280 _lwp_start ()

mariuszzy avatar Aug 10 '22 21:08 mariuszzy

baktrace.txt

Attached is the gdb baktrace. I Hope it helps to find a bug. Let me know if you need more details. Be careful that sometimes line numbers could be misleading as I've made some mods to track the problem down (isc_sync.cpp, rwlock.h).

I've connected remotely with isql and got SIGABRT after a couple of seconds.

mariuszzy avatar Aug 11 '22 19:08 mariuszzy

According to https://jira.mongodb.org/browse/WT-1277 this is really Solaris specific issue. I will take a look what can be done here.

AlexPeshkoff avatar Aug 12 '22 09:08 AlexPeshkoff

I've run through the referenced issue and tried the attached test http://people.freebsd.org/~emaste/snippets/rwlock.c with the following results:

sirius$ cc -m64 -mt -lpthread -o rwlock rwlock.c sirius$ ./rwlock Enter Testcase - ./rwlock invoking pthread_rwlock_wrlock(&rwlock) pthread_rwlock_wrlock(&rwlock) returned 0 (Error 0) invoking pthread_create(&td, NULL, unlock, NULL) pthread_create(&td, NULL, unlock, NULL) returned 0 (Error 0) invoking pthread_join(td, NULL) invoking pthread_rwlock_unlock(&rwlock) pthread_rwlock_unlock(&rwlock) returned 1 (Not owner) errno 0 (Error 0) pthread_join(td, NULL) returned 0 (Error 0)

I'm afraid it does not help much, just makes us sure that is the root of the problem.

mariuszzy avatar Aug 16 '22 20:08 mariuszzy

Try this https://drive.google.com/file/d/1r7DouGuRU9Y3-CHRVkOR8YHqYB0gJCUx/view?usp=sharing patch please. It's not complete - but I want to know is it buildable on solaris.

AlexPeshkoff avatar Oct 01 '22 17:10 AlexPeshkoff

Thank you Alexander for your support!

It has built cleanly and passes my basic tests. I'll give it more thorough tests Monday at work.

mariuszzy avatar Oct 01 '22 21:10 mariuszzy

Additional tests passed. Great job. Thanks a lot!

mariuszzy avatar Oct 03 '22 14:10 mariuszzy

On 10/3/22 17:11, mariuszzy wrote:

Additional tests passed. Great job. Thanks a lot!

Be careful - on my test there is a failure.

AlexPeshkoff avatar Oct 03 '22 14:10 AlexPeshkoff

On 22-10-03 16:24, Alexander Peshkov wrote:

On 10/3/22 17:11, mariuszzy wrote:

Additional tests passed. Great job. Thanks a lot!

Be careful - on my test there is a failure.

As I'm not multithreding expert my ugly fix was to prevent pthread_rwlock_unlock() being called on wrong thread like this:

if (level == LCK_NONE || this->thread != pthread_self())

in FileLock::unlock(). Generally it worked. Now I've rolled back my mods and applied yours. It just works so far in our sandbox. We will be testing Firebird 4 more extensively before deployment.

mariuszzy avatar Oct 03 '22 14:10 mariuszzy

This appears to be final patch. https://drive.google.com/drive/folders/13Egxproo3qSpeJ0mdECGPLQM0DeaHkch?usp=sharing Please try it.

AlexPeshkoff avatar Oct 04 '22 14:10 AlexPeshkoff

I see some serious improvements.

Just tested. It works!

On 22-10-04 16:18, Alexander Peshkov wrote:

This appears to be final patch. https://drive.google.com/drive/folders/13Egxproo3qSpeJ0mdECGPLQM0DeaHkch?usp=sharing Please try it.

— Reply to this email directly, view it on GitHub https://github.com/FirebirdSQL/firebird/issues/7253#issuecomment-1267079388, or unsubscribe https://github.com/notifications/unsubscribe-auth/A2MSFVS52D6SQJEE3CC74TTWBQ4BXANCNFSM56EK3MGA. You are receiving this because you authored the thread.Message ID: @.***>

mariuszzy avatar Oct 04 '22 18:10 mariuszzy

I’ve already tried dev build but failed. Too many compilation errors. It took almost a week to get firebird 4 running.

I’ll try my best

Mariusz

Wiadomość napisana przez Alexander Peshkov @.***> w dniu 10.08.2022, o godz. 16:39:

 On 8/10/22 15:28, mariuszzy wrote:

First of all please be gentle, I'm not a multithreaded software expert. This is my atemtp to build Firebird 4.0.1.2692-0 under Solaris 10 x86 with GCC 8.5.

There are two symptoms:

  1. In the log file an error is reported: Operating system call pthread_rwlock_unlock failed. Error code 1 rwlock end-operation error operating system directive endRead(): pthread_rwlock_unlock failed Not owner
  2. isql connection stalls.

An error is raised in RWLock::endRead(). What I 've figured out is that ~SharedMemoryBase() destructor tries to FileLock::unlock() but is not the thread for which FileLock::setlock() was done. A simple trace is as follows:

  • SharedMemoryBase::SharedMemoryBase(): thread: 2 file: fb_tpc_c1000000200000008c56000000000000_0
  • SharedMemoryBase::SharedMemoryBase(): thread: 2 file: fb_tpc_c1000000200000008c56000000000000_0
  • SharedMemoryBase::~SharedMemoryBase(): thread: 2 file: fb_tpc_c1000000200000008c56000000000000_0
  • SharedMemoryBase::~SharedMemoryBase(): thread: 4 file: fb_tpc_c1000000200000008c56000000000000_0

Can you switch to debug build (provided you use release currently) and provide stack trace on error?

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.

mariuszzy avatar Oct 11 '22 07:10 mariuszzy

On 10/2/22 00:28, mariuszzy wrote:

Thank you Alexander for your support!

It has built cleanly and passes my basic tests. I'll give it more thorough tests Monday at work.

It has at least one issue (developer's assertion), I'm investigating it.

AlexPeshkoff avatar Oct 11 '22 09:10 AlexPeshkoff

On 22-10-11 11:29, Alexander Peshkov wrote:

On 10/2/22 00:28, mariuszzy wrote:

Thank you Alexander for your support!

It has built cleanly and passes my basic tests. I'll give it more thorough tests Monday at work.

It has at least one issue (developer's assertion), I'm investigating it.

It has been deployed in one of our small shops (2-5 client apps) since couple of days. Works like a charm so far.

mariuszzy avatar Oct 11 '22 16:10 mariuszzy

On 10/11/22 19:44, mariuszzy wrote:

On 22-10-11 11:29, Alexander Peshkov wrote:

On 10/2/22 00:28, mariuszzy wrote:

Thank you Alexander for your support!

It has built cleanly and passes my basic tests. I'll give it more thorough tests Monday at work.

It has at least one issue (developer's assertion), I'm investigating it.

It has been deployed in one of our small shops (2-5 client apps) since Monday. Works like a charm so far.

Certainly I've fixed that assertion in final patch. BTW, it did not affect SS, only SC and in some rare cases CS.

AlexPeshkoff avatar Oct 11 '22 16:10 AlexPeshkoff

I'm running CS.

mariuszzy avatar Oct 11 '22 17:10 mariuszzy