Intermittent z/OS crash in libhealthcenter.so when stopping

Open kgibm opened this issue 4 years ago • 3 comments
IPCS ip verbx ledata 'nthreads(*)' shows a crash in WorkerThread::processLoop:

18    abort       HLE77C0:edcabort.c                                     
19    masterSynchSignalHandler                                           
                  j20200901                                              
20    __zerro     HLE77C0:edczerro.c                                     
21    __zerros    HLE77C0:edczerro.c                                     
27    ** NoName **.......................c..F....-....WorkerThread.cpp...
28    ** NoName **.......................c..F....-....WorkerThread.cpp...
29    ** NoName **.......................c..F....-.&..Thread.cpp...UI4349

Or Semaphore::open:

18    abort       HLE77C0:edcabort.c                                     
19    mainSynchSignalHandler                                             
                  j20201102                                              
20    __zerro     HLE77C0:edczerro.c                                     
21    __zerros    HLE77C0:edczerro.c                                     
27    ibmras::common::port::Semaphore::open(int*)                        
                  .......................c..F.b..-... ....Thread.cpp...D2
28    ibmras::common::port::Semaphore::wait(unsigned int)                
                  .......................c..F.b..-... ....Thread.cpp...D2
29    ibmras::monitoring::agent::threads::WorkerThread::processLoo       
                  .......................c..F.b..-... .-..WorkerThread.cpp
30    ibmras::monitoring::agent::threads::WorkerThread::threadEntr       
                  .......................c..F.b..-... .-..WorkerThread.cpp

kgibm, Jan 26 '21 21:01

The crash was caused by ThreadPool::stopAll destructing the WorkerThread while its thread was still running in processLoop. Destructing the WorkerThread implicitly destructed its Semaphore and, in turn, the Semaphore's fields such as name. The loop that was still running then used those destroyed objects, which is undefined behavior and drove the crash.

The solution is to comment out the stopped=true assignment in WorkerThread::stop. That function already sets running=false, which causes WorkerThread::processLoop to break out of its loop the next time it returns from the semaphore wait. processLoop then sets stopped=true itself as its last step.

I was able to consistently reproduce the problem before, but after commenting out stopped=true in WorkerThread::stop, I can no longer reproduce the problem.
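For readers unfamiliar with the pattern, here is a minimal sketch of the ownership rule behind the fix. It is not the omr-agentcore code: SketchSemaphore, SketchWorker, post, join and stopAndDestroy are hypothetical names, std::thread and std::condition_variable stand in for the agent's own port layer, and it assumes the owner checks the stopped flag before destructing the worker.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for ibmras::common::port::Semaphore.
class SketchSemaphore {
public:
    void inc() {
        { std::lock_guard<std::mutex> lock(m_); ++count_; }
        cv_.notify_one();
    }
    // Returns true if a permit was taken, false if the wait timed out.
    bool wait(unsigned int timeoutMs) {
        std::unique_lock<std::mutex> lock(m_);
        if (!cv_.wait_for(lock, std::chrono::milliseconds(timeoutMs),
                          [this] { return count_ > 0; })) {
            return false;
        }
        --count_;
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_ = 0;
};

// Hypothetical stand-in for WorkerThread.
class SketchWorker {
public:
    void start() { thread_ = std::thread([this] { processLoop(); }); }
    void post() { sem_.inc(); }

    // Only requests the stop. It deliberately does NOT set stopped_:
    // the loop may still be inside sem_.wait() at this point.
    void stop() { running_ = false; }

    bool isStopped() const { return stopped_.load(); }
    void join() { if (thread_.joinable()) thread_.join(); }

private:
    void processLoop() {
        while (running_) {
            if (sem_.wait(50)) {
                // A real worker would process queued data here.
            }
        }
        stopped_ = true;  // last use of the object by this thread
    }

    SketchSemaphore sem_;
    std::atomic<bool> running_{true};
    std::atomic<bool> stopped_{false};
    std::thread thread_;
};

// Hypothetical owner logic, loosely standing in for ThreadPool::stopAll.
bool stopAndDestroy() {
    SketchWorker worker;
    worker.start();
    worker.post();
    worker.stop();
    // Wait until the loop itself reports that it has exited.
    while (!worker.isStopped()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    worker.join();
    return worker.isStopped();
}   // worker and its semaphore are destructed only here
```

If stop() also set stopped_ to true, the wait in stopAndDestroy would return immediately and the worker could be destructed while processLoop was still blocked in the semaphore, which is the failure described above.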

kgibm, Jan 26 '21 21:01

Created PR https://github.com/RuntimeTools/omr-agentcore/pull/100

kgibm, Jan 26 '21 21:01

An additional symptom of this problem: jdmpview shows frames like the following at the top of the crash stack (note in particular the WorkerThread symbol):

bp: 0x000000517faff180 pc: 0x000000003465a940 /prd/link/wlp/wlp/E4_BMIS/lib/native/zos/s390x/../../../../java/8.0/lib/s390x/libhealthcenter.so::threadEntry__Q5_6ibmras10monitoring5agent7threadsEI12WorkerThreadFPQ4_6ibmras6common4port10ThreadData+0x20
bp: 0x000000517faff280 pc: 0x000000003460bd50 /prd/link/wlp/wlp/E4_BMIS/lib/native/zos/s390x/../../../../java/8.0/lib/s390x/libhealthcenter.so::wrapper+0x60

kgibm, Feb 16 '21 15:02