omr-agentcore
Intermittent z/OS crash in libhealthcenter.so when stopping
IPCS ip verbx ledata 'nthreads(*)' shows a crash in WorkerThread::processLoop:
18 abort HLE77C0:edcabort.c
19 masterSynchSignalHandler
j20200901
20 __zerro HLE77C0:edczerro.c
21 __zerros HLE77C0:edczerro.c
27 ** NoName **.......................c..F....-....WorkerThread.cpp...
28 ** NoName **.......................c..F....-....WorkerThread.cpp...
29 ** NoName **.......................c..F....-.&..Thread.cpp...UI4349
Or Semaphore::open:
18 abort HLE77C0:edcabort.c
19 mainSynchSignalHandler
j20201102
20 __zerro HLE77C0:edczerro.c
21 __zerros HLE77C0:edczerro.c
27 ibmras::common::port::Semaphore::open(int*)
.......................c..F.b..-... ....Thread.cpp...D2
28 ibmras::common::port::Semaphore::wait(unsigned int)
.......................c..F.b..-... ....Thread.cpp...D2
29 ibmras::monitoring::agent::threads::WorkerThread::processLoo
.......................c..F.b..-... .-..WorkerThread.cpp
30 ibmras::monitoring::agent::threads::WorkerThread::threadEntr
.......................c..F.b..-... .-..WorkerThread.cpp
The problem was caused by ThreadPool::stopAll destroying the WorkerThread while it was still running in processLoop. Destroying the WorkerThread implicitly destroyed its Semaphore, which in turn destroyed the Semaphore's fields such as name, so the still-running loop went on to access destroyed objects. That undefined behavior drove the crash.
The solution is to comment out the stopped=true assignment in WorkerThread::stop. That function already sets running=false, which causes WorkerThread::processLoop to break out of its loop the next time it returns from the semaphore wait; processLoop then sets stopped=true itself just before it exits.
I was able to consistently reproduce the problem before, but after commenting out stopped=true in WorkerThread::stop, I can no longer reproduce the problem.
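The shutdown handshake described above can be sketched in isolation. This is a minimal illustration only, not the omr-agentcore source: SketchWorker, isStopped, and stopThenDestroy are hypothetical names, a std::condition_variable stands in for the agent's Semaphore, and it assumes the owner treats the stopped flag as its signal that the worker can be destroyed.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the agent's WorkerThread (illustration only).
class SketchWorker {
public:
    SketchWorker() : running(true), stopped(false) {}

    // Thread body: loops until running is cleared, then reports completion.
    void processLoop() {
        while (running.load()) {
            std::unique_lock<std::mutex> lock(mutex);
            // Stand-in for a timed semaphore wait.
            cv.wait_for(lock, std::chrono::milliseconds(10));
        }
        // Last access to any member: only the loop itself declares the
        // worker stopped, so the owner cannot see stopped == true early.
        stopped.store(true);
    }

    // Request shutdown. Deliberately does NOT set stopped.
    void stop() {
        running.store(false);
        cv.notify_all();
    }

    bool isStopped() const { return stopped.load(); }

private:
    std::atomic<bool> running;
    std::atomic<bool> stopped;
    std::mutex mutex;
    std::condition_variable cv;
};

// Hypothetical owner logic: destroy the worker only after the loop has
// confirmed that it is finished with the object's members.
bool stopThenDestroy() {
    SketchWorker* worker = new SketchWorker();
    std::thread thread(&SketchWorker::processLoop, worker);

    worker->stop();
    while (!worker->isStopped()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }

    bool confirmed = worker->isStopped();
    delete worker;   // safe: processLoop no longer touches the object
    thread.join();
    return confirmed;
}
```

If stop() also set stopped to true, the polling loop in stopThenDestroy could finish while processLoop was still inside its wait, and the delete would pull the mutex and condition variable out from under the running thread, which is the kind of race the diagnosis describes.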
Created PR https://github.com/RuntimeTools/omr-agentcore/pull/100
An additional symptom of this problem is that jdmpview shows frames like the following at the top of the crash stack (note in particular the WorkerThread symbol):
bp: 0x000000517faff180 pc: 0x000000003465a940 /prd/link/wlp/wlp/E4_BMIS/lib/native/zos/s390x/../../../../java/8.0/lib/s390x/libhealthcenter.so::threadEntry__Q5_6ibmras10monitoring5agent7threadsEI12WorkerThreadFPQ4_6ibmras6common4port10ThreadData+0x20
bp: 0x000000517faff280 pc: 0x000000003460bd50 /prd/link/wlp/wlp/E4_BMIS/lib/native/zos/s390x/../../../../java/8.0/lib/s390x/libhealthcenter.so::wrapper+0x60