[ARM] Assertion failure in gpuVertexFinder::fitVertices
Some workflows in CMSSW_12_6_X_2022-08-30-2300 (el8_aarch64_gcc10 architecture) are failing with an assertion failure:
cmsRun: /data/cmsbuild/jenkins_b/workspace/build-any-ib/w/tmp/BUILDROOT/1bb12fb872dfffedae97ef17ca2c849f/opt/cmssw/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-08-29-2300/src/RecoPixelVertexing/PixelVertexFinding/plugins/gpuFitVertices.h:73: void gpuVertexFinder::fitVertices(gpuVertexFinder::ZVertices*, gpuVertexFinder::WorkSpace*, float): Assertion `wv[i] > 0.f' failed.
A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.
(...)
Thread 5 (Thread 0x4000b7e99280 (LWP 1002951) "cmsRun"):
#0 0x00004000088cb4f4 in poll () from /lib64/libc.so.6
#1 0x000040000aaf5558 in full_read.constprop () from /cvmfs/cms-ib.cern.ch/nweek-02748/el8_aarch64_gcc10/cms/cmssw-patch/CMSSW_12_6_X_2022-08-30-2300/lib/el8_aarch64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x000040000aaf5fd4 in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/nweek-02748/el8_aarch64_gcc10/cms/cmssw-patch/CMSSW_12_6_X_2022-08-30-2300/lib/el8_aarch64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x000040000aaf8960 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02748/el8_aarch64_gcc10/cms/cmssw-patch/CMSSW_12_6_X_2022-08-30-2300/lib/el8_aarch64_gcc10/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x0000400008836764 in raise () from /lib64/libc.so.6
#6 0x00004000088209ac in abort () from /lib64/libc.so.6
#7 0x00004000088300a4 in __assert_fail_base () from /lib64/libc.so.6
#8 0x0000400008830110 in __assert_fail () from /lib64/libc.so.6
#9 0x00004000b8f20edc in gpuVertexFinder::Producer::make(TrackSoAHeterogeneousT<32768> const*, float, float) const () from /cvmfs/cms-ib.cern.ch/nweek-02748/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-08-29-2300/lib/el8_aarch64_gcc10/pluginRecoPixelVertexingPixelVertexFindingPlugins.so
#10 0x00004000b8f16100 in PixelVertexProducerCUDA::produceOnCPU(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02748/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-08-29-2300/lib/el8_aarch64_gcc10/pluginRecoPixelVertexingPixelVertexFindingPlugins.so
#11 0x000040000692e9f0 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02748/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-08-29-2300/lib/el8_aarch64_gcc10/libFWCoreFramework.so
#12 0x000040000692573c in edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02748/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-08-29-2300/lib/el8_aarch64_gcc10/libFWCoreFramework.so
Full log here.
A new Issue was created by @iarspider .
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign reco
assign reconstruction
New categories assigned: reconstruction
@jpata,@clacaputo,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks
It would be nice to see the line numbers in the crash reports, especially when it's hard to access a node with the specific architecture (ARM).
@slava77
Thread 1 (Thread 0xffff9f75b730 (LWP 3840103) "cmsRun"):
#0 0x0000ffff9f31b4f4 in poll () from /lib64/libc.so.6
#1 0x0000ffff9d035558 in full_read.constprop () from /cvmfs/cms-ib.cern.ch/week0/el8_aarch64_gcc10/cms/cmssw-patch/CMSSW_12_6_X_2022-08-30-2300/lib/el8_aarch64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x0000ffff9d035fd4 in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/week0/el8_aarch64_gcc10/cms/cmssw-patch/CMSSW_12_6_X_2022-08-30-2300/lib/el8_aarch64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x0000ffff9d038960 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week0/el8_aarch64_gcc10/cms/cmssw-patch/CMSSW_12_6_X_2022-08-30-2300/lib/el8_aarch64_gcc10/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x0000ffff9f286764 in raise () from /lib64/libc.so.6
#6 0x0000ffff9f2709ac in abort () from /lib64/libc.so.6
#7 0x0000ffff9f2800a4 in __assert_fail_base () from /lib64/libc.so.6
#8 0x0000ffff9f280110 in __assert_fail () from /lib64/libc.so.6
#9 0x0000fffef0e70edc in gpuVertexFinder::fitVertices (chi2Max=<optimized out>, pws=<optimized out>, pdata=<optimized out>) at /data/cmsbuild/razumov/CMSSW_12_6_X_2022-08-30-2300/src/RecoPixelVertexing/PixelVertexFinding/plugins/gpuFitVertices.h:73
#10 gpuVertexFinder::Producer::make (this=<optimized out>, tksoa=<optimized out>, ptMin=<optimized out>, ptMax=<optimized out>) at /data/cmsbuild/razumov/CMSSW_12_6_X_2022-08-30-2300/src/RecoPixelVertexing/PixelVertexFinding/plugins/gpuVertexFinder.cc:179
#11 0x0000fffef0e66100 in PixelVertexProducerCUDA::produceOnCPU (this=<optimized out>, streamID=..., iEvent=..., iSetup=...) at /data/cmsbuild/razumov/CMSSW_12_6_X_2022-08-30-2300/src/RecoPixelVertexing/PixelVertexFinding/plugins/PixelVertexProducerCUDA.cc:133
#12 0x0000ffffa146e9f0 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02748/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-08-29-2300/lib/el8_aarch64_gcc10/libFWCoreFramework.so
full log in /afs/cern.ch/user/c/cmsbuild/public/step2_TTbar_14TeV+2021_0T+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano+ALCA.log
@VinInn @fwyzard
What do atomicAdd_block, __syncthreads and such translate to without CUDA for the CPU code in
https://github.com/cms-sw/cmssw/blob/ca8617f5f5948dde414285b6b72a2aad132fd6d1/RecoPixelVertexing/PixelVertexFinding/plugins/gpuFitVertices.h#L66-L76
?
Hopefully the atomicAdd_block is still an increment.
> what do atomicAdd_block, __syncthreads and such translate to without CUDA for the CPU code in

atomicAdd_block() is indeed an addition (just without any atomicity):
https://github.com/cms-sw/cmssw/blob/b80814f7f65fdf28d7b684b95d1ca576bbf93ddb/HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h#L60-L70
and __syncthreads() is a no-op:
https://github.com/cms-sw/cmssw/blob/b80814f7f65fdf28d7b684b95d1ca576bbf93ddb/HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h#L132
(which should be fine, since the cudacompat model is to run the algorithm serially).
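For illustration, here is a minimal sketch of what such serial stand-ins could look like, and of the accumulate-then-normalize pattern that an assertion like `wv[i] > 0.f` would guard. This is not the code in cudaCompat.h or gpuFitVertices.h: the names serialAtomicAddBlock, serialSyncThreads and weightedMean are hypothetical, the variable names other than wv are illustrative, and the weight 1/ezt2 is an assumption about how the per-track weights are formed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

  // Hypothetical serial stand-in for atomicAdd_block(): a plain
  // read-modify-write that returns the previous value, with no atomicity.
  template <typename T>
  T serialAtomicAddBlock(T* address, T val) {
    T old = *address;
    *address += val;
    return old;
  }

  // Hypothetical serial stand-in for __syncthreads(): with a single
  // "thread" running the whole loop there is nothing to synchronize.
  inline void serialSyncThreads() {}

  // Accumulate-then-normalize (illustrative): zv[c] becomes the weighted mean
  // of zt over the tracks assigned to cluster c = iv[i], with the assumed
  // weight w = 1 / ezt2[i]. Returns false, instead of asserting, if any
  // cluster ends up with a total weight that is not strictly positive
  // (for example an empty cluster, or a NaN weight).
  inline bool weightedMean(const std::vector<float>& zt,
                           const std::vector<float>& ezt2,
                           const std::vector<int>& iv,
                           std::vector<float>& zv,
                           std::vector<float>& wv) {
    for (std::size_t i = 0; i < zt.size(); ++i) {
      const float w = 1.f / ezt2[i];
      serialAtomicAddBlock(&zv[iv[i]], zt[i] * w);
      serialAtomicAddBlock(&wv[iv[i]], w);
    }
    serialSyncThreads();
    for (std::size_t c = 0; c < zv.size(); ++c) {
      if (!(wv[c] > 0.f))
        return false;  // the condition the assertion would trip on
      zv[c] /= wv[c];
    }
    return true;
  }

}  // namespace sketch
```

In this sketch the check fails whenever a cluster index in the output range receives no track, or when a weight is NaN or negative, so the serial stand-ins themselves would not be the only place to look.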
Just noticed this is a duplicate of #37820. @cms-sw/reconstruction-l2 do you have a preference for which one to keep?
@makortel Sorry for the belated response. I suggest we close this one, and continue with #37820
@cmsbuild, please close