
[ARM] Assertion failure in gpuVertexFinder::fitVertices

Open iarspider opened this issue 2 years ago • 9 comments

Some workflows in CMSSW_12_6_X_2022-08-30-2300 (el8_aarch64_gcc10 architecture) are failing with an assertion failure:

cmsRun: /data/cmsbuild/jenkins_b/workspace/build-any-ib/w/tmp/BUILDROOT/1bb12fb872dfffedae97ef17ca2c849f/opt/cmssw/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-08-29-2300/src/RecoPixelVertexing/PixelVertexFinding/plugins/gpuFitVertices.h:73: void gpuVertexFinder::fitVertices(gpuVertexFinder::ZVertices*, gpuVertexFinder::WorkSpace*, float): Assertion `wv[i] > 0.f' failed.


A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.
(...)
Thread 5 (Thread 0x4000b7e99280 (LWP 1002951) "cmsRun"):
#0  0x00004000088cb4f4 in poll () from /lib64/libc.so.6
#1  0x000040000aaf5558 in full_read.constprop () from /cvmfs/cms-ib.cern.ch/nweek-02748/el8_aarch64_gcc10/cms/cmssw-patch/CMSSW_12_6_X_2022-08-30-2300/lib/el8_aarch64_gcc10/pluginFWCoreServicesPlugins.so
#2  0x000040000aaf5fd4 in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/nweek-02748/el8_aarch64_gcc10/cms/cmssw-patch/CMSSW_12_6_X_2022-08-30-2300/lib/el8_aarch64_gcc10/pluginFWCoreServicesPlugins.so
#3  0x000040000aaf8960 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02748/el8_aarch64_gcc10/cms/cmssw-patch/CMSSW_12_6_X_2022-08-30-2300/lib/el8_aarch64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000400008836764 in raise () from /lib64/libc.so.6
#6  0x00004000088209ac in abort () from /lib64/libc.so.6
#7  0x00004000088300a4 in __assert_fail_base () from /lib64/libc.so.6
#8  0x0000400008830110 in __assert_fail () from /lib64/libc.so.6
#9  0x00004000b8f20edc in gpuVertexFinder::Producer::make(TrackSoAHeterogeneousT<32768> const*, float, float) const () from /cvmfs/cms-ib.cern.ch/nweek-02748/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-08-29-2300/lib/el8_aarch64_gcc10/pluginRecoPixelVertexingPixelVertexFindingPlugins.so
#10 0x00004000b8f16100 in PixelVertexProducerCUDA::produceOnCPU(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02748/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-08-29-2300/lib/el8_aarch64_gcc10/pluginRecoPixelVertexingPixelVertexFindingPlugins.so
#11 0x000040000692e9f0 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02748/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-08-29-2300/lib/el8_aarch64_gcc10/libFWCoreFramework.so
#12 0x000040000692573c in edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02748/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-08-29-2300/lib/el8_aarch64_gcc10/libFWCoreFramework.so

Full log here.

iarspider avatar Aug 31 '22 07:08 iarspider

A new Issue was created by @iarspider .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Aug 31 '22 07:08 cmsbuild

assign reco

iarspider avatar Aug 31 '22 07:08 iarspider

assign reconstruction

iarspider avatar Aug 31 '22 11:08 iarspider

New categories assigned: reconstruction

@jpata,@clacaputo,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Aug 31 '22 11:08 cmsbuild

It would be nice to see line numbers in the crash reports, especially when it is hard to access a node with the specific architecture (ARM).

slava77 avatar Aug 31 '22 12:08 slava77

@slava77

Thread 1 (Thread 0xffff9f75b730 (LWP 3840103) "cmsRun"):
#0  0x0000ffff9f31b4f4 in poll () from /lib64/libc.so.6
#1  0x0000ffff9d035558 in full_read.constprop () from /cvmfs/cms-ib.cern.ch/week0/el8_aarch64_gcc10/cms/cmssw-patch/CMSSW_12_6_X_2022-08-30-2300/lib/el8_aarch64_gcc10/pluginFWCoreServicesPlugins.so
#2  0x0000ffff9d035fd4 in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/week0/el8_aarch64_gcc10/cms/cmssw-patch/CMSSW_12_6_X_2022-08-30-2300/lib/el8_aarch64_gcc10/pluginFWCoreServicesPlugins.so
#3  0x0000ffff9d038960 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week0/el8_aarch64_gcc10/cms/cmssw-patch/CMSSW_12_6_X_2022-08-30-2300/lib/el8_aarch64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000ffff9f286764 in raise () from /lib64/libc.so.6
#6  0x0000ffff9f2709ac in abort () from /lib64/libc.so.6
#7  0x0000ffff9f2800a4 in __assert_fail_base () from /lib64/libc.so.6
#8  0x0000ffff9f280110 in __assert_fail () from /lib64/libc.so.6
#9  0x0000fffef0e70edc in gpuVertexFinder::fitVertices (chi2Max=<optimized out>, pws=<optimized out>, pdata=<optimized out>) at /data/cmsbuild/razumov/CMSSW_12_6_X_2022-08-30-2300/src/RecoPixelVertexing/PixelVertexFinding/plugins/gpuFitVertices.h:73
#10 gpuVertexFinder::Producer::make (this=<optimized out>, tksoa=<optimized out>, ptMin=<optimized out>, ptMax=<optimized out>) at /data/cmsbuild/razumov/CMSSW_12_6_X_2022-08-30-2300/src/RecoPixelVertexing/PixelVertexFinding/plugins/gpuVertexFinder.cc:179
#11 0x0000fffef0e66100 in PixelVertexProducerCUDA::produceOnCPU (this=<optimized out>, streamID=..., iEvent=..., iSetup=...) at /data/cmsbuild/razumov/CMSSW_12_6_X_2022-08-30-2300/src/RecoPixelVertexing/PixelVertexFinding/plugins/PixelVertexProducerCUDA.cc:133
#12 0x0000ffffa146e9f0 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02748/el8_aarch64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-08-29-2300/lib/el8_aarch64_gcc10/libFWCoreFramework.so

Full log in /afs/cern.ch/user/c/cmsbuild/public/step2_TTbar_14TeV+2021_0T+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano+ALCA.log

iarspider avatar Aug 31 '22 14:08 iarspider

@VinInn @fwyzard What do atomicAdd_block, __syncthreads and the like translate to in the CPU code, without CUDA, in https://github.com/cms-sw/cmssw/blob/ca8617f5f5948dde414285b6b72a2aad132fd6d1/RecoPixelVertexing/PixelVertexFinding/plugins/gpuFitVertices.h#L66-L76 ? Hopefully atomicAdd_block is still an increment.

slava77 avatar Aug 31 '22 15:08 slava77

what do atomicAdd_block, __syncthreads and such translate to without CUDA for the CPU code in

atomicAdd_block() is indeed an addition (just without any atomicity): https://github.com/cms-sw/cmssw/blob/b80814f7f65fdf28d7b684b95d1ca576bbf93ddb/HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h#L60-L70

and __syncthreads() is a no-op: https://github.com/cms-sw/cmssw/blob/b80814f7f65fdf28d7b684b95d1ca576bbf93ddb/HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h#L132

(which should be fine since the cudacompat model is to run the algorithm serially)
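
For readers unfamiliar with this kind of compatibility layer, here is a minimal sketch of the idea described above: on the CPU the "atomic" add becomes a plain read-modify-write and the barrier becomes an empty function, because a single thread walks through all elements in order. This is not the content of cudaCompat.h. The namespace `sketch`, the helper `syncthreads_noop` and the function `accumulateWeights` are hypothetical names chosen for illustration, and the loop only loosely mirrors a per-vertex weight accumulation.

```cpp
#include <cassert>

// Hypothetical stand-ins (NOT the real cms::cudacompat definitions).
namespace sketch {

  // CPU fallback for a block-scoped atomic add: a plain addition that
  // returns the previous value, with no atomicity.
  template <typename T>
  T atomicAdd_block(T* address, T val) {
    T old = *address;
    *address += val;
    return old;
  }

  // CPU fallback for a block-wide barrier: nothing to synchronize when the
  // algorithm runs serially, so this does nothing.
  inline void syncthreads_noop() {}

  // Serial version of a "one thread per track" accumulation: every track i
  // adds its weight w[i] to the vertex iv[i] it is assigned to.
  // Assumes every iv[i] is a valid index into wv.
  inline void accumulateWeights(const float* w, const int* iv, int nTracks, float* wv) {
    for (int i = 0; i < nTracks; ++i) {
      atomicAdd_block(&wv[iv[i]], w[i]);
    }
    // On a GPU a barrier would go here before wv is read back.
    syncthreads_noop();
  }

}  // namespace sketch
```

With weights {1, 2, 4} and vertex assignments {0, 1, 0}, the two accumulated sums are 5 and 2, the same result a correctly synchronized parallel version would give. The sketch only illustrates why the serial translation is expected to be safe; it says nothing about why the `wv[i] > 0.f` assertion fires on ARM.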

makortel avatar Aug 31 '22 16:08 makortel

Just noticed this is a duplicate of #37820. @cms-sw/reconstruction-l2 do you have a preference for which one to keep?

makortel avatar Oct 03 '22 14:10 makortel

@makortel Sorry for the belated response. I suggest we close this one and continue with #37820.

mandrenguyen avatar Dec 15 '22 15:12 mandrenguyen

@cmsbuild, please close

makortel avatar Dec 15 '22 15:12 makortel