cmssw icon indicating copy to clipboard operation
cmssw copied to clipboard

Infinite(?) recursion in Basic2DGenericTopoClusterizer::buildTopoCluster() in WF 138.3

Open dan131riley opened this issue 3 years ago • 19 comments

We're getting segmentation faults in WF 138.3 that don't leave a stack trace, particularly in ASAN builds. With a partial debug build, I get a stack trace with over 17,000 stack frames:

#0  0x00007fb1d6b06de0 in edm::refitem::GetRefPtrImpl<std::vector<reco::PFRecHit, std::allocator<reco::PFRecHit> >, reco::PFRecHit, edm::refhelper::FindUsingAdvance<std::vector<reco::PFRecHit, std::allocator<reco::PFRecHit> >, reco::PFRecHit>, unsigned int>::getRefPtr_(edm::RefCore const&, unsigned int) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_4_ASAN_X_2022-03-25-1100/lib/slc7_amd64_gcc11/libDataFormatsParticleFlowReco.so
#1  0x00007fb1d6b0619d in reco::PFCluster::addRecHitFraction(reco::PFRecHitFraction const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_4_ASAN_X_2022-03-25-1100/lib/slc7_amd64_gcc11/libDataFormatsParticleFlowReco.so
#2  0x00007fb17c15b7ff in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2973, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:108
#3  0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2972, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#4  0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2971, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#5  0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2970, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#6  0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2969, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#7  0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2968, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#8  0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2967, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#9  0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2966, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
(More stack frames follow...)
#17430 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53726, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17431 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53725, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17432 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53724, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17433 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53723, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17434 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53722, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17435 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53721, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17436 0x00007fb17c15aebb in Basic2DGenericTopoClusterizer::buildClusters (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, seedable=std::vector<bool> of length 74283, capacity 74304 = {...}, output=std::vector of length 3, capacity 4 = {...}) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:67
#17437 0x00007fb17c265574 in PFClusterProducer::produce (this=0x61800096ec80, e=..., es=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/PFClusterProducer.cc:146

dan131riley avatar Mar 26 '22 16:03 dan131riley

A new Issue was created by @dan131riley Dan Riley.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Mar 26 '22 16:03 cmsbuild

assign reconstruction

Dr15Jones avatar Mar 26 '22 22:03 Dr15Jones

New categories assigned: reconstruction

@jpata,@slava77,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Mar 26 '22 22:03 cmsbuild

Thanks, I'm able to reproduce this with the latest ASAN build CMSSW_12_4_ASAN_X_2022-03-25-1100. I'm going to check CMSSW_12_4_ASAN_X_2022-03-21-1100 as well. EDIT: also reproducible in this release.

Are earlier ASAN builds available somewhere to check when the problem first appeared?

jpata avatar Mar 28 '22 13:03 jpata

We don't have earlier ASAN builds, as far as I know. My recollection is that WF 138.3 has been failing in ASAN for at least a few months, but there's previously been no followup because the failures never leave a stack trace.

dan131riley avatar Mar 28 '22 15:03 dan131riley

I looked into this a bit, and it seems buildTopoCluster recurses over all the rechits, which can easily be 50k or more. https://github.com/cms-sw/cmssw/blob/5f75b2d25170c527f84074e9e817231cedd8ba8b/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc#L121

This does not crash the normal release, but it does crash ASAN (possibly a canary in the coalmine). We should look for ways to limit the recursion stack depth here.

Adding some PF people here: @laurenhay @marksan87 @cms-sw/pf-l2

jpata avatar Mar 29 '22 12:03 jpata

a kind ping on this

jpata avatar Apr 29 '22 13:04 jpata

type pf

jpata avatar May 05 '22 14:05 jpata

WF 138.3 is beam splash events. Clustering the full calorimeter could be hard ;)

Are there any memory limits that these deep call stacks are hitting in ASAN?

slava77 avatar May 05 '22 15:05 slava77

It's probably hitting the stack size limit, and then timing out trying to generate the backtrace

dan131riley avatar May 05 '22 15:05 dan131riley

It is not infinite loop but https://github.com/cms-sw/cmssw/blob/5f75b2d25170c527f84074e9e817231cedd8ba8b/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc#L73-L123 is recursively called for over 74K times. For non-ASAN IBs this does not hit the stack size but for ASAN IBs (which uses 2-3 times more memory) it hits the limits and that is why it is crashing.

smuzaffar avatar May 11 '22 08:05 smuzaffar

@laurenhay @cms-sw/pf-l2 is the recursion over all rechits (splash or not) really meaningful here, or can perhaps it be split up across the event?

jpata avatar May 11 '22 08:05 jpata

@cms-sw/pf-l2 gentle ping on this issue.

iarspider avatar Aug 24 '23 09:08 iarspider

@cms-sw/pf-l2 gentle ping on this issue.

iarspider avatar Sep 28 '23 12:09 iarspider

From a very quick look, one thing that strikes me is the thresholds for preshower, which is set to 6e-05 GeV. Does this make sense @cms-sw/ecal-dpg-l2 @hatakeyamak ? https://github.com/cms-sw/cmssw/blob/CMSSW_13_2_X/RecoParticleFlow/PFClusterProducer/python/particleFlowClusterPS_cfi.py#L28-L33 For other subdetectors the thresholds look reasonable, and recHits with very low energy gets rejected. But for preshower, it seems from a cout statement that I gave to check, that the algo even tries to cluster recHits of very very low energy like 0.00016608 GeV , 0.000168437 GeV etc. can this threshold for preshower be updated? This is mostly a question for ECAL DPG.

swagata87 avatar Sep 28 '23 14:09 swagata87

Thinking more about beam-splash, it is not guaranteed that PF clustering would work out-of-the box for beam-splashes too. PF clustering techniques are meant for real physics processes. If we want to run clustering also for beam-splashes, then maybe we can create a separate config (via an era perhaps?) and increase the seeding/gathering thresholds to very high values, so that buildTopoCluster is not called too many times.

As a test, the following changes help to avoid the reported crash (tested in CMSSW_13_3_ASAN_X_2023-09-27-2300):

--- a/RecoParticleFlow/PFClusterProducer/python/particleFlowClusterECALUncorrected_cfi.py
+++ b/RecoParticleFlow/PFClusterProducer/python/particleFlowClusterECALUncorrected_cfi.py

     cms.PSet( detector = cms.string("ECAL_ENDCAP"),
-              seedingThreshold = cms.double(0.60),
-              seedingThresholdPt = cms.double(0.15)
+              seedingThreshold = cms.double(100), #0.60
+              seedingThresholdPt = cms.double(20) #0.15

     cms.PSet( detector = cms.string("ECAL_BARREL"),
-              seedingThreshold = cms.double(0.23),
-              seedingThresholdPt = cms.double(0.0)
+              seedingThreshold = cms.double(100), #0.23
+              seedingThresholdPt = cms.double(20)

     cms.PSet( detector = cms.string("ECAL_BARREL"),
-              gatheringThreshold = cms.double(0.08),
+              gatheringThreshold = cms.double(90), #0.08

     cms.PSet( detector = cms.string("ECAL_ENDCAP"),
-              gatheringThreshold = cms.double(0.3),
+              gatheringThreshold = cms.double(90), #0.3

Of course this would be a special setting only for beam-splash, assuming nobody is interested to reconstruct high-level physics objects in beam-splash events.

swagata87 avatar Oct 01 '23 06:10 swagata87

@cms-sw/ecal-dpg-l2 @hatakeyamak please comment on the solution proposed in https://github.com/cms-sw/cmssw/issues/37362#issuecomment-1741982958

iarspider avatar Jan 23 '24 08:01 iarspider

I was traveling last week with not in good condition, but I am back. Probably for now some cutoff for topo-cluster for safe guard could make sense.

hatakeyamak avatar Jan 29 '24 13:01 hatakeyamak

Is there something pending to close this issue?

stahlleiton avatar Jun 24 '24 05:06 stahlleiton

The issues is still not fixed, see e.g. CMSSW_15_0_UBSAN_X_2025-02-03-2300

Running locally under GDB I get this:

<...>
Thread 3 "cmsRun" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff171d0700 (LWP 3245048)]
0x00007ffe123c456e in edm::Handle<std::vector<reco::PFRecHit, std::allocator<reco::PFRecHit> > >::product (this=0x7fff171ca7f0) at src/DataFormats/Common/interface/Handle.h:71
71	src/DataFormats/Common/interface/Handle.h: No such file or directory.
(gdb) bt
#0  0x00007ffe123c456e in edm::Handle<std::vector<reco::PFRecHit, std::allocator<reco::PFRecHit> > >::product (this=0x7fff171ca7f0) at src/DataFormats/Common/interface/Handle.h:71
#1  edm::Handle<std::vector<reco::PFRecHit, std::allocator<reco::PFRecHit> > >::operator* (this=0x7fff171ca7f0) at src/DataFormats/Common/interface/Handle.h:81
#2  Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x7ffe17c05e40, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=24485, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=..., hcalCuts=0x0)
    at src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:84
#3  0x00007ffe123c6023 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=<optimized out>, input=..., rechitMask=..., kcell=<optimized out>, used=..., topocluster=..., hcalCuts=<optimized out>) at src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:135
#4  0x00007ffe123c6023 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=<optimized out>, input=..., rechitMask=..., kcell=<optimized out>, used=..., topocluster=..., hcalCuts=<optimized out>) at src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:135
<... ... ...>
#6831 0x00007ffe123c6023 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=<optimized out>, input=..., rechitMask=..., kcell=<optimized out>, used=..., topocluster=..., hcalCuts=<optimized out>) at src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:135
#6832 0x00007ffe123c6023 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=<optimized out>, input=..., rechitMask=..., kcell=<optimized out>, used=..., topocluster=..., hcalCuts=<optimized out>) at src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:135
<at this point I got tired of waiting for the full stack to unwind>
^C#6833 0x00007ffe123c6023 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=<optimized out>, input=..., rechitMask=..., kcell=<optimized out>, used=..., topocluster=..., hcalCuts=<optimized out>Quit

iarspider avatar Feb 04 '25 11:02 iarspider

@cms-sw/ecal-dpg-l2 the issue is still not fixed - happened in CMSSW_15_1_ASAN_X_2025-07-22-1100 RelVal 138.3

iarspider avatar Jul 23 '25 08:07 iarspider

The preshower rechit energy peaks around 1.2e-4 GeV so the seeding threshold seems not unreasonable.

Increasing the ECAL thresholds for beam splashes as @swagata87 suggested earlier looks like an acceptable workaround. For this type of events basically all active ECAL channels produce rechits and clustering probably does not make much sense.

thomreis avatar Jul 30 '25 12:07 thomreis

@thomreis @swagata87 : I implemented the suggestion of raising the ECAL thresholds for beam splashes in https://github.com/cms-sw/cmssw/pull/48663

stahlleiton avatar Aug 03 '25 19:08 stahlleiton