Infinite(?) recursion in Basic2DGenericTopoClusterizer::buildTopoCluster() in WF 138.3
We're getting segmentation faults in WF 138.3 that don't leave a stack trace, particularly in ASAN builds. With a partial debug build, I get a stack trace with over 17,000 stack frames:
#0 0x00007fb1d6b06de0 in edm::refitem::GetRefPtrImpl<std::vector<reco::PFRecHit, std::allocator<reco::PFRecHit> >, reco::PFRecHit, edm::refhelper::FindUsingAdvance<std::vector<reco::PFRecHit, std::allocator<reco::PFRecHit> >, reco::PFRecHit>, unsigned int>::getRefPtr_(edm::RefCore const&, unsigned int) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_4_ASAN_X_2022-03-25-1100/lib/slc7_amd64_gcc11/libDataFormatsParticleFlowReco.so
#1 0x00007fb1d6b0619d in reco::PFCluster::addRecHitFraction(reco::PFRecHitFraction const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_4_ASAN_X_2022-03-25-1100/lib/slc7_amd64_gcc11/libDataFormatsParticleFlowReco.so
#2 0x00007fb17c15b7ff in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2973, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:108
#3 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2972, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#4 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2971, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#5 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2970, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#6 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2969, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#7 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2968, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#8 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2967, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#9 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2966, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
(More stack frames follow...)
#17430 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53726, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17431 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53725, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17432 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53724, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17433 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53723, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17434 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53722, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17435 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53721, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17436 0x00007fb17c15aebb in Basic2DGenericTopoClusterizer::buildClusters (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, seedable=std::vector<bool> of length 74283, capacity 74304 = {...}, output=std::vector of length 3, capacity 4 = {...}) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:67
#17437 0x00007fb17c265574 in PFClusterProducer::produce (this=0x61800096ec80, e=..., es=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/PFClusterProducer.cc:146
A new Issue was created by @dan131riley Dan Riley.
@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign reconstruction
New categories assigned: reconstruction
@jpata,@slava77,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks
Thanks, I'm able to reproduce this with the latest ASAN build CMSSW_12_4_ASAN_X_2022-03-25-1100. I'm going to check CMSSW_12_4_ASAN_X_2022-03-21-1100 as well. EDIT: also reproducible in this release.
Are earlier ASAN builds available somewhere to check when the problem first appeared?
We don't have earlier ASAN builds, as far as I know. My recollection is that WF 138.3 has been failing in ASAN for at least a few months, but there's previously been no followup because the failures never leave a stack trace.
I looked into this a bit, and it seems buildTopoCluster recurses over all the rechits, which can easily be 50k or more.
https://github.com/cms-sw/cmssw/blob/5f75b2d25170c527f84074e9e817231cedd8ba8b/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc#L121
This does not crash the normal release, but it does crash ASAN (possibly a canary in the coalmine). We should look for ways to limit the recursion stack depth here.
Adding some PF people here: @laurenhay @marksan87 @cms-sw/pf-l2
a kind ping on this
type pf
WF 138.3 is beam splash events. Clustering the full calorimeter could be hard ;)
Are there any memory limits that these deep call stacks are hitting in ASAN?
It's probably hitting the stack size limit, and then timing out trying to generate the backtrace
It is not infinite loop but https://github.com/cms-sw/cmssw/blob/5f75b2d25170c527f84074e9e817231cedd8ba8b/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc#L73-L123 is recursively called for over 74K times. For non-ASAN IBs this does not hit the stack size but for ASAN IBs (which uses 2-3 times more memory) it hits the limits and that is why it is crashing.
@laurenhay @cms-sw/pf-l2 is the recursion over all rechits (splash or not) really meaningful here, or can perhaps it be split up across the event?
@cms-sw/pf-l2 gentle ping on this issue.
@cms-sw/pf-l2 gentle ping on this issue.
From a very quick look, one thing that strikes me is the thresholds for preshower, which is set to 6e-05 GeV. Does this make sense @cms-sw/ecal-dpg-l2 @hatakeyamak ? https://github.com/cms-sw/cmssw/blob/CMSSW_13_2_X/RecoParticleFlow/PFClusterProducer/python/particleFlowClusterPS_cfi.py#L28-L33 For other subdetectors the thresholds look reasonable, and recHits with very low energy gets rejected. But for preshower, it seems from a cout statement that I gave to check, that the algo even tries to cluster recHits of very very low energy like 0.00016608 GeV , 0.000168437 GeV etc. can this threshold for preshower be updated? This is mostly a question for ECAL DPG.
Thinking more about beam-splash, it is not guaranteed that PF clustering would work out-of-the box for beam-splashes too. PF clustering techniques are meant for real physics processes. If we want to run clustering also for beam-splashes, then maybe we can create a separate config (via an era perhaps?) and increase the seeding/gathering thresholds to very high values, so that buildTopoCluster is not called too many times.
As a test, the following changes help to avoid the reported crash (tested in CMSSW_13_3_ASAN_X_2023-09-27-2300):
--- a/RecoParticleFlow/PFClusterProducer/python/particleFlowClusterECALUncorrected_cfi.py
+++ b/RecoParticleFlow/PFClusterProducer/python/particleFlowClusterECALUncorrected_cfi.py
cms.PSet( detector = cms.string("ECAL_ENDCAP"),
- seedingThreshold = cms.double(0.60),
- seedingThresholdPt = cms.double(0.15)
+ seedingThreshold = cms.double(100), #0.60
+ seedingThresholdPt = cms.double(20) #0.15
cms.PSet( detector = cms.string("ECAL_BARREL"),
- seedingThreshold = cms.double(0.23),
- seedingThresholdPt = cms.double(0.0)
+ seedingThreshold = cms.double(100), #0.23
+ seedingThresholdPt = cms.double(20)
cms.PSet( detector = cms.string("ECAL_BARREL"),
- gatheringThreshold = cms.double(0.08),
+ gatheringThreshold = cms.double(90), #0.08
cms.PSet( detector = cms.string("ECAL_ENDCAP"),
- gatheringThreshold = cms.double(0.3),
+ gatheringThreshold = cms.double(90), #0.3
Of course this would be a special setting only for beam-splash, assuming nobody is interested to reconstruct high-level physics objects in beam-splash events.
@cms-sw/ecal-dpg-l2 @hatakeyamak please comment on the solution proposed in https://github.com/cms-sw/cmssw/issues/37362#issuecomment-1741982958
I was traveling last week with not in good condition, but I am back. Probably for now some cutoff for topo-cluster for safe guard could make sense.
Is there something pending to close this issue?
The issues is still not fixed, see e.g. CMSSW_15_0_UBSAN_X_2025-02-03-2300
Running locally under GDB I get this:
<...>
Thread 3 "cmsRun" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff171d0700 (LWP 3245048)]
0x00007ffe123c456e in edm::Handle<std::vector<reco::PFRecHit, std::allocator<reco::PFRecHit> > >::product (this=0x7fff171ca7f0) at src/DataFormats/Common/interface/Handle.h:71
71 src/DataFormats/Common/interface/Handle.h: No such file or directory.
(gdb) bt
#0 0x00007ffe123c456e in edm::Handle<std::vector<reco::PFRecHit, std::allocator<reco::PFRecHit> > >::product (this=0x7fff171ca7f0) at src/DataFormats/Common/interface/Handle.h:71
#1 edm::Handle<std::vector<reco::PFRecHit, std::allocator<reco::PFRecHit> > >::operator* (this=0x7fff171ca7f0) at src/DataFormats/Common/interface/Handle.h:81
#2 Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x7ffe17c05e40, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=24485, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=..., hcalCuts=0x0)
at src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:84
#3 0x00007ffe123c6023 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=<optimized out>, input=..., rechitMask=..., kcell=<optimized out>, used=..., topocluster=..., hcalCuts=<optimized out>) at src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:135
#4 0x00007ffe123c6023 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=<optimized out>, input=..., rechitMask=..., kcell=<optimized out>, used=..., topocluster=..., hcalCuts=<optimized out>) at src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:135
<... ... ...>
#6831 0x00007ffe123c6023 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=<optimized out>, input=..., rechitMask=..., kcell=<optimized out>, used=..., topocluster=..., hcalCuts=<optimized out>) at src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:135
#6832 0x00007ffe123c6023 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=<optimized out>, input=..., rechitMask=..., kcell=<optimized out>, used=..., topocluster=..., hcalCuts=<optimized out>) at src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:135
<at this point I got tired of waiting for the full stack to unwind>
^C#6833 0x00007ffe123c6023 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=<optimized out>, input=..., rechitMask=..., kcell=<optimized out>, used=..., topocluster=..., hcalCuts=<optimized out>Quit
@cms-sw/ecal-dpg-l2 the issue is still not fixed - happened in CMSSW_15_1_ASAN_X_2025-07-22-1100 RelVal 138.3
The preshower rechit energy peaks around 1.2e-4 GeV so the seeding threshold seems not unreasonable.
Increasing the ECAL thresholds for beam splashes as @swagata87 suggested earlier looks like an acceptable workaround. For this type of events basically all active ECAL channels produce rechits and clustering probably does not make much sense.
@thomreis @swagata87 : I implemented the suggestion of raising the ECAL thresholds for beam splashes in https://github.com/cms-sw/cmssw/pull/48663