HLT crash in run392669: assertion failed in `TriggerRefsCollections`
On May 28, 2025, during run 392669 (pp collisions, Era 2025C, online release CMSSW_15_0_6), an isolated HLT crash was reported in the e-log, involving the following assertion failure:
cmsRun: src/DataFormats/HLTReco/interface/TriggerRefsCollections.h:642: trigger::size_type trigger::TriggerRefsCollections::addObjects(const trigger::Vids&, const trigger::VRl1tp2gtcand&): Assertion `ids.size() == refs.size()' failed.
A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.
The assertion is triggered here:
https://github.com/cms-sw/cmssw/blob/c62b73b5a3c48833e7267678b13dcbaf74ff8c83/DataFormats/HLTReco/interface/TriggerRefsCollections.h#L641-L646
The full log from F3 Mon is attached to the thread:
f3mon_logtable_2025-05-28T16_39_59.056Z.txt
We will try to reproduce once error files are available.
assign DataFormats/HLTReco
New categories assigned: hlt
@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks
cms-bot internal usage
A new Issue was created by @mmusich.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
This seems to be about using L1T Phase2 GT candidates - why? Are we already mixing in Phase2 L1T code?
The FED raw data files for the error stream are available at /store/group/tsg/FOG/error_stream/run392669
Repacked error streamer files are available at /store/group/tsg/FOG/error_stream_root/run392669/
I tried to reproduce (unsuccessfully) with:
#!/bin/bash -ex
# cmsrel CMSSW_15_0_6
# cd CMSSW_15_0_6/src
# cmsenv
# Set the run number
RUN_NUMBER=392669
# Define the directory with the given run number
DIR="/store/group/tsg/FOG/error_stream_root/run${RUN_NUMBER}/"
# Generate a comma-separated list of the full file paths
file_list=$(ls "/eos/cms$DIR" | awk -v dir="$DIR" '{print dir $0}' | paste -sd "," -)
# Print the result
echo "$file_list"
export MALLOC_CONF=junk:true
hltGetConfiguration run:$RUN_NUMBER \
--globaltag 150X_dataRun3_HLT_v1 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input $file_list > hlt_${RUN_NUMBER}.py
cat <<@EOF >> hlt_${RUN_NUMBER}.py
process.GlobalTag.recordsToDebug = cms.untracked.vstring()
process.options.wantSummary = True
process.options.numberOfThreads = 32
process.options.numberOfStreams = 24
@EOF
# Run cmsRun with the generated configuration
cmsRun hlt_${RUN_NUMBER}.py &> hlt_${RUN_NUMBER}.log
If it helps, here is the full log file of the HLT job that crashed online.
Some notes while investigating this crash.
I can reproduce it using the configuration provided above about 10% of the time.
The crash can be reproduced:
- running on GPUs or only on CPUs;
- running with a single thread, single stream;
- running only on the 2nd file,
run392669_ls0110_index000333_fu-c2b02-06-01_pid3861299.root.
Most of the time the crash happens during or shortly after the 78th event in that file. Sometimes the crash happens after the end of the job, while going through the global destructors.
I've tried with the following script on lxplus8-gpu:
#!/bin/bash -ex
# Set the run number
RUN_NUMBER=392669
# Define the directory with the given run number
DIR="/store/group/tsg/FOG/error_stream_root/run${RUN_NUMBER}"
# Generate a comma-separated list of the full file paths
#file_list=$(ls "/eos/cms$DIR" | awk -v dir="$DIR" '{print dir $0}' | paste -sd "," -)
file_list="$DIR/run392669_ls0110_index000333_fu-c2b02-06-01_pid3861299.root"
# Print the result
echo "$file_list"
hltGetConfiguration run:$RUN_NUMBER \
--globaltag 150X_dataRun3_HLT_v1 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input $file_list > hlt_${RUN_NUMBER}.py
cat <<@EOF >> hlt_${RUN_NUMBER}.py
process.GlobalTag.recordsToDebug = cms.untracked.vstring()
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
# Run cmsRun with the generated configuration
for i in $(seq -w 1 100); do
echo "Starting run $i..."
log_file="log_${i}.txt"
cmsRun hlt_${RUN_NUMBER}.py > "$log_file" 2>&1
status=$?
echo "Run $i finished with exit code $status" >> "$log_file"
echo "Finished run $i (exit code $status)"
done
It did crash at the 60th trial with a segmentation fault. I report below the stack trace:
Thread 1 (Thread 0x7fcd2942f580 (LWP 2277161) "cmsRun"):
#0 0x00007fcd2aaadac1 in poll () from /lib64/libc.so.6
#1 0x00007fcd23056147 in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2 0x00007fcd23056344 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x00007fcd2b97512b in edm::PuttableProductResolver::resolveProduct_(edm::Principal const&, bool, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#5 0x00007fcd2b9634c5 in edm::Principal::getByToken(edm::KindOfType, edm::TypeID const&, unsigned int, bool, bool&, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#6 0x00007fcd2b9711ac in edm::PrincipalGetAdapter::getByToken_(edm::TypeID const&, edm::KindOfType, edm::EDGetToken, edm::ModuleCallingContext const*) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#7 0x00007fcc82eae831 in void edm::GetterOfProducts<trigger::TriggerFilterObjectWithRefs>::fillHandles<edm::Event>(edm::Event const&, std::vector<edm::Handle<trigger::TriggerFilterObjectWithRefs>, std::allocator<edm::Handle<trigger::TriggerFilterObjectWithRefs> > >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/pluginHLTriggerHLTcorePlugins.so
#8 0x00007fcc82ee3a0c in TriggerSummaryProducerRAW::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/pluginHLTriggerHLTcorePlugins.so
#9 0x00007fcd2b9e2e15 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#10 0x00007fcd2b9dc2fc in edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#11 0x00007fcd2b95e589 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#12 0x00007fcd2b95ea91 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#13 0x00007fcd2bbe52a8 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#14 0x00007fcd2bb433e1 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fcd27e4be00) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-5e980a24ce71efc4677b8c6894a8a2c3/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fcd27e4be00) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-5e980a24ce71efc4677b8c6894a8a2c3/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-5e980a24ce71efc4677b8c6894a8a2c3/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007fcd2b8dcc1b in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#18 0x00007fcd2b8eac01 in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#19 0x00007fcd2b8eb0b1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#20 0x0000000000408556 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#21 0x00007fcd2bb2f9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-5e980a24ce71efc4677b8c6894a8a2c3/tbb-v2021.9.0/src/tbb/arena.cpp:688
#22 0x000000000040a283 in main::{lambda()#1}::operator()() const ()
#23 0x00000000004051b8 in main ()
the full job log is here.
The symptoms being reported are what I'd expect if there was a random memory overwrite. Would it be possible for someone to try running the job under valgrind?
Here's a log from running under valgrind: valgrind.log.
I'll try to re-run with debug symbols (when the sysadmins stop rebooting my work machine...).
> Would it be possible for someone to try running the job under valgrind?
Incidentally, this also looks similar to the symptoms of https://github.com/cms-sw/cmssw/issues/48157 (for which I had already run valgrind: https://github.com/cms-sw/cmssw/issues/48157#issuecomment-2909732499).
From the valgrind log, this one doesn't look good:
Begin processing the 78th record. Run 392669, Event 165720564, LumiSection 110 on stream 0 at 04-Jun-2025 18:29:24.559 CEST
==457520== Invalid read of size 16
==457520== at 0xC3EBC710: void std::__introsort_loop<__gnu_cxx::__normal_iterator<CTPPSPixelRecHit*, std::vector<CTPPSPixelRecHit, std::allocator<CTPPSPixelRecHit> > >, long, __gnu_cxx::__ops::_Iter_less_iter>(__gnu_cxx::__normal_iterator<CTPPSPixelRecHit*, std::vector<CTPPSPixelRecHit, std::allocator<CTPPSPixelRecHit> > >, __gnu_cxx::__normal_iterator<CTPPSPixelRecHit*, std::vector<CTPPSPixelRecHit, std::allocator<CTPPSPixelRecHit> > >, long, __gnu_cxx::__ops::_Iter_less_iter) [clone .constprop.0] (in /data/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/pluginRecoPPSLocalPlugins.so)
==457520== by 0xC3EA1764: CTPPSPixelRecHitProducer::produce(edm::Event&, edm::EventSetup const&) (in /data/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/pluginRecoPPSLocalPlugins.so)
[cut]
==457520== Address 0x141685370 is 0 bytes after a block of size 1,408 alloc'd
==457520== at 0x403BEE1: operator new(unsigned long) (in /data/cmssw/el8_amd64_gcc12/external/valgrind/3.24.0-f85a1303334507f502ed8242a93c05bd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==457520== by 0xC3EA1425: CTPPSPixelRecHitProducer::produce(edm::Event&, edm::EventSetup const&) (in /data/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/pluginRecoPPSLocalPlugins.so)
So from looking at the code, I THINK the valgrind error comes from https://github.com/cms-sw/cmssw/blob/f79df5782941b39308cf042d779265a765c51b08/DataFormats/Common/interface/DetSetVector.h#L343
since the std::__introsort_loop<...<CTPPSPixelRecHit*...>> seems related to the sorting of a std::vector<CTPPSPixelRecHit>, which is the .data of a DetSet<CTPPSPixelRecHit>.
I'm wondering if the sorting criterion for a CTPPSPixelRecHit has gone bad, i.e. some value used in the sort is not physical. The sort is
https://github.com/cms-sw/cmssw/blob/f79df5782941b39308cf042d779265a765c51b08/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h#L70
and if the value for point().mag() were a NaN, then sorting would go crazy.
> I'm wondering if the sorting criterion for a CTPPSPixelRecHit has gone bad, i.e. some value used in the sort is not physical. The sort is
> cmssw/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h, line 70 in f79df57:
> inline bool operator<(CTPPSPixelRecHit& a, CTPPSPixelRecHit& b) { return (a.point().mag() < b.point().mag()); };
> and if the value for point().mag() were a NaN, then sorting would go crazy.
The sort can also go haywire if the comparison operation isn't stable. We've had cases where a complicated calculation in the comparison operator wasn't completely consistent, which caused problems even with a stable sort. In general it's best to pre-compute the values to be sorted on, to avoid any instability and reduce the n*log(n) evaluations to n. Somewhat less ambitious, it might help to use mag2() instead of mag(), avoiding the sqrt().
OK, I can reproduce the crash running only the path HLT_DiPFJetAve180_PPSMatch_Xi0p3_QuadJet_Max2ProtPerRP_v8.
FYI @mmusich https://github.com/cms-sw/cmssw/pull/48235 does not seem to solve the issue :-/
Indeed, an added assertion fails on the incriminated event:
cmsRun: src/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h:72: bool operator<(CTPPSPixelRecHit&, CTPPSPixelRecHit&): Assertion `edm::isFinite(a.point().mag2())' failed.
The weird thing is that this happens only some of the times...
Even if I run only the PPS local reconstruction:
process.path = cms.Path(
process.hltCTPPSPixelDigis +
process.hltCTPPSPixelClusters +
process.hltCTPPSPixelRecHits
)
I get the assertion failure only some of the times.
I wonder if we are getting a memory overwrite based on an uninitialized value.
unsigned long l = ~0UL;
assert( std::isnan(*reinterpret_cast<double*>(&l)) );
passes. In fact a lot of weird bit patterns are interpreted as nan.
> In fact a lot of weird bit patterns are interpreted as nan.
A NaN has exponent all 1s and the significand/mantissa non-zero (zero mantissa is infinity), so there are a lot of bit patterns that are interpreted as a NaN. It might be interesting to enable raising floating point exceptions on NaN to see where it comes from (it might also be useless if we generate a lot of them).
So the calculation of the hit position is done here:
https://github.com/cms-sw/cmssw/blob/31cba16d05c39a0f198f730a6522e1f3d14e7dc6/RecoPPS/Local/src/RPixClusterToHit.cc#L105-L112
where the used values are calculated here https://github.com/cms-sw/cmssw/blob/31cba16d05c39a0f198f730a6522e1f3d14e7dc6/RecoPPS/Local/src/RPixClusterToHit.cc#L92-L94
and https://github.com/cms-sw/cmssw/blob/31cba16d05c39a0f198f730a6522e1f3d14e7dc6/RecoPPS/Local/src/RPixClusterToHit.cc#L83-L87
with the values being passed into the routine https://github.com/cms-sw/cmssw/blob/31cba16d05c39a0f198f730a6522e1f3d14e7dc6/RecoPPS/Local/src/RPixClusterToHit.cc#L20-L22
one item of note, aCluster is a copy of the CTPPSPixelCluster passed into the routine.
So assuming it is bad values going into the calculation of the LocalPoint (rather than the std::vector holding the hits somehow having a problem), then the only place I can see that could have had a bad floating-point value is the PPSPixelTopology, which presumably has had its values pulled during the earlier 77 events. So if a floating-point value in PPSPixelTopology went bad, then something would have had to step on it (i.e. an out-of-bounds memory write).
So one test is at line 112 of RecoPPS/Local/src/RPixClusterToHit.cc where LocalPoint lp is made, are the values already bad? If so, which one? Then backtrack to see if the value is bad coming out of the topology call.
One observation in favour of the hypothesis that the corruption happens earlier than the affected event:
- if I set skipEvents = 77 and start processing from event 78, I do not seem to get any errors;
- if I set skipEvents = 76 and start processing from event 77, I often (not always) get an assertion during event 78, when the DetSetVector<CTPPSPixelRecHit> is being put into the event: bool operator<(CTPPSPixelRecHit&, CTPPSPixelRecHit&): Assertion `edm::isFinite(a.point().mag2())' failed
However if I check and print the mag2() of all the rechits, I don't get any NaNs or seemingly invalid values.
In fact, the values are the same when the job crashes and when it doesn't ... 🤷🏻
Weird findings. Did you print mag2() before or after the insertion into the vector?
Right before.
> So one test is at line 112 of RecoPPS/Local/src/RPixClusterToHit.cc where LocalPoint lp is made, are the values already bad? If so, which one? Then backtrack to see if the value is bad coming out of the topology call.
Apparently they aren't. Using this patch:
diff --git a/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h b/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h
index cc51dda9816..33270076def 100644
--- a/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h
+++ b/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h
@@ -11,6 +11,10 @@
#include "DataFormats/GeometrySurface/interface/LocalError.h"
#include "DataFormats/GeometryVector/interface/LocalPoint.h"
+#include "FWCore/Utilities/interface/isFinite.h"
+
+#include <iostream>
+#include <cstdlib> // for std::abort
// Reconstructed hits in CTPPS Pixel detector
@@ -67,6 +71,20 @@ private:
unsigned int clusterSizeCol_;
};
-inline bool operator<(CTPPSPixelRecHit& a, CTPPSPixelRecHit& b) { return (a.point().mag() < b.point().mag()); };
+inline bool operator<(CTPPSPixelRecHit& a, CTPPSPixelRecHit& b) {
+ //assert(edm::isFinite(a.point().mag2()));
+ //assert(edm::isFinite(b.point().mag2()));
+
+ if (!edm::isFinite(a.point().mag2())) {
+ std::cerr << "Non-finite a.point().mag2(): " << a.point().mag2() << std::endl;
+ std::abort();
+ }
+ if (!edm::isFinite(b.point().mag2())) {
+ std::cerr << "Non-finite b.point().mag2(): " << b.point().mag2() << std::endl;
+ std::abort();
+ }
+
+ return (a.point().mag() < b.point().mag());
+};
#endif
diff --git a/RecoPPS/Local/src/RPixClusterToHit.cc b/RecoPPS/Local/src/RPixClusterToHit.cc
index 4171bbc8d5c..bb3981ba976 100644
--- a/RecoPPS/Local/src/RPixClusterToHit.cc
+++ b/RecoPPS/Local/src/RPixClusterToHit.cc
@@ -1,4 +1,5 @@
#include "RecoPPS/Local/interface/RPixClusterToHit.h"
+#include "FWCore/Utilities/interface/isFinite.h"
RPixClusterToHit::RPixClusterToHit(edm::ParameterSet const &conf) {
verbosity_ = conf.getUntrackedParameter<int>("RPixVerbosity");
@@ -109,6 +110,9 @@ void RPixClusterToHit::make_hit(CTPPSPixelCluster aCluster,
double varianceX = weightedVarianceX * invWeights * invWeights;
double varianceY = weightedVarianceY * invWeights * invWeights;
+ assert(edm::isFinite(avgLocalX));
+ assert(edm::isFinite(avgLocalY));
+
LocalPoint lp(avgLocalX, avgLocalY, 0);
LocalError le(varianceX, 0, varianceY);
if (verbosity_)
it gets aborted at event 78 with:
Non-finite a.point().mag2(): inf
A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.
type ctpps
assign DataFormats/CTPPSReco