cmssw icon indicating copy to clipboard operation
cmssw copied to clipboard

HLT crash in run392669: assertion failed in `TriggerRefsCollections`

Open mmusich opened this issue 7 months ago • 61 comments

On May 28, 2025, during run 392669 (pp collisions, Era 2025C, online release CMSSW_15_0_6) we got an isolated HLT crash e-log involving the following assertion failure:

cmsRun: src/DataFormats/HLTReco/interface/TriggerRefsCollections.h:642: trigger::size_type trigger::TriggerRefsCollections::addObjects(const trigger::Vids&, const trigger::VRl1tp2gtcand&): Assertion `ids.size() == refs.size()' failed.


A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

Which gets triggered here:

https://github.com/cms-sw/cmssw/blob/c62b73b5a3c48833e7267678b13dcbaf74ff8c83/DataFormats/HLTReco/interface/TriggerRefsCollections.h#L641-L646

The full log from F3 Mon is attached to the thread:

f3mon_logtable_2025-05-28T16_39_59.056Z.txt

We will try to reproduce once error files are available.

mmusich avatar May 29 '25 07:05 mmusich

assign DataFormats/HLTReco

mmusich avatar May 29 '25 07:05 mmusich

New categories assigned: hlt

@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar May 29 '25 07:05 cmsbuild

cms-bot internal usage

cmsbuild avatar May 29 '25 07:05 cmsbuild

A new Issue was created by @mmusich.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar May 29 '25 07:05 cmsbuild

This seems to be about using L1T Phase2 GT candidates - why? Are we already mixing in Phase2 L1T code?

Martin-Grunewald avatar May 29 '25 10:05 Martin-Grunewald

The FED raw data files for the error stream are available at /store/group/tsg/FOG/error_stream/run392669 Repacked error streamer files are available at /store/group/tsg/FOG/error_stream_root/run392669/

I tried to reproduce (unsuccessfully) with:

#!/bin/bash -ex                                                                                                                                                                                             

# cmsrel CMSSW_15_0_6                                                                                                                                                                                       
# cd CMSSW_15_0_6/src                                                                                                                                                                                       
# cmsenv                                                                                                                                                                                                    

# Get the run number from the first argument                                                                                                                                                                
RUN_NUMBER=392669

# Define the directory with the given run number                                                                                                                                                            
DIR="/store/group/tsg/FOG/error_stream_root/run${RUN_NUMBER}/"

# Generate a comma-separated list of the full file paths                                                                                                                                                    
file_list=$(ls "/eos/cms$DIR" | awk -v dir="$DIR" '{print dir $0}' | paste -sd "," -)

# Print the result                                                                                                                                                                                          
echo "$file_list"

export MALLOC_CONF=junk:true

hltGetConfiguration run:$RUN_NUMBER \
                    --globaltag 150X_dataRun3_HLT_v1 \
                    --data \
                    --no-prescale \
                    --no-output \
                    --max-events -1 \
                    --input $file_list  > hlt_${RUN_NUMBER}.py

cat <<@EOF >> hlt_${RUN_NUMBER}.py
process.GlobalTag.recordsToDebug = cms.untracked.vstring()
process.options.wantSummary = True
process.options.numberOfThreads = 32
process.options.numberOfStreams = 24
@EOF

# Run cmsRun with the generated configuration                                                                                                                                                               
cmsRun hlt_${RUN_NUMBER}.py &> hlt_${RUN_NUMBER}.log

mmusich avatar May 29 '25 13:05 mmusich

If it helps, here is the full log file of the HLT job that crashed online.

old_hlt_run392669_pid3861299.log

missirol avatar Jun 02 '25 14:06 missirol

Some notes while investigating this crash.

I can reproduce it using the configuration provided above about 10% of the times.

The crash can be reproduced:

  • running on GPUs or only on CPUs;
  • running with a single thread, single stream;
  • running only on the 2nd file, run392669_ls0110_index000333_fu-c2b02-06-01_pid3861299.root.

Most of the time the crash happens during or shortly after the 78th event in that file. Sometimes the crash happens after the end of the job, while going through the global destructors.

fwyzard avatar Jun 04 '25 14:06 fwyzard

The crash can be reproduced:

I've tried with the following script on lxplus8-gpu:

#!/bin/bash -ex                                                                                                                                                                                             

# Get the run number from the first argument                                                                                                                                                                
RUN_NUMBER=392669

# Define the directory with the given run number                                                                                                                                                            
DIR="/store/group/tsg/FOG/error_stream_root/run${RUN_NUMBER}"

# Generate a comma-separated list of the full file paths                                                                                                                                                    
#file_list=$(ls "/eos/cms$DIR" | awk -v dir="$DIR" '{print dir $0}' | paste -sd "," -)                                                                                                                      
file_list="$DIR/run392669_ls0110_index000333_fu-c2b02-06-01_pid3861299.root"

# Print the result                                                                                                                                                                                          
echo "$file_list"

hltGetConfiguration run:$RUN_NUMBER \
                    --globaltag 150X_dataRun3_HLT_v1 \
                    --data \
                    --no-prescale \
                    --no-output \
                    --max-events -1 \
                    --input $file_list  > hlt_${RUN_NUMBER}.py

cat <<@EOF >> hlt_${RUN_NUMBER}.py
process.GlobalTag.recordsToDebug = cms.untracked.vstring()
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

# Run cmsRun with the generated configuration                                                                                                                                                               
for i in $(seq -w 1 100); do
    echo "Starting run $i..."
    log_file="log_${i}.txt"
    cmsRun hlt_${RUN_NUMBER}.py > "$log_file" 2>&1
    status=$?
    echo "Run $i finished with exit code $status" >> "$log_file"
    echo "Finished run $i (exit code $status)"
done

It did crash at the 60th trial with a segmentation fault. I report below the stack trace:

Thread 1 (Thread 0x7fcd2942f580 (LWP 2277161) "cmsRun"):
#0  0x00007fcd2aaadac1 in poll () from /lib64/libc.so.6
#1  0x00007fcd23056147 in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007fcd23056344 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fcd2b97512b in edm::PuttableProductResolver::resolveProduct_(edm::Principal const&, bool, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#5  0x00007fcd2b9634c5 in edm::Principal::getByToken(edm::KindOfType, edm::TypeID const&, unsigned int, bool, bool&, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#6  0x00007fcd2b9711ac in edm::PrincipalGetAdapter::getByToken_(edm::TypeID const&, edm::KindOfType, edm::EDGetToken, edm::ModuleCallingContext const*) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#7  0x00007fcc82eae831 in void edm::GetterOfProducts<trigger::TriggerFilterObjectWithRefs>::fillHandles<edm::Event>(edm::Event const&, std::vector<edm::Handle<trigger::TriggerFilterObjectWithRefs>, std::allocator<edm::Handle<trigger::TriggerFilterObjectWithRefs> > >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/pluginHLTriggerHLTcorePlugins.so
#8  0x00007fcc82ee3a0c in TriggerSummaryProducerRAW::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/pluginHLTriggerHLTcorePlugins.so
#9  0x00007fcd2b9e2e15 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#10 0x00007fcd2b9dc2fc in edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#11 0x00007fcd2b95e589 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#12 0x00007fcd2b95ea91 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#13 0x00007fcd2bbe52a8 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#14 0x00007fcd2bb433e1 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fcd27e4be00) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-5e980a24ce71efc4677b8c6894a8a2c3/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fcd27e4be00) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-5e980a24ce71efc4677b8c6894a8a2c3/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-5e980a24ce71efc4677b8c6894a8a2c3/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007fcd2b8dcc1b in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#18 0x00007fcd2b8eac01 in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#19 0x00007fcd2b8eb0b1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/libFWCoreFramework.so
#20 0x0000000000408556 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#21 0x00007fcd2bb2f9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-5e980a24ce71efc4677b8c6894a8a2c3/tbb-v2021.9.0/src/tbb/arena.cpp:688
#22 0x000000000040a283 in main::{lambda()#1}::operator()() const ()
#23 0x00000000004051b8 in main ()

the full job log is here.

mmusich avatar Jun 05 '25 10:06 mmusich

The symptoms being reported are what I'd expect if there was a random memory overwrite. Would it be possible for someone to try running the job under valgrind?

Dr15Jones avatar Jun 05 '25 13:06 Dr15Jones

Here's a log from running under valgrind: valgrind.log.

I'll try to re-run with debug symbols (when the sysadmins stop rebooting my work machine...).

fwyzard avatar Jun 05 '25 13:06 fwyzard

Would it be possible for someone to try running the job under valgrind?

incidentally, this looks also similar to the symptoms of https://github.com/cms-sw/cmssw/issues/48157 (for which I had already run valgrind https://github.com/cms-sw/cmssw/issues/48157#issuecomment-2909732499.

mmusich avatar Jun 05 '25 13:06 mmusich

From the valgrind log, this one doesn't look good

Begin processing the 78th record. Run 392669, Event 165720564, LumiSection 110 on stream 0 at 04-Jun-2025 18:29:24.559 CEST
==457520== Invalid read of size 16
==457520==    at 0xC3EBC710: void std::__introsort_loop<__gnu_cxx::__normal_iterator<CTPPSPixelRecHit*, std::vector<CTPPSPixelRecHit, std::allocator<CTPPSPixelRecHit> > >, long, __gnu_cxx::__ops::_Iter_less_iter>(__gnu_cxx::__normal_iterator<CTPPSPixelRecHit*, std::vector<CTPPSPixelRecHit, std::allocator<CTPPSPixelRecHit> > >, __gnu_cxx::__normal_iterator<CTPPSPixelRecHit*, std::vector<CTPPSPixelRecHit, std::allocator<CTPPSPixelRecHit> > >, long, __gnu_cxx::__ops::_Iter_less_iter) [clone .constprop.0] (in /data/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/pluginRecoPPSLocalPlugins.so)
==457520==    by 0xC3EA1764: CTPPSPixelRecHitProducer::produce(edm::Event&, edm::EventSetup const&) (in /data/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/pluginRecoPPSLocalPlugins.so)
[cut]
==457520==  Address 0x141685370 is 0 bytes after a block of size 1,408 alloc'd
==457520==    at 0x403BEE1: operator new(unsigned long) (in /data/cmssw/el8_amd64_gcc12/external/valgrind/3.24.0-f85a1303334507f502ed8242a93c05bd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==457520==    by 0xC3EA1425: CTPPSPixelRecHitProducer::produce(edm::Event&, edm::EventSetup const&) (in /data/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_6/lib/el8_amd64_gcc12/pluginRecoPPSLocalPlugins.so)

Dr15Jones avatar Jun 05 '25 13:06 Dr15Jones

So from looking at the code, I THINK the valgrind error comes from https://github.com/cms-sw/cmssw/blob/f79df5782941b39308cf042d779265a765c51b08/DataFormats/Common/interface/DetSetVector.h#L343

since the std::__introsort_loop<....<CTPPSPixelRecHit*...>> seems related to sorting of a std::vector<CTPPSPixelRecHit> which is the .data of a DetSet< CTPPSPixelRecHit>.

I'm wondering if the sorting criteria for a CTPPSPixelRecHit has gone bad , i.e. some value used in the sort is not physical. The sort is

https://github.com/cms-sw/cmssw/blob/f79df5782941b39308cf042d779265a765c51b08/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h#L70

and if the value for point().mag() where a nan than sorting would go crazy.

Dr15Jones avatar Jun 05 '25 14:06 Dr15Jones

I'm wondering if the sorting criteria for a CTPPSPixelRecHit has gone bad , i.e. some value used in the sort is not physical. The sort is

cmssw/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h

Line 70 in f79df57

inline bool operator<(CTPPSPixelRecHit& a, CTPPSPixelRecHit& b) { return (a.point().mag() < b.point().mag()); }; and if the value for point().mag() where a nan than sorting would go crazy.

The sort can also go haywire if the comparison operation isn't stable. We've had cases where a complicated calculation in the comparison operator wasn't completely consistent, which caused problems even with a stable sort. In general it's best to pre-compute the values to be sorted on, to avoid any instability and reduce the n*log(n) evaluations to n. Somewhat less ambitious, it might help to use mag2() instead of mag(), avoiding the sqrt().

dan131riley avatar Jun 05 '25 14:06 dan131riley

OK, I can reproduce the crash running only the path HLT_DiPFJetAve180_PPSMatch_Xi0p3_QuadJet_Max2ProtPerRP_v8.

fwyzard avatar Jun 05 '25 15:06 fwyzard

FYI @mmusich https://github.com/cms-sw/cmssw/pull/48235 does not seem to solve the issue :-/

fwyzard avatar Jun 05 '25 16:06 fwyzard

Indeed, adding an assertion fails on the incriminated event:

cmsRun: src/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h:72: bool operator<(CTPPSPixelRecHit&, CTPPSPixelRecHit&): Assertion `edm::isFinite(a.point().mag2())' failed.

The weird thing is that this happens only some of the times...

fwyzard avatar Jun 05 '25 16:06 fwyzard

Even if I run only the PPS local reconstruction:

process.path = cms.Path(
    process.hltCTPPSPixelDigis +
    process.hltCTPPSPixelClusters +
    process.hltCTPPSPixelRecHits
)

I get the assertion failure only some of the times.

fwyzard avatar Jun 05 '25 16:06 fwyzard

I wonder if we are getting a memory overwrite based on an uninitialized value.

 unsigned long l = ~0UL;
 assert( std::isnan(*reinterpret_cast<double*>(&l)) );

passes. In fact a lot of weird bit patterns are interpreted as nan.

Dr15Jones avatar Jun 05 '25 17:06 Dr15Jones

In fact a lot of weird bit patterns are interpreted as nan.

A NaN has exponent all 1s and the significand/mantissa non-zero (zero mantissa is infinity), so there are a lot of bit patterns that are interpreted as a NaN. It might be interesting to enable raising floating point exceptions on NaN to see where it comes from (it might also be useless if we generate a lot of them).

dan131riley avatar Jun 05 '25 19:06 dan131riley

So the calculation for the hits location is done here

https://github.com/cms-sw/cmssw/blob/31cba16d05c39a0f198f730a6522e1f3d14e7dc6/RecoPPS/Local/src/RPixClusterToHit.cc#L105-L112

where the used values are calculated here https://github.com/cms-sw/cmssw/blob/31cba16d05c39a0f198f730a6522e1f3d14e7dc6/RecoPPS/Local/src/RPixClusterToHit.cc#L92-L94

and https://github.com/cms-sw/cmssw/blob/31cba16d05c39a0f198f730a6522e1f3d14e7dc6/RecoPPS/Local/src/RPixClusterToHit.cc#L83-L87

with the values being passed into the routine https://github.com/cms-sw/cmssw/blob/31cba16d05c39a0f198f730a6522e1f3d14e7dc6/RecoPPS/Local/src/RPixClusterToHit.cc#L20-L22

one item of note, aCluster is a copy of the CTPPSPixelCluster passed into the routine.

Dr15Jones avatar Jun 05 '25 20:06 Dr15Jones

So assuming it is bad values going into the calculation of the LocalPoint (rather than some how the std::vector holding the hits had a problem) then the only place I could see that could have had a bad floating point value is in the PPSPixelTopology which presumably has had its values pulled during the earlier 77 events. So if it is a floating point value in PPSPixelTopology that went bad than something would have had to step on it (i.e. an out of bound memory write).

Dr15Jones avatar Jun 05 '25 21:06 Dr15Jones

So one test is at line 112 of RecoPPS/Local/src/RPixClusterToHit.cc where LocalPoint lp is made, are the values already bad? If so, which one? Then backtrack to see if the value is bad coming out of the topology call.

Dr15Jones avatar Jun 05 '25 21:06 Dr15Jones

One observation in favour of the hypothesis that the corruption happens earlier than the affected event:

  • if I do skipEvents = 77 and start processing from event 78, I do not seem to get any errors;
  • if I do skipEvents = 76 and start processing from event 77, I often (not always) get an assertion during event 78, when the DetSetVector<CTPPSPixelRecHit> is being put into the event:
    bool operator<(CTPPSPixelRecHit&, CTPPSPixelRecHit&): Assertion `edm::isFinite(a.point().mag2())' failed
    

However if I check and print the mag2() of all the rechits, I don't get any NaNs or seemingly invalid values. In fact, the values are the same when the job crashes and when it doesn't ... 🤷🏻

fwyzard avatar Jun 05 '25 23:06 fwyzard

Weird findings. Did you print mag2() before or after the insertion into the vector?

Dr15Jones avatar Jun 05 '25 23:06 Dr15Jones

Right before.

fwyzard avatar Jun 05 '25 23:06 fwyzard

So one test is at line 112 of RecoPPS/Local/src/RPixClusterToHit.cc where LocalPoint lp is made, are the values already bad? If so, which one? Then backtrack to see if the value is bad coming out of the topology call.

Apparently they aren't. Using this patch:

diff --git a/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h b/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h
index cc51dda9816..33270076def 100644
--- a/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h
+++ b/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h
@@ -11,6 +11,10 @@
 
 #include "DataFormats/GeometrySurface/interface/LocalError.h"
 #include "DataFormats/GeometryVector/interface/LocalPoint.h"
+#include "FWCore/Utilities/interface/isFinite.h"
+
+#include <iostream>
+#include <cstdlib>  // for std::abort
 
 // Reconstructed hits in CTPPS Pixel detector
 
@@ -67,6 +71,20 @@ private:
   unsigned int clusterSizeCol_;
 };
 
-inline bool operator<(CTPPSPixelRecHit& a, CTPPSPixelRecHit& b) { return (a.point().mag() < b.point().mag()); };
+inline bool operator<(CTPPSPixelRecHit& a, CTPPSPixelRecHit& b) {
+  //assert(edm::isFinite(a.point().mag2()));
+  //assert(edm::isFinite(b.point().mag2()));
+
+  if (!edm::isFinite(a.point().mag2())) {
+    std::cerr << "Non-finite a.point().mag2(): " << a.point().mag2() << std::endl;
+    std::abort();
+  }
+  if (!edm::isFinite(b.point().mag2())) {
+    std::cerr << "Non-finite b.point().mag2(): " << b.point().mag2() << std::endl;
+    std::abort();
+  }
+
+  return (a.point().mag() < b.point().mag());
+};
 
 #endif
diff --git a/RecoPPS/Local/src/RPixClusterToHit.cc b/RecoPPS/Local/src/RPixClusterToHit.cc
index 4171bbc8d5c..bb3981ba976 100644
--- a/RecoPPS/Local/src/RPixClusterToHit.cc
+++ b/RecoPPS/Local/src/RPixClusterToHit.cc
@@ -1,4 +1,5 @@
 #include "RecoPPS/Local/interface/RPixClusterToHit.h"
+#include "FWCore/Utilities/interface/isFinite.h"
 
 RPixClusterToHit::RPixClusterToHit(edm::ParameterSet const &conf) {
   verbosity_ = conf.getUntrackedParameter<int>("RPixVerbosity");
@@ -109,6 +110,9 @@ void RPixClusterToHit::make_hit(CTPPSPixelCluster aCluster,
   double varianceX = weightedVarianceX * invWeights * invWeights;
   double varianceY = weightedVarianceY * invWeights * invWeights;
 
+  assert(edm::isFinite(avgLocalX));
+  assert(edm::isFinite(avgLocalY));
+
   LocalPoint lp(avgLocalX, avgLocalY, 0);
   LocalError le(varianceX, 0, varianceY);
   if (verbosity_)

it gets aborted at event 78 with:

Non-finite a.point().mag2(): inf


A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

mmusich avatar Jun 06 '25 10:06 mmusich

type ctpps

mmusich avatar Jun 06 '25 10:06 mmusich

assign DataFormats/CTPPSReco

mmusich avatar Jun 06 '25 10:06 mmusich