[AARCH64] RelVal 30834.0: Assertion 'iadd == 1' failed
In CMSSW_16_0_X_2025-12-02-1100, RelVal 30834.0 failed with an assertion failure:
cmsRun: src/L1Trigger/TrackFindingTracklet/src/InputRouter.cc:78: void trklet::InputRouter::execute(): Assertion `iadd == 1' failed.
<...>
Thread 6 (Thread 0x400089bed330 (LWP 2488290) "cmsRun"):
#7 0x000040003b08dc18 in __assert_fail () from /lib64/libc.so.6
#8 0x000040008656c680 in trklet::InputRouter::execute() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02918/el8_aarch64_gcc14/cms/cmssw/CMSSW_16_0_X_2025-12-02-1100/lib/el8_aarch64_gcc14/libL1TriggerTrackFindingTracklet.so
#9 0x000040008659ee70 in trklet::Sector::executeIR() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02918/el8_aarch64_gcc14/cms/cmssw/CMSSW_16_0_X_2025-12-02-1100/lib/el8_aarch64_gcc14/libL1TriggerTrackFindingTracklet.so
#10 0x000040008660e678 in trklet::TrackletEventProcessor::event(trklet::SLHCEvent&, std::vector<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::allocator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >&, std::vector<std::vector<trklet::StubStreamData, std::allocator<trklet::StubStreamData> >, std::allocator<std::vector<trklet::StubStreamData, std::allocator<trklet::StubStreamData> > > >&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02918/el8_aarch64_gcc14/cms/cmssw/CMSSW_16_0_X_2025-12-02-1100/lib/el8_aarch64_gcc14/libL1TriggerTrackFindingTracklet.so
#11 0x000040008648f514 in L1FPGATrackProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02918/el8_aarch64_gcc14/cms/cmssw/CMSSW_16_0_X_2025-12-02-1100/lib/el8_aarch64_gcc14/pluginTrackFindingTrackletPlugins.so
Full log: link
A new Issue was created by @iarspider.
@Dr15Jones, @ftenchini, @makortel, @mandrenguyen, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
assign from L1Trigger/TrackFindingTracklet
New categories assigned: l1
@BenjaminRS,@quinnanm you have been requested to review this Pull request/Issue and eventually sign? Thanks
FAO: @tomalin @zdemirag
Very strange. The assertion error at https://github.com/cms-sw/cmssw/blob/CMSSW_16_0_X_2025-12-02-1100/L1Trigger/TrackFindingTracklet/src/InputRouter.cc#L77 occurs in the prompt HYBRID tracking at event 10. If the bug is this frequent, it's surprising that it has not been seen before. The assertion implies a bug in the tracklet wiring map, with multiple InputLinkMemories incorrectly assigned to a single (phi region, tracker layer) pair.
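To illustrate what an `iadd == 1` check of this kind guards against, here is a minimal sketch. It is not the actual InputRouter code: `MemorySlot`, `routeStub` and `routeStubChecked` are hypothetical names, and the assumption is that each stub is offered to every memory in the wiring map and must be accepted by exactly one.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one wiring-map entry: a memory that accepts stubs
// from one tracker layer within an inclusive range of phi regions.
struct MemorySlot {
  unsigned int layer;
  unsigned int phiMin;
  unsigned int phiMax;
  std::vector<int> stubs;
};

// Offers one stub to every memory and returns how many accepted it.
// A consistent wiring map yields exactly 1. A result of 0 means no memory
// covers the (layer, phi region) pair; a result above 1 means entries overlap.
inline int routeStub(std::vector<MemorySlot>& memories, unsigned int layer, unsigned int phiRegion, int stub) {
  int iadd = 0;
  for (auto& mem : memories) {
    if (mem.layer != layer)
      continue;
    if (phiRegion < mem.phiMin || phiRegion > mem.phiMax)
      continue;
    mem.stubs.push_back(stub);
    ++iadd;
  }
  return iadd;
}

// Same routing, but aborts (in builds without NDEBUG) unless exactly one
// memory accepted the stub.
inline void routeStubChecked(std::vector<MemorySlot>& memories, unsigned int layer, unsigned int phiRegion, int stub) {
  const int iadd = routeStub(memories, layer, phiRegion, stub);
  assert(iadd == 1);
  (void)iadd;  // avoid an unused-variable warning when NDEBUG is defined
}
```

Note that in a sketch like this the check also fires when the count is 0, and that garbage in the layer or phi-region value (for example from uninitialized memory) could trigger it even if the wiring map itself is correct.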
What are the commands needed to reproduce this error?
In an attempt to reproduce the crash, I ran the displaced L1 tracking with these Linux commands:
- cmsrel CMSSW_16_0_X_2025-12-02-1100
- cd CMSSW_16_0_X_2025-12-02-1100/src/
- cmsenv; scram b -j; cd L1Trigger/TrackFindingTracklet/test
- (Edit L1TrackNtupleMaker_cfg.py to specify HYBRID_DISPLACED L1 track algorithm, MC dataset = /RelValTTbar_14TeV/CMSSW_16_0_0_pre2-PU_150X_mcRun4_realistic_v1_STD_Run4D110_PU-v1/GEN-SIM-DIGI-RAW)
- cmsRun L1TrackNtupleMaker_cfg.py
I see no crash after 100 events. I also ran with a different geometry sample, /RelValTTbar_14TeV/CMSSW_16_0_0_pre2-PU_150X_mcRun4_realistic_v1_STD_Run4D121_PU-v1/GEN-SIM-DIGI-RAW , with the same result.
What are the commands needed to reproduce this error?
In theory, runTheMatrix.py -l 30834.0 on ARM (step2 of the workflow). From a quick look at the ARM IBs, this assertion failure seems to be infrequent, as it does not occur every time.
Maybe worth running through valgrind?
There are some UBSAN hits in that neighborhood, of the form
src/L1Trigger/TrackFindingTracklet/interface/TripletEngineUnit.h:27:9: runtime error: load of value 224, which is not a valid value for type 'bool'
#0 0x14c29798d48e in trklet::TripletEngineUnit::TripletEngineUnit(trklet::TripletEngineUnit const&) src/L1Trigger/TrackFindingTracklet/interface/TripletEngineUnit.h:27
#1 0x14c29798dd85 in void std::_Construct<trklet::TripletEngineUnit, trklet::TripletEngineUnit const&>(trklet::TripletEngineUnit*, trklet::TripletEngineUnit const&) /data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc13/external/gcc/13.4.0-6908cfdf803923e783448096ca4f0923/include/c++/13.4.0/bits/stl_construct.h:119
#2 0x14c29798dd85 in trklet::TripletEngineUnit* std::__do_uninit_fill_n<trklet::TripletEngineUnit*, unsigned long, trklet::TripletEngineUnit>(trklet::TripletEngineUnit*, unsigned long, trklet::TripletEngineUnit const&) /data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc13/external/gcc/13.4.0-6908cfdf803923e783448096ca4f0923/include/c++/13.4.0/bits/stl_uninitialized.h:267
#3 0x14c29798dd85 in trklet::TripletEngineUnit* std::__uninitialized_fill_n<false>::__uninit_fill_n<trklet::TripletEngineUnit*, unsigned long, trklet::TripletEngineUnit>(trklet::TripletEngineUnit*, unsigned long, trklet::TripletEngineUnit const&) /data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc13/external/gcc/13.4.0-6908cfdf803923e783448096ca4f0923/include/c++/13.4.0/bits/stl_uninitialized.h:284
#4 0x14c29798dd85 in trklet::TripletEngineUnit* std::uninitialized_fill_n<trklet::TripletEngineUnit*, unsigned long, trklet::TripletEngineUnit>(trklet::TripletEngineUnit*, unsigned long, trklet::TripletEngineUnit const&) /data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc13/external/gcc/13.4.0-6908cfdf803923e783448096ca4f0923/include/c++/13.4.0/bits/stl_uninitialized.h:327
#5 0x14c29798dd85 in trklet::TripletEngineUnit* std::__uninitialized_fill_n_a<trklet::TripletEngineUnit*, unsigned long, trklet::TripletEngineUnit, trklet::TripletEngineUnit>(trklet::TripletEngineUnit*, unsigned long, trklet::TripletEngineUnit const&, std::allocator<trklet::TripletEngineUnit>&) /data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc13/external/gcc/13.4.0-6908cfdf803923e783448096ca4f0923/include/c++/13.4.0/bits/stl_uninitialized.h:472
#6 0x14c29798dd85 in std::vector<trklet::TripletEngineUnit, std::allocator<trklet::TripletEngineUnit> >::_M_fill_insert(__gnu_cxx::__normal_iterator<trklet::TripletEngineUnit*, std::vector<trklet::TripletEngineUnit, std::allocator<trklet::TripletEngineUnit> > >, unsigned long, trklet::TripletEngineUnit const&) /data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc13/external/gcc/13.4.0-6908cfdf803923e783448096ca4f0923/include/c++/13.4.0/bits/vector.tcc:592
#7 0x14c29796cbb0 in std::vector<trklet::TripletEngineUnit, std::allocator<trklet::TripletEngineUnit> >::resize(unsigned long, trklet::TripletEngineUnit const&) /data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc13/external/gcc/13.4.0-6908cfdf803923e783448096ca4f0923/include/c++/13.4.0/bits/stl_vector.h:1037
#8 0x14c29796cbb0 in trklet::TrackletProcessorDisplaced::execute(unsigned int, double, double) src/L1Trigger/TrackFindingTracklet/src/TrackletProcessorDisplaced.cc:171
#9 0x14c2973ee84a in trklet::Sector::executeTPD() src/L1Trigger/TrackFindingTracklet/src/Sector.cc:379
#10 0x14c297894e21 in trklet::TrackletEventProcessor::event(trklet::SLHCEvent&, std::vector<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::allocator<std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >&, std::vector<std::vector<trklet::StubStreamData, std::allocator<trklet::StubStreamData> >, std::allocator<std::vector<trklet::StubStreamData, std::allocator<trklet::StubStreamData> > > >&) src/L1Trigger/TrackFindingTracklet/src/TrackletEventProcessor.cc:335
#11 0x14c298a56070 in L1FPGATrackProducer::produce(edm::Event&, edm::EventSetup const&) src/L1Trigger/TrackFindingTracklet/plugins/L1FPGATrackProducer.cc:678
I think these are due to uninitialized members of the TripletEngineUnit class. Valgrind memcheck should confirm this. After confirming with valgrind, you might consider preemptively initializing all class members of fundamental type, something like
const Settings* settings_{nullptr};
unsigned int layerdisk1_{0U};
unsigned int layerdisk2_{0U};
unsigned int layerdisk3_{0U};
unsigned int iSeed_{0U};
bool nearfull_{false}; //initialized at start of each processing step
[...]
Uninitialized values are one of the chief causes of undefined behavior bugs.
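As a self-contained illustration of the suggestion above, the sketch below uses a hypothetical, reduced class (`EngineUnit`, not the real TripletEngineUnit) and a hypothetical helper `makeUnits`. It assumes the pattern seen in the UBSAN trace: a prototype object is copied many times by `std::vector::resize(n, value)`, so every member, including the `bool`s, must hold a valid value before the copy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, reduced stand-in for a class like TripletEngineUnit.
// Default member initializers give every member a defined value, whichever
// constructor is used.
struct EngineUnit {
  const void* settings_{nullptr};
  unsigned int layerdisk1_{0U};
  unsigned int iSeed_{0U};
  bool nearfull_{false};
  bool idle_{true};

  EngineUnit() = default;
  // Sets only iSeed_; all other members keep their default initializers.
  explicit EngineUnit(unsigned int seed) : iSeed_(seed) {}
};

// Builds n copies of one prototype. The copy constructor reads every member
// of the prototype, so an uninitialized bool here would be undefined behavior.
inline std::vector<EngineUnit> makeUnits(std::size_t n, unsigned int seed) {
  std::vector<EngineUnit> units;
  EngineUnit prototype(seed);
  units.resize(n, prototype);
  return units;
}
```

Without the initializers, `nearfull_` and `idle_` would hold indeterminate bytes after `EngineUnit(seed)`, and copying them is the kind of access that UBSAN reports as a load of an invalid `bool` value.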
"runTheMatrix.py -l 30834.0" runs without errors on my EL9 X86 computer. I logged into lxplus-arm to try it there, but unfortunately, this is ARM EL9 (=el9_6.aarch64), for which neither the daily releases /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/ nor the normal prereleases have pre-compiled CMSSW code.
We also have EL8 containers for ARM: start one with cmssw-el8, then run the subsequent commands in that subshell.
I've created a draft PR, https://github.com/cms-sw/cmssw/pull/49550, where I'm adding fixes for memory leaks and undefined variables in the L1 tracking code.
So far, I've addressed all bugs found by using VALGRIND with the prompt L1 tracking (algorithm "HYBRID") and displaced tracking (algorithm "HYBRID_DISPLACED"). I compiled the L1 track code using "-g -O0", and then executed:
valgrind --track-origins=yes --leak-check=full --show-leak-kinds=definite --suppressions=${ROOTSYS}/etc/valgrind-root.supp --suppressions=$CMSSW_RELEASE_BASE/src/Utilities/ReleaseScripts/data/cms-valgrind-memcheck.supp cmsRun L1TrackNtupleMaker_cfg.py >&! valgrind.out
Despite running on only 2 events of particle gun muons, this takes almost 2 hours to execute! It produces 10k lines of output reporting all the errors it found, only 5% of which are in the L1 tracking. Does anyone know how to speed valgrind up, or suppress its uninteresting output?
I wish to repeat this with the -fsanitize compiler option. Does anyone know how to compile CMSSW code with this? I tried unsuccessfully (tcsh):
setenv USER_CXXFLAGS "-g -O0 -fsanitize=address,undefined"
setenv USER_LDFLAGS "-fsanitize=address,undefined"
but scram complains that "ASan runtime does not come first in initial library list."
I wish to repeat this with the -fsanitize compiler option. Does anyone know how to compile CMSSW code with this?
It is easiest to use the ASAN and UBSAN IBs for the address and undefined-behavior sanitizers. In developer areas created from those IBs, the sanitizers get compiled in automatically. You can find the ASAN/UBSAN IBs either via the IB dashboard https://cmssdt.cern.ch/SDT/html/cmssdt-ib or with scram -a el8_amd64_gcc13 list CMSSW | grep SAN_X.