HLT crashes in Run 382461
Crashes were observed during collisions in Run 382461. Error message:
----- Begin Fatal Exception 26-Jun-2024 14:33:41 CEST-----------------------
An exception of category 'StdException' occurred while
[0] Processing Event run: 382461 lumi: 2 event: 4698821 stream: 0
[1] Running path 'DQM_EcalReconstruction_v10'
[2] Calling method for module EcalUncalibRecHitProducerPortable@alpaka/'hltEcalUncalibRecHitSoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_9_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_9_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp(259)
'TApi::setDevice(queue.m_spQueueImpl->m_dev.getNativeHandle())'
A previous API call (not this one) set the error : 'cudaErrorInvalidConfiguration': 'invalid configuration argument'!
----- End Fatal Exception -------------------------------------------------
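For context (not part of the original report): with the CUDA runtime, a kernel launched with an invalid configuration does not fail visibly at the launch site; the error is recorded and only shows up at a later runtime-API check, which is why alpaka reports it on the subsequent TApi::setDevice(...) call as "a previous API call (not this one)". A minimal standalone sketch of that behaviour, assuming a plain CUDA toolchain and illustrative names unrelated to CMSSW:

// Minimal sketch (illustrative, not CMSSW/alpaka code): a kernel launched
// with an invalid configuration fails silently at the launch site; the
// error is only seen by a later CUDA runtime API check.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noopKernel() {}

int main() {
  // Zero blocks is an invalid launch configuration; the kernel never runs,
  // but the launch statement itself does not report the failure.
  noopKernel<<<0, 32>>>();

  // The stored error surfaces at the next check, analogous to the alpaka
  // wrapper catching it on TApi::setDevice(...) in the stack trace above.
  cudaError_t err = cudaGetLastError();
  std::printf("deferred error: %s\n", cudaGetErrorString(err));  // cudaErrorInvalidConfiguration
  return 0;
}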
Reproducer:
#!/bin/bash -ex
# CMSSW_14_0_9_patch1_MULTIARCHS
hltGetConfiguration run:382461 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input \
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000928.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000929.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000930.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000931.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000932.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000933.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000934.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000935.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000936.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000937.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000938.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000939.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000940.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000941.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000942.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000943.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000944.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000945.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000946.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000947.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000948.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000949.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000950.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000951.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000952.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000953.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000954.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000955.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000956.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000957.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000958.root,\
/store/group/tsg/FOG/error_stream_root/run382461/run382461_ls0002_index000959.root > hlt.py
cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
cmsRun hlt.py &> hlt.log
Note that this run has no ECAL barrel in the readout, only part of the endcap. @fwyzard has pointed out that this is probably related: the protection we implemented for empty ECAL events checked only the total size, but there is one kernel that is barrel-only.
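To make the suspected mechanism concrete, here is a minimal sketch of the guard described above, with hypothetical names (nTotalDigis, nBarrelDigis, launchBarrelKernel) that are not the actual identifiers in EcalUncalibRecHitProducerPortable: the existing protection only skipped the work when the whole ECAL payload was empty, while a barrel-only kernel also needs to be skipped when the barrel alone is empty.

// Hypothetical sketch of the guard logic; names are illustrative and do not
// match the real CMSSW code.
#include <cstdint>
#include <cstdio>

// stand-in for the barrel-only alpaka kernel launch
void launchBarrelKernel(uint32_t nBarrelDigis) {
  std::printf("launching barrel kernel for %u digis\n", nBarrelDigis);
}

void produceUncalibRecHits(uint32_t nTotalDigis, uint32_t nBarrelDigis) {
  if (nTotalDigis == 0)
    return;  // existing protection: fully empty ECAL event, nothing to do

  // Run 382461 had no barrel in the readout, so nBarrelDigis == 0 while
  // nTotalDigis > 0: launching a barrel-only kernel with zero items gives a
  // zero-block work division, i.e. the 'invalid configuration argument' above.
  if (nBarrelDigis > 0)
    launchBarrelKernel(nBarrelDigis);

  // ... endcap kernels would be guarded analogously on the endcap size ...
}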
Best regards, Thiago (for FOG)
A new Issue was created by @trtomei.
@Dr15Jones, @antoniovilela, @makortel, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign hlt, reconstruction, heterogeneous
type ecal
@cms-sw/ecal-dpg-l2 FYI
New categories assigned: hlt, reconstruction, heterogeneous
@Martin-Grunewald, @mmusich, @fwyzard, @jfernan2, @makortel, @mandrenguyen you have been requested to review this Issue and eventually sign it. Thanks.
Should be fixed by #45311 (14.1.x) / #45313 (14.0.x) / #45314 (14.0.9-patchX).
FWIW, I confirm that the following recipe:
cmsrel CMSSW_14_0_9_patch1_MULTIARCHS
cd CMSSW_14_0_9_patch1_MULTIARCHS/src/
git cms-init
cmsenv
git cms-addpkg RecoLocalCalo/EcalRecProducers
git remote add fwyzard git@github.com:fwyzard/cmssw.git; git fetch fwyzard
git cherry-pick d0f844fb548ac5bd7f8ee6b5daa6476809cb4033
scram b -j 20
when run against the reproducer at https://github.com/cms-sw/cmssw/issues/45312#issue-2375234733, leads to no crashes.
All proposed solutions have been merged:
- https://github.com/cms-sw/cmssw/pull/45311 (master branch)
- https://github.com/cms-sw/cmssw/pull/45313 (14.0.X branch)
- https://github.com/cms-sw/cmssw/pull/45314 (14.0.9_patchX branch)
+hlt
- confirmed that the fix is effective by testing on the affected data: https://github.com/cms-sw/cmssw/issues/45312#issuecomment-2191661140
- all solution PRs are merged; the fix will be available online as of CMSSW_14_0_9_patch2
+1
+heterogeneous
@cmsbuild, please close
This issue is fully signed and ready to be closed.