
[PY312/nVidia T4] Multiple RelVals failed with cudaErrorLaunchOutOfResources

Open iarspider opened this issue 1 month ago • 7 comments

The following RelVals failed in CMSSW_16_0_PY312_X_2025-12-08-2300 on an nVidia T4: 160.03502, 17034.402, 17034.403, 17034.406, 17034.412, 17034.422, 17034.423, 18434.402, 18434.403, 18434.404, 18434.406, 18434.407, 18434.408, 18434.412, 18434.413, 18434.422, 18434.423, 18434.424, 18450.402, 18450.403, 18450.404, 18450.406, 18450.407, 18450.408, 18461.402, 18634.402, 18634.403, 18634.404, 18634.406, 18634.407, 18634.408, 18634.412, 18634.413, 18634.422, 18634.423, 18634.424, 18650.402, 18650.403, 18650.404, 18650.406, 18650.407, 18650.408, 18661.402

Example stack traces:

  • RelVal 160.03502
----- Begin Fatal Exception 09-Dec-2025 07:56:51 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 3
   [1] Running path 'dqmofflineOnPAT_1_step'
   [2] Prefetching for module SingleTopTChannelLeptonDQM_miniAOD/'singleTopElectronMediumDQM_miniAOD'
   [3] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [4] Prefetching for module PATMuonSelector/'selectedPatMuons'
   [5] Prefetching for module PATMuonProducer/'patMuons'
   [6] Prefetching for module MuonProducer/'muons'
   [7] Prefetching for module PFProducer/'particleFlowTmp'
   [8] Prefetching for module PFBlockProducer/'particleFlowBlock'
   [9] Prefetching for module PFElecTkProducer/'pfTrackElec'
   [10] Prefetching for module GsfTrackProducer/'electronGsfTracks'
   [11] Prefetching for module CkfTrackCandidateMaker/'electronCkfTrackCandidates'
   [12] Prefetching for module ElectronSeedMerger/'electronMergedSeeds'
   [13] Prefetching for module GoodSeedProducer/'trackerDrivenElectronSeeds'
   [14] Prefetching for module PFMultiDepthClusterProducer/'particleFlowClusterHCAL'
   [15] Prefetching for module LegacyPFClusterProducer/'legacyPFClusterProducer'
   [16] Prefetching for module PFClusterSoAProducer@alpaka/'pfClusterSoAProducer'
   [17] Calling method for module PFClusterSoAProducer@alpaka/'pfClusterSoAProducer'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc13/external/alpaka/2.0.0-8493f1d11d0378dc14d6ea6ecfc69ac5/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp(275) 'TApi::setDevice(queue.m_spQueueImpl->m_dev.getNativeHandle())' A previous API call (not this one) set the error  : 'cudaErrorLaunchOutOfResources': 'too many resources requested for launch'!
----- End Fatal Exception -------------------------------------------------
  • Other RelVals:
----- Begin Fatal Exception 09-Dec-2025 07:51:55 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 4 event: 304 stream: 0
   [1] Running path 'DQM_HcalReconstruction_v11'
   [2] Calling method for module PFClusterSoAProducer@alpaka/'hltParticleFlowClusterHBHESoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc13/external/alpaka/2.0.0-8493f1d11d0378dc14d6ea6ecfc69ac5/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp(275) 'TApi::setDevice(queue.m_spQueueImpl->m_dev.getNativeHandle())' A previous API call (not this one) set the error  : 'cudaErrorLaunchOutOfResources': 'too many resources requested for launch'!
----- End Fatal Exception -------------------------------------------------
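
Both traces above end in a PFClusterSoAProducer@alpaka instance and report the same CUDA error. To check whether that holds across all of the listed workflows, something like the following could tally the failing module and error per log. This is only a sketch: the regular expressions are inferred from the two excerpts above, the function names (`split_blocks`, `summarize_fatal_exception`, `count_failures`) are made up for illustration, and the alpaka source path in the sample is abbreviated.

```python
import re
from collections import Counter

# Innermost frame of the form: Calling method for module <Type>/'<label>'
MODULE_RE = re.compile(r"Calling method for module (\S+)/'([^']+)'")
# CUDA error name as quoted in the alpaka message, e.g. 'cudaErrorLaunchOutOfResources'
CUDA_ERROR_RE = re.compile(r"'(cudaError\w+)'")
# Everything between the Begin/End Fatal Exception markers
BLOCK_RE = re.compile(
    r"----- Begin Fatal Exception.*?----- End Fatal Exception", re.DOTALL
)


def split_blocks(log_text):
    """Return the fatal-exception blocks found in a log, as a list of strings."""
    return BLOCK_RE.findall(log_text)


def summarize_fatal_exception(block):
    """Return (module_type, module_label, cuda_error); missing fields are None."""
    module = MODULE_RE.search(block)
    error = CUDA_ERROR_RE.search(block)
    return (
        module.group(1) if module else None,
        module.group(2) if module else None,
        error.group(1) if error else None,
    )


def count_failures(log_text):
    """Count how often each (module_type, module_label, cuda_error) triple occurs."""
    return Counter(summarize_fatal_exception(b) for b in split_blocks(log_text))


# Sample taken from the second trace above; the alpaka file path is abbreviated.
SAMPLE = """\
----- Begin Fatal Exception 09-Dec-2025 07:51:55 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 4 event: 304 stream: 0
   [1] Running path 'DQM_HcalReconstruction_v11'
   [2] Calling method for module PFClusterSoAProducer@alpaka/'hltParticleFlowClusterHBHESoA'
Exception Message:
A std::exception was thrown.
(path abbreviated) A previous API call (not this one) set the error  : 'cudaErrorLaunchOutOfResources': 'too many resources requested for launch'!
----- End Fatal Exception -------------------------------------------------
"""

if __name__ == "__main__":
    for key, n in count_failures(SAMPLE).items():
        print(n, key)
```

Running `count_failures` over the concatenated logs of the failing workflows would show at a glance whether any module other than PFClusterSoAProducer@alpaka is involved.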

iarspider avatar Dec 09 '25 09:12 iarspider

cms-bot internal usage

cmsbuild avatar Dec 09 '25 09:12 cmsbuild

A new Issue was created by @iarspider.

@Dr15Jones, @ftenchini, @makortel, @mandrenguyen, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Dec 09 '25 09:12 cmsbuild

assign heterogeneous

makortel avatar Dec 09 '25 15:12 makortel

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Dec 09 '25 15:12 cmsbuild

What is different in this build compared with the regular ones?

fwyzard avatar Dec 09 '25 15:12 fwyzard

Python version (3.12)

iarspider avatar Dec 09 '25 15:12 iarspider

How does that affect C++ and CUDA code?

fwyzard avatar Dec 09 '25 16:12 fwyzard