[GPU_X] RelVals 29634.40x, 160.03502 failed with cudaErrorLaunchOutOfResources
In CMSSW_15_1_GPU_X_2025-05-29-2300, RelVals 29634.40x and 160.03502 failed with cudaErrorLaunchOutOfResources:
----- Begin Fatal Exception 30-May-2025 03:48:48 CEST-----------------------
An exception of category 'StdException' occurred while
[0] Processing Event run: 1 lumi: 8 event: 701 stream: 0
[1] Running path 'MC_Ele5_Open_Unseeded'
[2] Calling method for module SiPixelPhase2DigiToCluster@alpaka/'hltPhase2SiPixelClustersSoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc12/external/alpaka/1.2.0-23a2bf2e896b7aace8e772f289604b47/include/alpaka/mem/buf/uniformCudaHip/Copy.hpp(143) 'TApi::setDevice(m_iDstDevice)' A previous API call (not this one) set the error : 'cudaErrorLaunchOutOfResources': 'too many resources requested for launch'!
----- End Fatal Exception -------------------------------------------------
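For context on the failure mode: cudaErrorLaunchOutOfResources means a kernel launch requested more resources (registers, shared memory, or threads per block) than the device can provide. When the launch status is not checked right after the launch, the error is picked up by whichever runtime call happens next, which is why the log above reports it from alpaka's memcpy path with the note "A previous API call (not this one) set the error". Below is a minimal standalone CUDA sketch (hypothetical kernel, not CMSSW code) of querying a kernel's compiled resource needs and checking the launch status at the launch site:

```cuda
// Minimal sketch (not CMSSW code; the kernel and launch configuration are
// hypothetical). cudaErrorLaunchOutOfResources is set when a launch asks for
// more registers, shared memory, or threads per block than the device
// supports; unless checked immediately, it is reported by a later API call.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = 1.f;
}

int main() {
  const int n = 1024;
  float* d_out = nullptr;
  cudaMalloc(&d_out, n * sizeof(float));

  // Query the resource usage the compiler baked into the kernel.
  cudaFuncAttributes attr{};
  cudaFuncGetAttributes(&attr, dummyKernel);
  std::printf("regs/thread: %d, static smem: %zu B, max threads/block: %d\n",
              attr.numRegs, attr.sharedSizeBytes, attr.maxThreadsPerBlock);

  // If the block size exceeded attr.maxThreadsPerBlock (e.g. because high
  // register use lowers the limit on this GPU), the launch would fail with
  // cudaErrorLaunchOutOfResources.
  dummyKernel<<<(n + 255) / 256, 256>>>(d_out, n);

  // Check right after the launch; this pins the error to the kernel instead
  // of letting it surface in an unrelated call such as a later memcpy.
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    std::printf("launch failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  cudaDeviceSynchronize();
  cudaFree(d_out);
  return 0;
}
```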
Only PR merged since previous GPU_X IB is https://github.com/cms-sw/cmssw/pull/48191.
assign heterogeneous
New categories assigned: heterogeneous
@fwyzard, @makortel you have been requested to review this Pull request/Issue and eventually sign. Thanks.
A new Issue was created by @iarspider.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
> Only PR merged since previous GPU_X IB is #48191.
That should be unrelated to any alpaka workflows (those with .4xx).
Do you have a link to the logs with the failure?
Looking at the latest IBs, it appears that #47611 fixed this issue as part of its large code refactoring; see https://github.com/cms-sw/cmssw/issues/48460#issuecomment-3027898597.
The failures were last seen in CMSSW_15_1_X_2025-06-27-2300, where the jobs were run on a V100 (see e.g. log). All tests since then were run on an H100, so the IB tests alone are inconclusive.
Is there a reason to believe https://github.com/cms-sw/cmssw/pull/47611 should have fixed these errors?
> Is there a reason to believe #47611 should have fixed these errors?
All the tests I've run today were on a GPU development machine at P5 equipped with a single T4. With this setup, all IBs up to and including CMSSW_15_1_X_2025-07-01-1100 crash on any alpaka workflow running on GPU, while merging #47611 on top fixes the issue. Since the compute capability of a T4 (7.5) is quite close to that of a V100 (7.0), I assume that the refactoring and, ultimately, the removal of the offending kernel should also help in this case. However, I do not have access to any machine equipped with a V100 to test.
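For reference, the compute capability and the per-block limits that gate cudaErrorLaunchOutOfResources can be confirmed directly from the CUDA runtime; a minimal sketch (my own check, not part of the CMSSW tests) that prints them for every visible GPU:

```cuda
// Hedged sketch: list each device's compute capability and per-block
// resource limits. A T4 reports sm_75, a V100 sm_70, an H100 sm_90.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  cudaGetDeviceCount(&count);
  for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, i);
    std::printf("device %d: %s (sm_%d%d), %d regs/block, %zu B smem/block, %d threads/block\n",
                i, prop.name, prop.major, prop.minor,
                prop.regsPerBlock, prop.sharedMemPerBlock, prop.maxThreadsPerBlock);
  }
  return 0;
}
```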
Explicit tests on T4 sound sufficient to me. Thanks!
+heterogeneous
@cmsbuild, please close
This issue is fully signed and ready to be closed.