[GPU_X] RelVals 29634.40x, 160.03502 failed with cudaErrorLaunchOutOfResources
In CMSSW_15_1_GPU_X_2025-05-29-2300, RelVals 29634.40x and 160.03502 failed with cudaErrorLaunchOutOfResources:
----- Begin Fatal Exception 30-May-2025 03:48:48 CEST-----------------------
An exception of category 'StdException' occurred while
[0] Processing Event run: 1 lumi: 8 event: 701 stream: 0
[1] Running path 'MC_Ele5_Open_Unseeded'
[2] Calling method for module SiPixelPhase2DigiToCluster@alpaka/'hltPhase2SiPixelClustersSoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc12/external/alpaka/1.2.0-23a2bf2e896b7aace8e772f289604b47/include/alpaka/mem/buf/uniformCudaHip/Copy.hpp(143) 'TApi::setDevice(m_iDstDevice)' A previous API call (not this one) set the error : 'cudaErrorLaunchOutOfResources': 'too many resources requested for launch'!
----- End Fatal Exception -------------------------------------------------
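For context on the failure mode: cudaErrorLaunchOutOfResources means a kernel launch requested more resources (registers, shared memory, or threads per block) than the device can provide. When the launch status is not checked right after the launch, the error is picked up by whichever runtime call happens next, which is why the log above reports it from alpaka's memcpy path with the note "A previous API call (not this one) set the error". Below is a minimal standalone CUDA sketch (hypothetical kernel, not CMSSW code) of querying a kernel's compiled resource needs and checking the launch status at the launch site:

```cuda
// Minimal sketch (not CMSSW code; the kernel and launch configuration are
// hypothetical). cudaErrorLaunchOutOfResources is set when a launch asks for
// more registers, shared memory, or threads per block than the device
// supports; unless checked immediately, it is reported by a later API call.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = 1.f;
}

int main() {
  const int n = 1024;
  float* d_out = nullptr;
  cudaMalloc(&d_out, n * sizeof(float));

  // Query the resource usage the compiler baked into the kernel.
  cudaFuncAttributes attr{};
  cudaFuncGetAttributes(&attr, dummyKernel);
  std::printf("regs/thread: %d, static smem: %zu B, max threads/block: %d\n",
              attr.numRegs, attr.sharedSizeBytes, attr.maxThreadsPerBlock);

  // If the block size exceeded attr.maxThreadsPerBlock (e.g. because high
  // register use lowers the limit on this GPU), the launch would fail with
  // cudaErrorLaunchOutOfResources.
  dummyKernel<<<(n + 255) / 256, 256>>>(d_out, n);

  // Check right after the launch; this pins the error to the kernel instead
  // of letting it surface in an unrelated call such as a later memcpy.
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    std::printf("launch failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  cudaDeviceSynchronize();
  cudaFree(d_out);
  return 0;
}
```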
Only PR merged since previous GPU_X IB is https://github.com/cms-sw/cmssw/pull/48191.
assign heterogeneous
New categories assigned: heterogeneous
@fwyzard, @makortel you have been requested to review this Pull request/Issue and eventually sign. Thanks.
A new Issue was created by @iarspider.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
> Only PR merged since previous GPU_X IB is #48191.
That should be unrelated to any alpaka workflows (those with .4xx).
Do you have a link to the logs with the failure?
Looking at the latest IBs, it appears that #47611 fixed this issue as part of its large code refactoring; see https://github.com/cms-sw/cmssw/issues/48460#issuecomment-3027898597.
The failures were last seen in CMSSW_15_1_X_2025-06-27-2300, where the jobs were run on a V100 (see e.g. log). All tests since then were run on an H100, so the IB tests alone are inconclusive.
Is there a reason to believe https://github.com/cms-sw/cmssw/pull/47611 should have fixed these errors?
> Is there a reason to believe #47611 should have fixed these errors?
All the tests I've run today were on a GPU development machine at P5 equipped with a single T4. With this setup, all IBs up to and including CMSSW_15_1_X_2025-07-01-1100 crash on any alpaka workflow running on GPU, while merging #47611 on top fixes the issue. Since the compute capability of a T4 (7.5) is quite close to that of a V100 (7.0), I assume that the refactoring and, ultimately, the removal of the offending kernel should also help in this case. However, I do not have access to any machine equipped with a V100 to test.
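For reference, the compute capability and the per-block limits that gate cudaErrorLaunchOutOfResources can be confirmed directly from the CUDA runtime; a minimal sketch (my own check, not part of the CMSSW tests) that prints them for every visible GPU:

```cuda
// Hedged sketch: list each device's compute capability and per-block
// resource limits. A T4 reports sm_75, a V100 sm_70, an H100 sm_90.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  cudaGetDeviceCount(&count);
  for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, i);
    std::printf("device %d: %s (sm_%d%d), %d regs/block, %zu B smem/block, %d threads/block\n",
                i, prop.name, prop.major, prop.minor,
                prop.regsPerBlock, prop.sharedMemPerBlock, prop.maxThreadsPerBlock);
  }
  return 0;
}
```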
Explicit tests on T4 sound sufficient to me. Thanks!
+heterogeneous
@cmsbuild, please close
This issue is fully signed and ready to be closed.