cmssw icon indicating copy to clipboard operation
cmssw copied to clipboard

Several paused jobs PromptReco_HIPhysicsRawPrime (JobKilled: WallClockTime) in CMSSW_15_1_0_patch3

Open aneliyaKF opened this issue 1 month ago • 15 comments

Dear everyone,

T0 has a couple of paused jobs, which are killed because they exceeded the wallClockTime limit. As of the time of this post, the paused jobs are in run 399490 (HIPhysicsRawPrime39) and in 399588 HIPhysicsRawPrime48.

The issue started during run 399490 in PromptReco with “PerformanceKill” (ExitCode:50664) that later progressed in “JobKilled” (ExitCode: 71304). Since then, all jobs with similar behaviour have been “JobKilled” (ExitCode: 71304).

For RUN 399490 - T0 have checked the log and also run the job interactively (cms-talk). They have observed that the process gets stuck for a long time at the following step before eventually being killed:

%MSG-w BasicTrajectoryState:  TrackProducer:detachedTripletStepTracks  21-Nov-2025 21:13:17 CET Run: 399490 Event: 461872988
curv error not pos-def
[   4.6623e+11  0.00120396    0.327502-5.63433e-06-0.000333395
    0.00120396    -441.504    -212.657  8.9075e-10 5.27075e-08
      0.327502    -212.657    -101.403 2.41768e-07 1.43059e-05
  -5.63433e-06  8.9075e-10 2.41768e-07  1.2469e-09 7.37816e-08
  -0.000333395 5.27075e-08 1.43059e-05 7.37816e-08 4.36581e-06 ]
pos/mom/mf  (0.883477,13.0037,-544.125)   (1.71248e-07,6.38628e-07,-1.88323e-05)   (-0.00251029,-0.036986,2.97803) 
%MSG
%MSG-w TrackNaN:  TrackProducer:detachedTripletStepTracks  21-Nov-2025 21:13:17 CET Run: 399490 Event: 461872988
Track has NaN or the cov is not pos-definite
%MSG
%MSG-e FailedPropagation:  TrackProducer:mergedDuplicateTracks  21-Nov-2025 21:13:24 CET Run: 399490 Event: 462393382
 too many iterations trying to reach plane 
%MSG
%MSG-w PrimaryVertexProducer:   PrimaryVertexProducer:unsortedOfflinePrimaryVertices 21-Nov-2025 21:13:28 CET  Run: 399490 Event: 461950752
More than 50% of candidate vertices lost (1 out of 6)
%MSG
%MSG-w PrimaryVertexProducer:   PrimaryVertexProducer:unsortedOfflinePrimaryVertices 21-Nov-2025 21:13:28 CET  Run: 399490 Event: 461950752
More than 50% of candidate vertices lost (1 out of 6)
%MSG
%MSG-w PrimaryVertexProducer:   PrimaryVertexProducer:unsortedOfflinePrimaryVertices 21-Nov-2025 21:13:52 CET  Run: 399490 Event: 461657642
More than 50% of candidate vertices lost (1 out of 8)
%MSG
%MSG-w PrimaryVertexProducer:   PrimaryVertexProducer:unsortedOfflinePrimaryVertices 21-Nov-2025 21:13:52 CET  Run: 399490 Event: 461657642
More than 50% of candidate vertices lost (1 out of 8)
%MSG

The tarball for For RUN 399490 can be found at:

/eos/user/c/cmst0/public/PausedJobs/HIRun2025A/wallClock/vocms0504.cern.ch-157370-0-log.tar.gz

that leads to an exit code 71304.

For RUN 399588 in PromptReco HIPhysicsRawPrime48 - one of the many similar outputs for this job when run interactively.

%MSG-w BasicTrajectoryState:  TrackProducer:detachedTripletStepTracks  22-Nov-2025 18:11:52 CET Run: 399588 Event: 431413561
curv error not pos-def
[  8.46611e+19 1.01773e-05   -0.672268-2.18081e-06  -0.0173888
   1.01773e-05-2.46123e+08    -38276.2-1.01603e-13-8.05577e-10
     -0.672268    -38276.2     3950.32 6.69938e-09 5.31148e-05
  -2.18081e-06-1.01603e-13 6.69938e-09 9.60888e-14 7.73136e-10
    -0.0173888-8.05577e-10 5.31148e-05 7.73136e-10 6.22094e-06 ]
pos/mom/mf  (-0.387042,2.54153,11.9383)   (0.000395859,0.000114926,-0.000303374)   (-8.74643e-06,5.74339e-05,3.811)
%MSG
%MSG-w BasicTrajectoryState:  TrackProducer:detachedTripletStepTracks  22-Nov-2025 18:11:52 CET Run: 399588 Event: 431413561
curv error not pos-def
[  8.46611e+19 1.01773e-05   -0.672268-2.18081e-06  -0.0173888
   1.01773e-05-2.46123e+08    -38276.2-1.01603e-13-8.05577e-10
     -0.672268    -38276.2     3950.32 6.69938e-09 5.31148e-05
  -2.18081e-06-1.01603e-13 6.69938e-09 9.60888e-14 7.73136e-10
    -0.0173888-8.05577e-10 5.31148e-05 7.73136e-10 6.22094e-06 ]
pos/mom/mf  (-0.387042,2.54153,11.9383)   (0.000395859,0.000114926,-0.000303374)   (-8.74643e-06,5.74339e-05,3.811)
%MSG
%MSG-w BasicTrajectoryState:  TrackProducer:detachedTripletStepTracks  22-Nov-2025 18:11:52 CET Run: 399588 Event: 431413561
BasicTrajectoryState: attempt to access errors when none available  accessing local error..
freestate pointer: parameters
x =          -nan         -nan         -nan
p =          -nan         -nan         -nan
no error defined.

local error valid/values :0
[         -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan ]
%MSG
%MSG-w BasicTrajectoryState:  TrackProducer:detachedTripletStepTracks  22-Nov-2025 18:11:52 CET Run: 399588 Event: 431413561
BasicTrajectoryState: attempt to access errors when none available  accessing local error..
freestate pointer: parameters
x =          -nan         -nan         -nan
p =           nan          nan          nan
no error defined.

local error valid/values :0
[         -nan        -nan        -nan         nan         nan
          -nan        -nan        -nan         nan         nan
          -nan        -nan         nan         nan         nan
           nan         nan         nan         nan         nan
           nan         nan         nan         nan        -nan ]
%MSG

Kind regards, Aneliya (ORM)

aneliyaKF avatar Nov 22 '25 12:11 aneliyaKF

cms-bot internal usage

cmsbuild avatar Nov 22 '25 12:11 cmsbuild

A new Issue was created by @aneliyaKF.

@Dr15Jones, @ftenchini, @makortel, @mandrenguyen, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Nov 22 '25 12:11 cmsbuild

A similar error was also observed in 1 paused job during run 399655, in PromptReco HIPhysicsRawPrime (51 with "Fatal Exception" (ExitCode: 8021).

%MSG-w BasicTrajectoryState:  TrackProducer:detachedTripletStepTracks  22-Nov-2025 19:13:58 CET Run: 399655 Event: 252587784
local error not pos-def
[      -35.375   -0.990479   -0.132111-0.000319362-0.000348806
     -0.990479  -0.0117493 -0.00463486-2.12342e-06-8.21054e-06
     -0.132111 -0.00463486  0.00757408 -2.6077e-08 4.47035e-06
  -0.000319362-2.12342e-06 -2.6077e-08 3.30137e-06-1.09371e-10
  -0.000348806-8.21054e-06 4.47035e-06-1.09371e-10 3.06887e-06 ]
pos/mom/mf  (0.979133,2.97937,-9.53805)   (0.00632766,0.000876588,-0.000835602)   (-2.46465e-05,-7.49959e-05,3.81095)
%MSG
%MSG-w BasicTrajectoryState:  TrackProducer:detachedTripletStepTracks  22-Nov-2025 19:13:58 CET Run: 399655 Event: 252587784
curv error not pos-def
[      -35.375  0.00594972     0.12615 0.000113238 0.000309221
    0.00594972  0.00105278-0.000151185 2.18029e-08  1.8862e-06
       0.12615-0.000151185-0.000181092-2.03397e-06-1.06495e-06
   0.000113238 2.18029e-08-2.03397e-06  4.2317e-07-1.46018e-07
   0.000309221  1.8862e-06-1.06495e-06-1.46018e-07 3.06504e-06 ]
pos/mom/mf  (0.979133,2.97937,-9.53805)   (0.00632766,0.000876588,-0.000835602)   (-2.46465e-05,-7.49959e-05,3.81095)
%MSG
%MSG-w TrackNaN:  TrackProducer:detachedTripletStepTracks  22-Nov-2025 19:13:58 CET Run: 399655 Event: 252587784
Track has NaN or the cov is not pos-definite
%MSG
%MSG-w BasicTrajectoryState:  TrackProducer:detachedTripletStepTracks  22-Nov-2025 19:13:58 CET Run: 399655 Event: 252587784
curv error not pos-def
[      88.3073  0.00707775 9.88759e-06-2.21239e-11-1.77576e-08
    0.00707775-0.000441852 2.82613e-06 3.54845e-15 2.88961e-12
   9.88759e-06 2.82613e-06 5.38871e-05-8.43347e-13-7.24804e-10
  -2.21239e-11 3.54845e-15-8.43347e-13 4.53265e-12 3.70162e-09
  -1.77576e-08 2.88961e-12-7.24804e-10 3.70162e-09 3.02428e-06 ]
pos/mom/mf  (-0.0799883,2.62328,-12.8177)   (10.5548,3.04969,-27.2065)   (2.5853e-06,-8.47869e-05,3.81077)
%MSG
%MSG-w TrackNaN:  TrackProducer:detachedTripletStepTracks  22-Nov-2025 19:13:58 CET Run: 399655 Event: 252587784
Track has NaN or the cov is not pos-definite

aneliyaKF avatar Nov 24 '25 07:11 aneliyaKF

assign reconstruction

vlimant avatar Nov 24 '25 12:11 vlimant

New categories assigned: reconstruction

@jfernan2,@mandrenguyen,@srimanob you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Nov 24 '25 12:11 cmsbuild

type tracking

vlimant avatar Nov 24 '25 12:11 vlimant

Dear everyone, We have similar errors in run 399710 in PromptReco HIPhysicsRawPrime58, with PerformanceKill , exit code 50664 , and in run 399593 in PromptReco HIPhysicsRawPrime22, with error JobKilled , exit code 71304 . The tarball for job 399710 is in /eos/cms/tier0/store/unmerged/hidata/logs/prod/2025/11/25/PromptReco_Run399710_HIPhysicsRawPrime58/Reco/0000/0/03fa7101-4ac8-4997-9b90-b347ff2e9fb3-77-0-logArchive.tar.gz

Kind regards, Federica (ORM)

friti avatar Nov 25 '25 10:11 friti

Hello @friti @mandrenguyen I have placed all the relevant long job tarballs in

/eos/home-c/cmst0/public/PausedJobs/HIRun2025A/wallClock

LinaresToine avatar Nov 27 '25 14:11 LinaresToine

assign core I am able to reproduce the error when I start processing from the beginning of the file. As soon as I use skipEvents to get anywhere close though, the job seems to complete successfully. The job really seems simply to hang, but I couldn't figure out how to debug further. Core folks, maybe you can recommend some debugging tools to isolate the issue (hence the "assign")? Thanks!

mandrenguyen avatar Dec 01 '25 15:12 mandrenguyen

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Dec 01 '25 15:12 cmsbuild

Adding Tracer service to the job

process.add_(cms.Service("Tracer"))

will show you exactly what module is being run when it gets stuck (or slow).

makortel avatar Dec 01 '25 15:12 makortel

Run it under gdb. When it gets stuck, interrupt with a control-c and do "thread apply all bt".

dan131riley avatar Dec 01 '25 16:12 dan131riley

Adding Tracer service to the job

process.add_(cms.Service("Tracer"))

will show you exactly what module is being run when it gets stuck (or slow).

I added the tracer. The job eventually finished so I was not really able to see where it slowed down (if it did indeed slow down). Previously, I had seen that interactively if I try to skip events to get close the event where the job died in the log, I don't observe any issue. If I start from scratch I do see the job get stuck in the same place as in the provided log. The tracer gives so much screen output that it's a bit tough to know in real time when you're close to that point.

mandrenguyen avatar Dec 06 '25 08:12 mandrenguyen

If the job finished, I'd expect to see an excessive time warning saying that a module took too long (if that's enabled). Also, if you're running these jobs multi-threaded, try seeing if it reproduces running single-threaded. In a multi-threaded job you really have no idea what event it gets stuck on, as the other streams will continue processing until they wait for completion at the end of the job/run--so it could be the first event in the file that has a problem. If it replicates single-threaded, that will pin down which event and module is getting hung up.

dan131riley avatar Dec 06 '25 13:12 dan131riley

There is another option, you can tell the Tracer service to write a file rather than output to the log. Then you can use the tracer viewer https://github.com/cms-sw/cmssw/tree/CMSSW_15_1_X/FWCore/Services/web

to visualize the various timings happening in the job.

Dr15Jones avatar Dec 06 '25 14:12 Dr15Jones