Several paused jobs PromptReco_HIPhysicsRawPrime (JobKilled: WallClockTime) in CMSSW_15_1_0_patch3
Dear everyone,
T0 has a couple of paused jobs, which are killed because they exceeded the wallClockTime limit. As of the time of this post, the paused jobs are in run 399490 (HIPhysicsRawPrime39) and in 399588 HIPhysicsRawPrime48.
The issue started during run 399490 in PromptReco with “PerformanceKill” (ExitCode:50664) that later progressed in “JobKilled” (ExitCode: 71304). Since then, all jobs with similar behaviour have been “JobKilled” (ExitCode: 71304).
For RUN 399490 - T0 have checked the log and also run the job interactively (cms-talk). They have observed that the process gets stuck for a long time at the following step before eventually being killed:
%MSG-w BasicTrajectoryState: TrackProducer:detachedTripletStepTracks 21-Nov-2025 21:13:17 CET Run: 399490 Event: 461872988
curv error not pos-def
[ 4.6623e+11 0.00120396 0.327502-5.63433e-06-0.000333395
0.00120396 -441.504 -212.657 8.9075e-10 5.27075e-08
0.327502 -212.657 -101.403 2.41768e-07 1.43059e-05
-5.63433e-06 8.9075e-10 2.41768e-07 1.2469e-09 7.37816e-08
-0.000333395 5.27075e-08 1.43059e-05 7.37816e-08 4.36581e-06 ]
pos/mom/mf (0.883477,13.0037,-544.125) (1.71248e-07,6.38628e-07,-1.88323e-05) (-0.00251029,-0.036986,2.97803)
%MSG
%MSG-w TrackNaN: TrackProducer:detachedTripletStepTracks 21-Nov-2025 21:13:17 CET Run: 399490 Event: 461872988
Track has NaN or the cov is not pos-definite
%MSG
%MSG-e FailedPropagation: TrackProducer:mergedDuplicateTracks 21-Nov-2025 21:13:24 CET Run: 399490 Event: 462393382
too many iterations trying to reach plane
%MSG
%MSG-w PrimaryVertexProducer: PrimaryVertexProducer:unsortedOfflinePrimaryVertices 21-Nov-2025 21:13:28 CET Run: 399490 Event: 461950752
More than 50% of candidate vertices lost (1 out of 6)
%MSG
%MSG-w PrimaryVertexProducer: PrimaryVertexProducer:unsortedOfflinePrimaryVertices 21-Nov-2025 21:13:28 CET Run: 399490 Event: 461950752
More than 50% of candidate vertices lost (1 out of 6)
%MSG
%MSG-w PrimaryVertexProducer: PrimaryVertexProducer:unsortedOfflinePrimaryVertices 21-Nov-2025 21:13:52 CET Run: 399490 Event: 461657642
More than 50% of candidate vertices lost (1 out of 8)
%MSG
%MSG-w PrimaryVertexProducer: PrimaryVertexProducer:unsortedOfflinePrimaryVertices 21-Nov-2025 21:13:52 CET Run: 399490 Event: 461657642
More than 50% of candidate vertices lost (1 out of 8)
%MSG
The tarball for For RUN 399490 can be found at:
/eos/user/c/cmst0/public/PausedJobs/HIRun2025A/wallClock/vocms0504.cern.ch-157370-0-log.tar.gz
that leads to an exit code 71304.
For RUN 399588 in PromptReco HIPhysicsRawPrime48 - one of the many similar outputs for this job when run interactively.
%MSG-w BasicTrajectoryState: TrackProducer:detachedTripletStepTracks 22-Nov-2025 18:11:52 CET Run: 399588 Event: 431413561
curv error not pos-def
[ 8.46611e+19 1.01773e-05 -0.672268-2.18081e-06 -0.0173888
1.01773e-05-2.46123e+08 -38276.2-1.01603e-13-8.05577e-10
-0.672268 -38276.2 3950.32 6.69938e-09 5.31148e-05
-2.18081e-06-1.01603e-13 6.69938e-09 9.60888e-14 7.73136e-10
-0.0173888-8.05577e-10 5.31148e-05 7.73136e-10 6.22094e-06 ]
pos/mom/mf (-0.387042,2.54153,11.9383) (0.000395859,0.000114926,-0.000303374) (-8.74643e-06,5.74339e-05,3.811)
%MSG
%MSG-w BasicTrajectoryState: TrackProducer:detachedTripletStepTracks 22-Nov-2025 18:11:52 CET Run: 399588 Event: 431413561
curv error not pos-def
[ 8.46611e+19 1.01773e-05 -0.672268-2.18081e-06 -0.0173888
1.01773e-05-2.46123e+08 -38276.2-1.01603e-13-8.05577e-10
-0.672268 -38276.2 3950.32 6.69938e-09 5.31148e-05
-2.18081e-06-1.01603e-13 6.69938e-09 9.60888e-14 7.73136e-10
-0.0173888-8.05577e-10 5.31148e-05 7.73136e-10 6.22094e-06 ]
pos/mom/mf (-0.387042,2.54153,11.9383) (0.000395859,0.000114926,-0.000303374) (-8.74643e-06,5.74339e-05,3.811)
%MSG
%MSG-w BasicTrajectoryState: TrackProducer:detachedTripletStepTracks 22-Nov-2025 18:11:52 CET Run: 399588 Event: 431413561
BasicTrajectoryState: attempt to access errors when none available accessing local error..
freestate pointer: parameters
x = -nan -nan -nan
p = -nan -nan -nan
no error defined.
local error valid/values :0
[ -nan -nan -nan -nan -nan
-nan -nan -nan -nan -nan
-nan -nan -nan -nan -nan
-nan -nan -nan -nan -nan
-nan -nan -nan -nan -nan ]
%MSG
%MSG-w BasicTrajectoryState: TrackProducer:detachedTripletStepTracks 22-Nov-2025 18:11:52 CET Run: 399588 Event: 431413561
BasicTrajectoryState: attempt to access errors when none available accessing local error..
freestate pointer: parameters
x = -nan -nan -nan
p = nan nan nan
no error defined.
local error valid/values :0
[ -nan -nan -nan nan nan
-nan -nan -nan nan nan
-nan -nan nan nan nan
nan nan nan nan nan
nan nan nan nan -nan ]
%MSG
Kind regards, Aneliya (ORM)
cms-bot internal usage
A new Issue was created by @aneliyaKF.
@Dr15Jones, @ftenchini, @makortel, @mandrenguyen, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
A similar error was also observed in 1 paused job during run 399655, in PromptReco HIPhysicsRawPrime (51 with "Fatal Exception" (ExitCode: 8021).
%MSG-w BasicTrajectoryState: TrackProducer:detachedTripletStepTracks 22-Nov-2025 19:13:58 CET Run: 399655 Event: 252587784
local error not pos-def
[ -35.375 -0.990479 -0.132111-0.000319362-0.000348806
-0.990479 -0.0117493 -0.00463486-2.12342e-06-8.21054e-06
-0.132111 -0.00463486 0.00757408 -2.6077e-08 4.47035e-06
-0.000319362-2.12342e-06 -2.6077e-08 3.30137e-06-1.09371e-10
-0.000348806-8.21054e-06 4.47035e-06-1.09371e-10 3.06887e-06 ]
pos/mom/mf (0.979133,2.97937,-9.53805) (0.00632766,0.000876588,-0.000835602) (-2.46465e-05,-7.49959e-05,3.81095)
%MSG
%MSG-w BasicTrajectoryState: TrackProducer:detachedTripletStepTracks 22-Nov-2025 19:13:58 CET Run: 399655 Event: 252587784
curv error not pos-def
[ -35.375 0.00594972 0.12615 0.000113238 0.000309221
0.00594972 0.00105278-0.000151185 2.18029e-08 1.8862e-06
0.12615-0.000151185-0.000181092-2.03397e-06-1.06495e-06
0.000113238 2.18029e-08-2.03397e-06 4.2317e-07-1.46018e-07
0.000309221 1.8862e-06-1.06495e-06-1.46018e-07 3.06504e-06 ]
pos/mom/mf (0.979133,2.97937,-9.53805) (0.00632766,0.000876588,-0.000835602) (-2.46465e-05,-7.49959e-05,3.81095)
%MSG
%MSG-w TrackNaN: TrackProducer:detachedTripletStepTracks 22-Nov-2025 19:13:58 CET Run: 399655 Event: 252587784
Track has NaN or the cov is not pos-definite
%MSG
%MSG-w BasicTrajectoryState: TrackProducer:detachedTripletStepTracks 22-Nov-2025 19:13:58 CET Run: 399655 Event: 252587784
curv error not pos-def
[ 88.3073 0.00707775 9.88759e-06-2.21239e-11-1.77576e-08
0.00707775-0.000441852 2.82613e-06 3.54845e-15 2.88961e-12
9.88759e-06 2.82613e-06 5.38871e-05-8.43347e-13-7.24804e-10
-2.21239e-11 3.54845e-15-8.43347e-13 4.53265e-12 3.70162e-09
-1.77576e-08 2.88961e-12-7.24804e-10 3.70162e-09 3.02428e-06 ]
pos/mom/mf (-0.0799883,2.62328,-12.8177) (10.5548,3.04969,-27.2065) (2.5853e-06,-8.47869e-05,3.81077)
%MSG
%MSG-w TrackNaN: TrackProducer:detachedTripletStepTracks 22-Nov-2025 19:13:58 CET Run: 399655 Event: 252587784
Track has NaN or the cov is not pos-definite
assign reconstruction
New categories assigned: reconstruction
@jfernan2,@mandrenguyen,@srimanob you have been requested to review this Pull request/Issue and eventually sign? Thanks
type tracking
Dear everyone,
We have similar errors in run 399710 in PromptReco HIPhysicsRawPrime58, with PerformanceKill , exit code 50664 , and in run 399593 in PromptReco HIPhysicsRawPrime22, with error JobKilled , exit code 71304 .
The tarball for job 399710 is in
/eos/cms/tier0/store/unmerged/hidata/logs/prod/2025/11/25/PromptReco_Run399710_HIPhysicsRawPrime58/Reco/0000/0/03fa7101-4ac8-4997-9b90-b347ff2e9fb3-77-0-logArchive.tar.gz
Kind regards, Federica (ORM)
Hello @friti @mandrenguyen I have placed all the relevant long job tarballs in
/eos/home-c/cmst0/public/PausedJobs/HIRun2025A/wallClock
assign core
I am able to reproduce the error when I start processing from the beginning of the file. As soon as I use skipEvents to get anywhere close though, the job seems to complete successfully.
The job really seems simply to hang, but I couldn't figure out how to debug further. Core folks, maybe you can recommend some debugging tools to isolate the issue (hence the "assign")? Thanks!
New categories assigned: core
@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks
Adding Tracer service to the job
process.add_(cms.Service("Tracer"))
will show you exactly what module is being run when it gets stuck (or slow).
Run it under gdb. When it gets stuck, interrupt with a control-c and do "thread apply all bt".
Adding Tracer service to the job
process.add_(cms.Service("Tracer"))
will show you exactly what module is being run when it gets stuck (or slow).
I added the tracer. The job eventually finished so I was not really able to see where it slowed down (if it did indeed slow down). Previously, I had seen that interactively if I try to skip events to get close the event where the job died in the log, I don't observe any issue. If I start from scratch I do see the job get stuck in the same place as in the provided log. The tracer gives so much screen output that it's a bit tough to know in real time when you're close to that point.
If the job finished, I'd expect to see an excessive time warning saying that a module took too long (if that's enabled). Also, if you're running these jobs multi-threaded, try seeing if it reproduces running single-threaded. In a multi-threaded job you really have no idea what event it gets stuck on, as the other streams will continue processing until they wait for completion at the end of the job/run--so it could be the first event in the file that has a problem. If it replicates single-threaded, that will pin down which event and module is getting hung up.
There is another option, you can tell the Tracer service to write a file rather than output to the log. Then you can use the tracer viewer https://github.com/cms-sw/cmssw/tree/CMSSW_15_1_X/FWCore/Services/web
to visualize the various timings happening in the job.