Hundreds of paused Prompt Reco Tier0 jobs for Fill 10739 (2025, MD)
Dear experts, we have hundreds of paused Tier0 jobs (318 in total) for Fill 10739, scattered across runs 393512, 393514, 393515 and 393516. For all paused jobs the monitoring shows messages of the kind:
Error in CMSSW step cmsRun1 Number of Cores: 8 Job has exceeded maxPSS: 16000 MB Job has PSS: XXX MB
I suspect the current Tier0 setup in terms of threads / streams does not leave enough memory to attempt the reconstruction of the high-PU data (PU up to ~135). See also the report at this cmsTalk. A few example tarballs are available at:
/eos/user/c/cmst0/public/PausedJobs/Run2025C/highPU_MD
Could @cms-sw/core-l2 @cms-sw/reconstruction-l2 please have a look at whether these jobs could be salvaged, perhaps with a different configuration?
EDIT: Tagging @LinaresToine @jeyserma , so that Tier0 experts can follow.
assign core
assign reconstruction
New categories assigned: core,reconstruction
@Dr15Jones, @jfernan2, @makortel, @mandrenguyen, @smuzaffar you have been requested to review this Pull request/Issue and eventually sign. Thanks.
A new Issue was created by @mmusich.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
@srimanob FYI
Thanks @mmusich for the report.
Looking at the Tier-0 configuration, it uses 8 cores with 2 GB memory per core (https://github.com/dmwm/T0/blob/master/etc/ProdOfflineConfiguration.py#L295), so it does not fit the high-PU data.
Do we have some log in the past for setting up high PU processing?
> Do we have some log in the past for setting up high PU processing?
Just reporting for the record what was requested to be checked in the Joint Operations Mattermost chat.
For high PU, what we normally do in MC is limit the number of streams, i.e. 8 threads but 2 streams, to allow reconstruction+DQM to fit in memory. I don't see how to set nStreams in T0; do we have that option?
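For reference, a minimal sketch of how threads and streams are decoupled in a CMSSW configuration (a standalone fragment for illustration; the actual Tier0 jobs set these values through the Tier0/WMAgent machinery, not by hand):

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("RECO")

# 8 worker threads, but only 2 concurrent events (streams).
# Each stream holds one event's worth of transient memory,
# so fewer streams lowers the peak PSS at the cost of throughput.
process.options.numberOfThreads = 8
process.options.numberOfStreams = 2
```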
Given the job was running about 2.5GB/concurrent event, I would think trying just 5 or 6 streams might be sufficient to finish the job. That would have the benefit of finishing the job much faster than just using 2 streams.
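The arithmetic behind that estimate, using the numbers quoted above (~2.5 GB per concurrent event against the 16000 MB maxPSS cut; the helper name and the optional overhead term are mine):

```python
def max_streams(max_pss_mb: float, per_event_mb: float, overhead_mb: float = 0.0) -> int:
    """Largest number of concurrent events (streams) fitting under the PSS cut."""
    return int((max_pss_mb - overhead_mb) // per_event_mb)

# 16000 MB budget at ~2500 MB per concurrent event allows at most 6 streams;
# running 5 keeps some headroom for the memory not scaling with streams.
print(max_streams(16000, 2500))  # -> 6
```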
Testing on a file from the high-PU period with 6 streams, it seems to be OK, reaching 15 GB in total. Note that the jump in memory shows up almost at the end (before the 1000th event) and then drops. I will need more statistics. However, if we would like to move on, I propose to go with 5 streams. That should clear out most/all paused jobs and free the queue.
Note on the input file I used: /eos/cms/tier0/store/data/Run2025C/Muon1/RAW/v1/000/393/514/00000/7527edea-6230-4850-8688-bd137c5071db.root , which covers lumis {"393514": [[285, 350]]}, i.e. the high-PU part of the run.
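For anyone wanting to pick out the same lumisections, a small helper (the function name is mine; the mask follows the usual CMS JSON convention of run -> list of inclusive [first, last] ranges):

```python
import json

# Lumi mask in the usual CMS JSON convention quoted above.
mask = json.loads('{"393514": [[285, 350]]}')

def in_mask(run: int, lumi: int, mask: dict) -> bool:
    """True if (run, lumi) falls inside one of the mask's inclusive ranges."""
    return any(first <= lumi <= last for first, last in mask.get(str(run), []))

print(in_mask(393514, 300, mask))  # True
print(in_mask(393514, 400, mask))  # False
```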
Update with more events in the high-PU period:
it seems the job may fail in high-PU lumisections even with 6 streams.
In addition, the job log reported many
"gave up vertex reco for XXX tracks" messages when the number of tracks is large, i.e. > 1000 tracks.
We may need to look at it if we are interested in RECO.
Log can be found in /afs/cern.ch/user/s/srimanob/public/ForGitIssue48392/log_S6-v2.log
> We may need to look at it if we are interested in RECO.

that's well beyond the scope of this issue.
> > We may need to look at it if we are interested in RECO.
>
> that's well beyond the scope of this issue.
I agree. I just wanted to note it.
Recent result:
we may not survive when we reach the high-PU lumisections. Reducing streams seems to be the way to go, but then the job splitting may need to be updated, to make sure jobs finish within 48 hours.
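A back-of-the-envelope way to resize the splitting (every number below is an assumption for illustration, not a measurement from these jobs):

```python
def max_events_per_job(wall_limit_h: float, sec_per_event: float, streams: int) -> int:
    """Events fitting in the wall-clock limit when `streams` events run concurrently."""
    return int(wall_limit_h * 3600 / sec_per_event * streams)

# With an assumed 60 s/event single-stream cost and 5 streams,
# a 48 h limit allows roughly 14400 events per job.
print(max_events_per_job(48, 60, 5))  # -> 14400
```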
FYI: We resumed the 300 paused jobs with 5 streams (keeping 8 threads), and the majority finished successfully. Only 30 jobs (~10%) failed again due to similar maxPSS issues; we failed those out. The RAW is in any case available for future studies.
Is there something left to be done within this issue?
> Is there something left to be done within this issue?
During the Joint Ops meeting of Aug 4th it was asked whether anyone checked the reconstruction performance of the jobs that did succeed. I am not sure if anyone (DPGs or POGs) looked at it. Perhaps @cms-sw/reconstruction-l2 have some quick-and-dirty analysis setup for that. Cc: @sextonkennedy @cms-sw/ppd-l2
+core
Thanks. I take it that the part concerning core is done.
+1 Issue seems outdated
This issue is fully signed and ready to be closed.
> Issue seems outdated
@jfernan2 this issue is very much still relevant. To my knowledge, no attempt has yet been made to improve the memory usage of the reconstruction at high PU (which recurrently happens during MDs).