Hundreds of paused Prompt Reco Tier0 jobs for Fill 10739 (2025, MD)

Open mmusich opened this issue 6 months ago • 13 comments

Dear experts, we have hundreds of paused Tier0 jobs (318 in total) for Fill 10739, scattered across runs 393512, 393514, 393515, 393516. For all paused jobs the monitoring shows messages of the kind:

Error in CMSSW step cmsRun1
Number of Cores: 8
Job has exceeded maxPSS: 16000 MB
Job has PSS: XXX MB
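For context, PSS (proportional set size) is the metric compared against maxPSS: each process is charged its private memory plus its proportional share of pages it has in common with sibling processes. The snippet below is only an illustrative sketch of how PSS can be read back on a node, assuming a Linux kernel recent enough to expose /proc/<pid>/smaps_rollup; the helper name is hypothetical and this is not part of the Tier0 machinery.

```python
# Illustrative only (not part of the Tier0 watchdog): read the PSS of a
# process from /proc/<pid>/smaps_rollup, the same metric compared to maxPSS.
import os

def pss_mb(pid):
    """Return the proportional set size of a process in MB."""
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            if line.startswith("Pss:"):
                return int(line.split()[1]) / 1024.0  # smaps_rollup reports kB
    return 0.0

if __name__ == "__main__":
    print(f"PSS of this process: {pss_mb(os.getpid()):.1f} MB")
```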

I suspect the current Tier0 setup in terms of threads / streams is not compatible with the available memory when attempting the reconstruction of the high-PU data (pileup up to ~135). See also the report in this cmsTalk thread. A few example tarballs are available at:

 /eos/user/c/cmst0/public/PausedJobs/Run2025C/highPU_MD

Can @cms-sw/core-l2 @cms-sw/reconstruction-l2 please have a look at whether these jobs could be salvaged, maybe with a different configuration?

EDIT: Tagging @LinaresToine @jeyserma, so that Tier0 experts can follow.

mmusich avatar Jun 24 '25 07:06 mmusich

assign core

mmusich avatar Jun 24 '25 07:06 mmusich

assign reconstruction

mmusich avatar Jun 24 '25 07:06 mmusich

New categories assigned: core,reconstruction

@Dr15Jones, @jfernan2, @makortel, @mandrenguyen, @smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Jun 24 '25 07:06 cmsbuild

cms-bot internal usage

cmsbuild avatar Jun 24 '25 07:06 cmsbuild

A new Issue was created by @mmusich.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Jun 24 '25 07:06 cmsbuild

@srimanob FYI

mmusich avatar Jun 24 '25 07:06 mmusich

Thanks @mmusich for the report.

Looking at the Tier-0 configuration, it uses 8 cores with 2 GB of memory per core (https://github.com/dmwm/T0/blob/master/etc/ProdOfflineConfiguration.py#L295), i.e. a 16 GB budget matching the 16000 MB maxPSS in the error above, so it does not fit the high-PU data.

Do we have any record from the past of how high-PU processing was set up?

srimanob avatar Jun 24 '25 08:06 srimanob

Do we have any record from the past of how high-PU processing was set up?

Just reporting, for the record, what was asked to be checked in the Joint Operations Mattermost chat:

For high PU, what we normally do in MC is to limit the number of streams, i.e. 8 threads but only 2 streams, to allow reconstruction+DQM to run. I don't see how to set nStreams in T0; do we have that option?
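For reference, in a standalone cmsRun configuration the thread/stream split quoted above is steered through the options PSet; whether the Tier0 WMAgent exposes the stream count separately from the core count is exactly the open question here. A minimal sketch, not taken from the actual Tier0 configuration:

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("RECO")

# 8 worker threads but only 2 concurrent streams (events in flight):
# per-event memory is held for at most 2 events at a time.
process.options = cms.untracked.PSet(
    numberOfThreads = cms.untracked.uint32(8),
    numberOfStreams = cms.untracked.uint32(2),
)
```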

mmusich avatar Jun 24 '25 11:06 mmusich

Given the job was using about 2.5 GB per concurrent event, I would think trying just 5 or 6 streams might be sufficient to finish the job. That would have the benefit of finishing the job much faster than using only 2 streams.
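A back-of-the-envelope check of that estimate; the baseline (non-per-stream) memory below is an assumed figure, not a measurement:

```python
# Rough estimate of how many streams fit under the Tier0 maxPSS limit.
max_pss_mb = 16000     # limit quoted in the error message
per_stream_mb = 2500   # ~2.5 GB per concurrent event (see above)
baseline_mb = 2000     # assumed framework + conditions overhead, not measured

n_streams = (max_pss_mb - baseline_mb) // per_stream_mb
print(n_streams)       # -> 5
```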

Dr15Jones avatar Jun 24 '25 14:06 Dr15Jones

Testing on a file taken during the high-PU period, with 6 streams it seems to be OK, reaching 15 GB in total. Note that the jump in memory shows up almost at the end (just before the 1000th event) and then drops. I will need to collect more statistics. However, if we would like to move on, I propose to go with 5 streams. That should clear out most/all paused jobs and free the queue.

Note on the input file I used: /eos/cms/tier0/store/data/Run2025C/Muon1/RAW/v1/000/393/514/00000/7527edea-6230-4850-8688-bd137c5071db.root, which covers the lumi range {"393514": [[285, 350]]}, i.e. the high-PU part of the run.
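A minimal sketch of how such a local test can be restricted to the same high-PU lumisections, assuming direct EOS access to the file; the process name and event count are placeholders:

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("TEST")

# Read only the high-PU lumisection range quoted above from the RAW file.
process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring(
        "file:/eos/cms/tier0/store/data/Run2025C/Muon1/RAW/v1/000/393/514/"
        "00000/7527edea-6230-4850-8688-bd137c5071db.root"
    ),
    lumisToProcess = cms.untracked.VLuminosityBlockRange("393514:285-393514:350"),
)

process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(-1))
```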

[memory usage plot]

srimanob avatar Jun 25 '25 07:06 srimanob

Update with more events in the high-PU period: [memory usage plot]

It seems the job may fail in the high-PU lumisections even with 6 streams.

In addition, the job log reported many "gave up vertex reco for XXX tracks" messages when the number of tracks is large, i.e. > 1000 tracks. We may need to look at this if we are interested in RECO. The log can be found at /afs/cern.ch/user/s/srimanob/public/ForGitIssue48392/log_S6-v2.log

srimanob avatar Jun 25 '25 10:06 srimanob

We may need to look at this if we are interested in RECO.

that's well beyond the scope of this issue.

mmusich avatar Jun 25 '25 10:06 mmusich

We may need to look at this if we are interested in RECO.

that's well beyond the scope of this issue.

I agree. I was just noting it for the record.

Recent result: [memory usage plot]. We may not survive when we reach the high-PU lumisections. Reducing the number of streams seems to be the way to go, but then the job splitting may need to be updated to make sure jobs finish within 48 hours.
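To illustrate the splitting concern, a small sketch of how the events-per-job budget shrinks with the stream count; the per-stream event rate is an assumed placeholder, not a measured throughput:

```python
# Illustrative only: events-per-job budget vs. number of streams, to keep a
# job under the 48 h wall clock. The per-stream rate is an assumed figure.
wall_limit_h = 48.0
rate_per_stream_hz = 0.05   # assumed events/s sustained by one stream at high PU

for n_streams in (8, 6, 5, 2):
    max_events = int(wall_limit_h * 3600 * rate_per_stream_hz * n_streams)
    print(f"{n_streams} streams -> at most ~{max_events} events per job")
```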

srimanob avatar Jun 25 '25 18:06 srimanob

FYI: we resumed the 300 paused jobs with 5 streams (keeping 8 threads), and the majority finished successfully. Only 30 jobs (10%) failed again due to similar maxPSS issues; we failed those jobs. The RAW data are in any case available for future studies.

jeyserma avatar Jun 30 '25 14:06 jeyserma

Is there something left to be done within this issue?

makortel avatar Jul 10 '25 19:07 makortel

Is there something left to be done within this issue?

During the Joint Ops meeting of Aug 4th it was asked whether anyone had checked the reconstruction performance of the jobs that did succeed. I am not sure if anyone (DPGs or POGs) looked at it. Perhaps @cms-sw/reconstruction-l2 have some quick and dirty analysis setup for that. Cc: @sextonkennedy @cms-sw/ppd-l2

mmusich avatar Aug 04 '25 11:08 mmusich

+core

Thanks. I take it that the part concerning core is done.

makortel avatar Aug 04 '25 13:08 makortel

+1

Issue seems outdated

jfernan2 avatar Oct 23 '25 17:10 jfernan2

This issue is fully signed and ready to be closed.

cmsbuild avatar Oct 23 '25 17:10 cmsbuild

Issue seems outdated

@jfernan2 this issue is very much still relevant. To my knowledge, no attempt has been made yet to improve the resource usage of the reconstruction at high PU (which recurrently happens during MDs).

mmusich avatar Oct 23 '25 17:10 mmusich