Memory increase not applied for resubmitted merge tasks

Open yanfr0818 opened this issue 10 months ago • 6 comments

For failed WFs with the 50660-PerformanceKill error, we want to resubmit the WFs with increased memory. However, when Unified creates the resubmission for ReReco jobs, the memory increase is not applied to merge tasks, as can be seen in this code: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMTask.py#L607.

We don't understand the rationale behind this restriction. If we would like to increase the memory of merge jobs, we should be able to do it.

Example of a failed WF: https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC4_r-0-Run2022F_ZeroBias_JMENano12p5_240201_014109_1580

The memory increase is reflected in the JSON config, but maxPSS is still set to 2355.2 in the Config tab.

yanfr0818, Mar 28 '24 16:03

Hi @yanfr0818, memory increase for merge tasks is not supported in WMCore. Whenever such memory-hungry jobs are spotted in production, the cause is usually a configuration and/or CMSSW problem.

I would suggest reporting this to Core Software, as the merge process is supposed to be very lightweight in terms of resource requirements.

amaltaro, Mar 28 '24 17:03

Thanks @amaltaro, we'll follow up with core, but is there a reason why this memory increase is completely blocked? Ops should be able to increase the memory if need be. This might be necessary to finish urgent workflows while we investigate their higher memory usage. We can come up with a PR to lift this restriction if you think this is not a breaking change. @hassan11196 @lucalavezzo FYI

haozturk, Apr 03 '24 08:04

There are some comments from the software core team in this issue. In short, the high memory usage is caused by the serialization of ParameterSets. While they work on a long-term solution, we can resubmit these jobs with a higher memory requirement (say 5 GB) as a quick fix.

Does this sound good to you? @amaltaro

yanfr0818, Apr 11 '24 20:04

Alan, is this a matter of removing Merge from this list [1]? We can make a PR for this, but we're not sure how to test it and check whether it would break something.

[1] https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMTask.py#L607
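To make the question concrete, here is a minimal sketch of the kind of task-type check being discussed. It is not the WMCore code: the function `apply_max_pss`, the dictionary shape of a task, the task names, and the contents of `SKIPPED_TASK_TYPES` are all assumptions made for illustration, and the real list should be read from WMTask.py at the link above.

```python
# Hypothetical sketch; NOT the actual WMCore implementation.
# The task types listed here are an assumption; check WMTask.py for the real list.
SKIPPED_TASK_TYPES = frozenset({"Merge", "Harvesting", "Cleanup", "LogCollect"})


def apply_max_pss(tasks, max_pss_mb, skipped_types=SKIPPED_TASK_TYPES):
    """Return copies of the tasks with maxPSS updated for non-skipped task types.

    tasks: list of dicts with "name", "type" and "maxPSS" keys (illustrative shape).
    """
    updated = []
    for task in tasks:
        new_task = dict(task)  # copy so the input list is left untouched
        if task["type"] not in skipped_types:
            new_task["maxPSS"] = max_pss_mb
        updated.append(new_task)
    return updated


tasks = [
    {"name": "DataProcessing", "type": "Processing", "maxPSS": 2355.2},
    {"name": "DataProcessingMerge", "type": "Merge", "maxPSS": 2355.2},
]

# With the default list, the merge task keeps its original value.
default_result = apply_max_pss(tasks, 5000)

# Dropping "Merge" from the list lets the override reach merge tasks too.
relaxed_result = apply_max_pss(tasks, 5000, SKIPPED_TASK_TYPES - {"Merge"})
```

In this toy model, removing one entry from the list is enough. Whether the same holds in WMCore depends on how the memory value reaches the task in the first place, which this sketch does not model.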

haozturk, Apr 12 '24 09:04

Hi @amaltaro, we'd like to follow up on this issue. Does the solution that Hasan proposed above work? Can we make a PR for this?

The underlying problem has already been fixed by the software core team, and the fix will be propagated with the next release of CMSSW. But we still need to resolve the failed WFs at hand.

yanfr0818, Apr 23 '24 19:04

Apologies for the belated reply. When I last looked into this, the process was much more convoluted than it looks.

We would need to have resource requirements for Merge tasks as well and properly map them by their names between workflow assignment and workflow construction. So we would likely have to change the construction of workload objects upstream (ReqMgr2).
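As a rough illustration of what mapping requirements by task name could involve, the sketch below resolves a memory request that is either a single value or a dictionary keyed by task name. The function `resolve_memory`, its parameters, and the task names are hypothetical and do not correspond to any ReqMgr2 or WMCore API.

```python
def resolve_memory(task_names, memory_request, default_mb):
    """Map a memory request (in MB) onto every task name.

    memory_request is either a single number, applied to all tasks, or a
    dict keyed by task name; tasks missing from the dict get default_mb.
    """
    if isinstance(memory_request, dict):
        unknown = set(memory_request) - set(task_names)
        if unknown:
            # A misspelled task name would otherwise be silently ignored.
            raise KeyError(f"Unknown task names: {sorted(unknown)}")
        return {name: memory_request.get(name, default_mb) for name in task_names}
    return {name: memory_request for name in task_names}


# Hypothetical task names, for illustration only.
names = ["DataProcessing", "DataProcessingMerge"]
per_task = resolve_memory(names, {"DataProcessingMerge": 5000}, 2300)
uniform = resolve_memory(names, 4000, 2300)
```

Even in this simplified form, the names used at assignment time have to match the names produced at construction time exactly, which is where the upstream changes would come in.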

Plus, StepChains are even more complicated, given that we have a single resource requirement for all the steps.

My opinion is that such a development would potentially hurt the system with unnecessary complexity and likely create other bugs.

If the CMSSW release is buggy, then we should run the same workflow once a new release is made. Additionally, we could look into allowing a CMSSW + ScramArch override in the ACDC creation, if that is really desired.

amaltaro, Apr 23 '24 21:04