Memory Jump from `14_1_0_pre5` for Phase2 Workflows
In one of the latest rounds of RelVal productions we have observed an increased memory usage in the RECO step for Phase2 (D110) PU workflows. We spotted this because many jobs started to fail due to excessive memory usage. This happened because we usually set the memory limit per job to 16GB (for 8-thread, 2-stream jobs) and we went quite frequently above that threshold. See e.g. the error report for a TTbar workflow (below you can find the memory reported by condor to have exceeded the maxPSS). For the equivalent workflow in pre4 we had just a few failures.
See also the PeakValueRss reported in the FrameworkJobReport.xml output of each job. I'm not sure how to interpret the low-value tail for pre5.
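As a side note, a minimal sketch to aggregate those numbers from the copied reports (assuming the peak RSS appears in each FrameworkJobReport.xml as a `<Metric Name="PeakValueRss" Value="..."/>` element; the exact layout and units may differ between report versions):

```python
#!/usr/bin/env python3
"""Collect PeakValueRss from a set of FrameworkJobReport.xml files."""
import glob
import sys
import xml.etree.ElementTree as ET

def peak_rss(report_path):
    """Return the PeakValueRss value from one job report, or None if absent."""
    root = ET.parse(report_path).getroot()
    # Look anywhere in the report for a Metric element named PeakValueRss.
    metric = root.find(".//Metric[@Name='PeakValueRss']")
    return float(metric.get("Value")) if metric is not None else None

if __name__ == "__main__":
    pattern = sys.argv[1] if len(sys.argv) > 1 else "**/FrameworkJobReport.xml"
    values = [rss for path in glob.glob(pattern, recursive=True)
              if (rss := peak_rss(path)) is not None]
    if values:
        print(f"{len(values)} reports, max PeakValueRss = {max(values):.1f}")
```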
Bottom line: something happened between pre4 and pre5 that caused the memory usage at RECO step to jump quite a bit.
Reports
I've copied all the job reports and configs for TTbar PU=200 here or you can check in
/eos/cms/store/logs/prod/2024/06/WMAgent/pdmvserv_RVCMSSW_14_1_0_pre4TTbar_14TeV__STD_2026D110_PU_240604_235438_8449/
/eos/cms/store/logs/prod/2024/08/WMAgent/pdmvserv_RVCMSSW_14_1_0_pre5TTbar_14TeV__STD_2026D110_PU_240801_183220_4814/
We have "solved" this on our side by raising the memory limit to 20GB. And the situation is unchanged in 14_1_0_pre6 (and 14_1_0_pre7), since we are still seeing failures if the memory limit is set to 16GB.
assign reconstruction
assign upgrade
New categories assigned: reconstruction,upgrade
@jfernan2,@mandrenguyen,@srimanob,@subirsarkar you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @AdrianoDee.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
@AdrianoDee What would be the way to reproduce these jobs in pre4 and pre5 (onwards)?
@makortel if you check this (or under /eos/user/a/adiflori/www/phase2_pre5_memory_issue/)
I've copied all the job reports for TTbar PU=200 here
you will find all the configs used by each job (i.e. with the corresponding lumisections) as a PSet.py (loading a pickle file). E.g. in /eos/user/a/adiflori/www/phase2_pre5_memory_issue/14_1_0_pre5/0/job/WMTaskSpace/cmsRun1 there's everything needed to run it
[adiflori@lxplus957 cmsRun1]$ ls -tlrh
total 67M
-rw-r--r--. 1 adiflori zh 419K Sep 2 01:15 FrameworkJobReport.xml
-rw-r--r--. 1 adiflori zh 128 Sep 2 01:15 PSet.py
-rw-r--r--. 1 adiflori zh 38M Sep 2 01:15 PSet.pkl
-rw-r--r--. 1 adiflori zh 70K Sep 2 01:15 cmsRun1-stdout.log
-rw-r--r--. 1 adiflori zh 54K Sep 2 01:15 Report.pkl
-rw-r--r--. 1 adiflori zh 227 Sep 2 01:15 cmsRun1-stderr.log
drwxr-xr-x. 2 adiflori zh 4.0K Sep 3 16:30 __pycache__
-rw-r--r--. 1 adiflori zh 29M Sep 3 16:30 dump.py
where dump.py was just me checking that the edmConfigDump output made sense (and it does).
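For completeness, a rough sketch of how one of those job directories can be inspected or rerun locally, assuming PSet.py follows the usual pattern of just unpickling the process from PSet.pkl (to be run inside the matching CMSSW environment):

```python
# inspect_pset.py -- run with python3 inside the cmsRun1 directory
import pickle

import FWCore.ParameterSet.Config as cms  # noqa: F401 (needed to unpickle the process)

# PSet.py in these WM job directories essentially does the same unpickling.
with open("PSet.pkl", "rb") as handle:
    process = pickle.load(handle)

# Human-readable dump, comparable to what edmConfigDump produces.
with open("dump_from_pickle.py", "w") as out:
    out.write(process.dumpPython())

print("process", process.name_(), "with", len(process.paths_()), "paths")
```

The job itself can then be rerun with `cmsRun PSet.py` in that directory.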
(I've made it clearer in the description that the link includes the job configs too.)
From the RECO side we have been looking at this increase in the RSS memory profile since the beginning [1] (indeed there is a drop of RSS back to normal levels at the end of the tasks), although the igprof profile does not show any difference in live memory consumption [2][3].
We have looked at the memory allocated by individual RECO modules in pre4 [4] and pre5 [5], and only a general slight increase across all modules can be seen, with no single real culprit.
Our suspicion from the beginning was the migration of the test machines to AL9, since pre4→pre5 happened at almost the same time as the operating-system transition, but we did not know whether this theory made sense since we did not know how RSS memory is managed. I have now discarded this hypothesis by running pre4 again on AL9.
[1] https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_12634.21.html
[2] https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/comp_igprof/html/CMSSW_14_1_0_pre5/12634.21/step3/mem_live.1.html
[3] https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/comp_igprof/html/CMSSW_14_1_0_pre5/12634.21/step3/mem_live.399.html
[4] https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/circles/piechart.php?local=false&dataset=CMSSW_14_1_0_pre4%2Fel8_amd64_gcc12%2F12634.21%2Fstep3_circles&resource=mem_alloc&colours=default&groups=reco_PhaseII_private&threshold=0
[5] https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/circles/piechart.php?local=false&dataset=CMSSW_14_1_0_pre5%2Fel8_amd64_gcc12%2F12634.21%2Fstep3_circles&resource=mem_alloc&colours=default&groups=reco_PhaseII_private&threshold=0
Our suspicion from the beginning was the migration of the test machines to AL9, since pre4→pre5 happened at almost the same time as the operating-system transition, but we did not know whether this theory made sense since we did not know how RSS memory is managed.
We have seen before reports or suspicion on the EL8/EL9 host OS leading to more memory being used compared to SLC7 https://github.com/cms-sw/cmssw/issues/42929 https://github.com/cms-sw/cmssw/issues/45028
I have now discarded this hypothesis by running pre4 again on AL9.
@jfernan2 Just to confirm I understood correctly, do you mean that you don't see a significant difference in memory usage between pre4 and pre5 when running on the same node or OS (AL9 I guess)?
@AdrianoDee Is there a significant difference in the sites or host OS versions where the jobs were run between pre4, pre5, pre6, and pre7?
I've copied all the job reports for TTbar PU=200 here
Using the setup from above, I processed the same 10 events of the pre4 input on an EL8 node using pre4 and pre5
- pre4 resulted in peak RSS 14133.7 Mbytes and VSIZE 29524.8 Mbytes
- pre5 resulted in peak RSS 14329.5 Mbytes (+1.4 %) and VSIZE 30748.9 Mbytes (+4.1 %)
The number of processed events was small, and real statistical conclusions would require more runs, but at least the numbers are not outrageously different.
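For anyone repeating this kind of comparison, one way to obtain per-event and peak RSS/VSIZE numbers is the SimpleMemoryCheck service; a hedged fragment (to be appended where `process` is defined, e.g. after the unpickling in PSet.py; parameter names should be checked against the release):

```python
import FWCore.ParameterSet.Config as cms

# Prints VSIZE/RSS as the job progresses and records the peaks in the job report.
process.SimpleMemoryCheck = cms.Service("SimpleMemoryCheck",
    ignoreTotal = cms.untracked.int32(1),           # skip the first event(s) when reporting increases
    moduleMemorySummary = cms.untracked.bool(True)  # per-module summary at end of job
)
```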
@jfernan2 Just to confirm I understood correctly, do you mean that you don't see a significant difference in memory usage between pre4 and pre5 when running on the same node or OS (AL9 I guess)?
My initial conclusion yesterday was that I was seeing NO difference between AL9 and older OS, based on this plot for wf11834.21:
https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_11834.21.html
where pre4, pre5 (and pre6) all ran on AL9 and, except for the very first few events, RSS was lower for pre4 than for pre5 (and pre6).
However, last night another cross-check of mine finished, wf 12634 in pre4 on AL9, and things have changed:
https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_12634.21.html
pre4 is now at the RSS levels of pre5 (pre6 and pre7) for more than half of the first events.
So, it looks like there is some dependence on AL9 and the way it handles RSS...
I am running some extra tests on pre3 using AL9 (all releases except pre4, pre5, pre6 and pre7 in those plots were run before AL9 entered the game), and with a Phase2 wf 25034.21 using pre4 on AL9, in order to confirm.
I also noticed that the trend of RSS over the job is very different for MC wfs (high at the beginning and then dropping) w.r.t. data wfs (increasing over time), see e.g. https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_136.889.html https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_140.047.html
although the number of events in the task for data wfs is an order of magnitude higher than for MC wfs. I am also testing those again in pre4 using AL9 to have another independent view.
On our side we submit all the RelVals with SCRAM_ARCH=el8_amd64_gcc12 so I suppose they all run on el8.
That's the case too in our profiling tests but the machine says at the beginning of the log: Building remotely on vocms011 (el9 GenuineIntel reco-profiling no_label cpu-32 amd64)
Ah, so there's some possible loophole, let me check more carefully.
After my last profiling tests in pre3 and pre4 with other MC and data workflows, I confirm that running on AL9 increases the peak RSS by 20-30% for the same job w.r.t. previous OS versions. The RSS is increased at the beginning of the task/job and then drops for the last events. I will update the plots above with the latest tests later; you can find all the results in text mode in the following folder (dates from Aug and Sep mean AL9):
ls -ld /eos/cms/store/user/cmsbuild/profiling/data/CMSSW_14_1_0_pre?/el8_amd64_gcc12/*/step3_TimeMemoryInfo.log | grep Sep
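A rough way to turn those text logs into RSS-vs-event numbers, assuming they contain SimpleMemoryCheck-style lines where the RSS value follows an "RSS" token (adjust the pattern if the actual format differs):

```python
import re
import sys

# Usage: python3 rss_from_log.py step3_TimeMemoryInfo.log
rss_values = []
with open(sys.argv[1]) as log:
    for line in log:
        match = re.search(r"\bRSS\s+([0-9]+(?:\.[0-9]+)?)", line)
        if match:
            rss_values.append(float(match.group(1)))

if rss_values:
    print(f"{len(rss_values)} samples, peak RSS {max(rss_values):.1f}, "
          f"last RSS {rss_values[-1]:.1f}")
```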
On our side we submit all the RelVals with SCRAM_ARCH=el8_amd64_gcc12 so I suppose they all run on el8.
Well, yes and no. The jobs will use the el8 binaries, and thus require either an el8 host OS or an el8 container, but in the container case the host OS can be (nearly) anything (as far as scram and CMSSW are concerned).
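A quick way to see both sides from inside a job: /etc/os-release reflects the OS the binaries see (the container's, if any), while the kernel version is a rough proxy for the host OS (e.g. 4.18.x on el8 hosts vs 5.14.x on el9 hosts); a small sketch:

```python
import platform

# OS as seen by the job (the container's, when running inside one).
with open("/etc/os-release") as f:
    os_release = dict(line.rstrip().split("=", 1) for line in f if "=" in line)
print("OS inside the job:", os_release.get("PRETTY_NAME", "unknown").strip('"'))

# The kernel always comes from the host, so its version hints at the host OS.
print("Host kernel      :", platform.release())
```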
Hi, does https://github.com/cms-sw/cmssw/issues/45854#issuecomment-2333615751 explain the memory increase? One thing to consider, if you don't spot anything in RECO, is DQM. DQM runs in the same step as RECO, so I don't think we can separate the memory used by DQM from that used by RECO in RelVals. I see several DQM-related PRs added in 14_1_0_pre5. Just a guess, in case we don't spot the issue in RECO.
Hi @srimanob, the RECO profiling tests show that running on AL9 produces an increase of the RSS memory of 20-30% in the first part of the job; the DQM part is excluded from these jobs.
Right @jfernan2, I mean that if this 20% increase explains the issue, then we focus on it. If not, we need to look at something beyond RECO, and DQM may be the next target as it is not part of the profiling.
Do we know if the jobs crash at the very beginning, or sometimes later?
No, there is no job crash, just RSS memory is increased at the beginning of the job, see e.g. https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_11834.21.html
I wonder how reproducible this RSS behavior is w.r.t. the job progress (in terms of event number), i.e. how would the RSS vs "time" plot look if you ran the same job (e.g. pre5) several times on the same node?
I can't give you an exact answer, but based on the graphs I sent you for several pre-releases, I would expect fluctuations around the mean value like those in the first plot on the left of https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_140.047.html
i.e. of the order of 400MB around a central mean value. What is probably more significant is the RSS peak value in the left column of the table in the link above.
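One way to quantify this would be to sample the RSS of the running cmsRun process at fixed intervals and overlay repeated runs; a minimal sketch (the pid, interval, and output file are up to the user):

```python
import sys
import time

def trace_rss(pid, interval=10.0, outfile="rss_trace.txt"):
    """Append 'timestamp VmRSS(kB)' lines for the given pid until it exits."""
    with open(outfile, "a") as out:
        while True:
            try:
                with open(f"/proc/{pid}/status") as status:
                    for line in status:
                        if line.startswith("VmRSS"):
                            out.write(f"{time.time():.0f} {line.split()[1]}\n")
                            out.flush()
                            break
            except FileNotFoundError:  # the process has finished
                break
            time.sleep(interval)

if __name__ == "__main__":
    trace_rss(int(sys.argv[1]))  # e.g. python3 trace_rss.py $(pgrep -n cmsRun)
```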
Hi @AdrianoDee
Looking at the absolute memory we use in pre5 on Alma9, I tried to run as in the RelVals, 100 jobs with 100 events/job, and what I observe is that the RSS is quite stable, a little bit below 15 GB. RSS is typically larger than PSS, so I don't understand why we have such a large PSS: that would mean the RSS is even larger or equal.
As an action for now, 1 stream should allow you to survive within 16 GB of PSS; I also measured it (see the config sketch below the plots).
8 Threads, 2 Streams
8 Threads, 1 Stream
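For reference, the 8-thread / 1-stream setup mentioned above can be forced in the step3 configuration with something like the following fragment (to be applied where `process` is defined; equivalently via the cmsDriver --nThreads/--nStreams options used for the RelVal):

```python
import FWCore.ParameterSet.Config as cms

# Keep 8 threads but run a single stream, trading some throughput for lower memory.
process.options.numberOfThreads = cms.untracked.uint32(8)
process.options.numberOfStreams = cms.untracked.uint32(1)
```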
My latest test, repeating pre3 on AL9, shows high RSS for the whole job/task, from event 1 to 400: https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_11834.21.html
Same for data wfs on pre4 (on AL9): https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_136.889.html https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_140.047.html
Would someone be able to compare (again) the behavior of cmsRun (jemalloc), cmsRunTC (TCMalloc), and cmsRunGlibC (glibc malloc)?
@srimanob @jfernan2 Do you run your tests on an otherwise free machine, or is other activity possible? In https://github.com/cms-sw/cmssw/issues/42387 it was reported that in cases where the memory ends up being fragmented (because of being heavily used by many processes) termination of an unrelated process can result in a huge spike in RSS (with jemalloc).
In my case I use the dedicated machine for RECO profiling, vocms011, which has at most two profiling tasks running in parallel.
In our tests cmsRunGlibC is run for the igprof part, e.g. igprof -d -mp -z -o step3_igprofMEM.gz -t cmsRunGlibC cmsRunGlibC step3_igprof.py
but it gives no difference in results for pre5 w.r.t. pre4: https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/comp_igprof/html/CMSSW_14_1_0_pre5/11834.21/step3/mem_live.1.html https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/comp_igprof/html/CMSSW_14_1_0_pre5/11834.21/step3/mem_live.399.html
Full details are here: https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/
In our tests cmsRunGlibC is run for the igprof part, e.g. igprof -d -mp -z -o step3_igprofMEM.gz -t cmsRunGlibC cmsRunGlibC step3_igprof.py
but it gives no difference in results for pre5 w.r.t. pre4:
I think this is expected behavior. IgProf (and any memory profiler that hooks into the memory allocation functions) reports the memory the application requests, and this correlates with the VSIZE (that includes also "overheads" such as the OS reserving memory in terms of pages of e.g. 4 kB or 64 kB). RSS, on the other hand, depends on the amount of memory that is actually being touched by the application.
For example,
std::vector<int> vec;
vec.reserve(1024*1024);
would (to leading order) result in a 4 MB increase in VSIZE, while RSS would not be impacted. Only code like vec[0] = 0; would cause RSS to increase. And a memory profiler would only see the memory allocation from vec.reserve().
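The same effect can be checked quickly from Python by watching /proc/self/status around a large allocation (a sketch, using numpy so the buffer is allocated but not initialized): VmSize jumps as soon as the memory is requested, while VmRSS only grows once the pages are actually written.

```python
import numpy as np

def vm_kb():
    """Return the current VmSize and VmRSS (kB) from /proc/self/status."""
    out = {}
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(("VmSize", "VmRSS")):
                key, value = line.split()[:2]
                out[key.rstrip(":")] = int(value)
    return out

print("start      ", vm_kb())
buf = np.empty(64 * 1024 * 1024, dtype=np.int32)  # ~256 MB requested, not touched
print("after alloc", vm_kb())                     # VmSize up, VmRSS roughly unchanged
buf.fill(0)                                       # touch every page
print("after touch", vm_kb())                     # now VmRSS goes up too
```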
With https://github.com/cms-sw/cmssw/issues/45854#issuecomment-2340904341 I really meant keeping all other things equal, and just using cmsRunTC / cmsRunGlibC instead of cmsRun to run the configuration.
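A minimal driver for such an apples-to-apples comparison, assuming a step3.py configuration in the current directory and GNU time available as /usr/bin/time (which reports the peak RSS on stderr):

```python
import re
import subprocess

CONFIG = "step3.py"  # the same configuration for all three executables

def peak_rss_mb(executable):
    """Run the config under /usr/bin/time -v and return the peak RSS in MB."""
    result = subprocess.run(["/usr/bin/time", "-v", executable, CONFIG],
                            capture_output=True, text=True)
    match = re.search(r"Maximum resident set size \(kbytes\): (\d+)", result.stderr)
    return int(match.group(1)) / 1024.0 if match else None

for exe in ("cmsRun", "cmsRunTC", "cmsRunGlibC"):
    print(f"{exe:12s} peak RSS = {peak_rss_mb(exe)} MB")
```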