cling jit can hit VMA limit
Check duplicate issues.
- [X] Checked for duplicates
Description
hi -
Recently, we had an ATLAS bug report (https://its.cern.ch/jira/browse/ATR-28411) in which a job that loaded many dictionaries was starting to fail with errors like
cling JIT session error: Cannot allocate memory
Although the job in question was not small, it was also not close to hitting any limit on the process virtual size, which made this error confusing. It turned out that the job was hitting the limit on the number of VMAs, which on lxplus is just about 64k.
A simple script that demonstrates the issue is given below. When I run this on lxplus, it fails after about 21600 iterations, at which point we have
21600 vmsize 1.2G nvma 65469
So the total vmsize is still (relatively...) quite small, but the number of VMAs has bumped up against the system limit.
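For reference, the ceiling being hit here is the kernel's per-process map count limit, vm.max_map_count (65530 by default), rather than any memory-size limit. A minimal check of the current mapping count against that limit, assuming a Linux /proc filesystem as on lxplus:

# Compare the number of mappings currently in use with the kernel's
# per-process limit (vm.max_map_count, roughly 64k by default).
with open('/proc/sys/vm/max_map_count') as f:
    limit = int(f.read())
with open('/proc/self/maps') as f:
    nvma = len(f.readlines())
print(f'VMAs in use: {nvma}, vm.max_map_count: {limit}')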
Looking at a small part of the memory map:
7fb2d892f000-7fb2d8930000 r-xp 00000000 00:00 0
7fb2d8930000-7fb2d8931000 rw-p 00000000 00:00 0
7fb2d8931000-7fb2d8932000 r--p 00000000 00:00 0
7fb2d8932000-7fb2d8933000 r-xp 00000000 00:00 0
7fb2d8933000-7fb2d8934000 rw-p 00000000 00:00 0
7fb2d8934000-7fb2d8935000 r--p 00000000 00:00 0
7fb2d8935000-7fb2d8936000 r-xp 00000000 00:00 0
So each time the jitter runs, it is producing a section of executable memory, a section of read-only memory, and a section of read-write memory. This would also be a performance issue.
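If useful, this permission pattern can be confirmed from inside the process by tallying the entries of /proc/self/maps by their protection flags; a small diagnostic sketch (not part of the reproducer below):

# Tally /proc/self/maps entries by protection flags (e.g. r-xp, rw-p, r--p)
# to see how many separate one-page JIT allocations have accumulated.
from collections import Counter
counts = Counter()
with open('/proc/self/maps') as f:
    for line in f:
        counts[line.split()[1]] += 1
for perms, n in counts.most_common():
    print(perms, n)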
The jitter should somehow be more intelligent about grouping together regions with the same protection. This may, however, be in tension with other requirements, such as the one described in this comment from IncrementalJit.cpp:
// A memory manager for Cling that reserves memory for code and data sections
// to keep them contiguous for the emission of one module. This is required
// for working exception handling support since one .eh_frame section will
// refer to many separate .text sections. However, stack unwinding in libgcc
// assumes that two unwinding objects (for example coming from two modules)
// are non-overlapping, which is hard to guarantee with separate allocations
// for the individual code sections.
In the original ATLAS job that prompted this report, the problem was fixed by removing the behavior that was triggering auto-parsing, so this is not a priority for us. But I wanted to go ahead and submit it in order to document the issue.
Reproducer
import ROOT
ROOT.gInterpreter.ProcessLine ('void aaa() {}')
def once():
    ROOT.gInterpreter.ProcessLine ('aaa();')
    return
def printstats (i):
    vmsize = int (open('/proc/self/statm').read().split()[0])
    nvma = len (open('/proc/self/maps').readlines())
    print (f'{i:6} vmsize {vmsize*4/1024/1024:6.2}G nvma {nvma:6}')
    return
for i in range(30000):
    if i%100 == 0: printstats (i)
    once()
ROOT version
6.28.08
Installation method
LCG_104b_ATLAS_3
Operating system
lxplus9
Additional context
No response
Thanks a lot, Scott!
I hope it's not adding noise, but I do want to mention that it's possible to hit this issue in completely unexpected (and, from the user perspective, hard to avoid) ways. For example, TASImage::FromPad has the line
gVirtualPS = (TVirtualPS*)gROOT->ProcessLineFast("new TImageDump()");
and so the following will crash on lxplus:
import ROOT
ROOT.gROOT.SetBatch()
h = ROOT.TH1F()
img = ROOT.TImage.Create()
pad = ROOT.TCanvas()
h.Draw()
for i in range(30000):
    img.FromPad(pad)
without an obvious workaround. This winds up being a blocker for ATLAS data quality.
I acknowledge this issue needs a detailed discussion within the ROOT team. Now, focusing on ATLAS Data Quality (Monitoring?), would fixing TASImage::FromPad put that part of the processing in a safe place? Needless to say, this would only be a mitigation.
Hi @dpiparo, indeed fixing that one particular method will solve the problem for us, as far as we know. We have actually been trying in parallel to implement something like your #14960, so this should be very helpful indeed.
Thanks for the clarification. Since the mitigation is effective for ATLAS, I would like to understand for which releases the backport is required, if at all. Do you need the backport for 6.28? And 6.30?
Closing. Please do not hesitate to re-open in case this is still an issue for ATLAS.