cling jit can hit VMA limit
Check duplicate issues.
- [X] Checked for duplicates
Description
hi -
Recently, we had an ATLAS bug report (https://its.cern.ch/jira/browse/ATR-28411) in which a job that loaded many dictionaries was starting to fail with errors like
cling JIT session error: Cannot allocate memory
Although the job in question was not small, it was also not close to hitting any limit on the process virtual size, which made this error confusing. It turned out that the job was hitting the limit on the number of VMAs, which on lxplus is just about 64k.
A simple script that demonstrates the issue is given below. When I run this on lxplus, it fails after about 21600 iterations, at which point we have
21600 vmsize 1.2G nvma 65469
So the total vmsize is still (relatively...) quite small, but the number of VMAs has bumped up against the system limit.
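For reference, the ceiling being hit here is the kernel's per-process map count limit, vm.max_map_count (65530 by default), rather than any memory-size limit. A minimal check of the current mapping count against that limit, assuming a Linux /proc filesystem as on lxplus:

# Compare the number of mappings currently in use with the kernel's
# per-process limit (vm.max_map_count, roughly 64k by default).
with open('/proc/sys/vm/max_map_count') as f:
    limit = int(f.read())
with open('/proc/self/maps') as f:
    nvma = len(f.readlines())
print(f'VMAs in use: {nvma}, vm.max_map_count: {limit}')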
Looking at a small part of the memory map:
7fb2d892f000-7fb2d8930000 r-xp 00000000 00:00 0
7fb2d8930000-7fb2d8931000 rw-p 00000000 00:00 0
7fb2d8931000-7fb2d8932000 r--p 00000000 00:00 0
7fb2d8932000-7fb2d8933000 r-xp 00000000 00:00 0
7fb2d8933000-7fb2d8934000 rw-p 00000000 00:00 0
7fb2d8934000-7fb2d8935000 r--p 00000000 00:00 0
7fb2d8935000-7fb2d8936000 r-xp 00000000 00:00 0
So each time the jitter runs, it is producing a section of executable memory, a section of read-only memory, and a section of read-write memory. This would also be a performance issue.
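If useful, this permission pattern can be confirmed from inside the process by tallying the entries of /proc/self/maps by their protection flags; a small diagnostic sketch (not part of the reproducer below):

# Tally /proc/self/maps entries by protection flags (e.g. r-xp, rw-p, r--p)
# to see how many separate one-page JIT allocations have accumulated.
from collections import Counter
counts = Counter()
with open('/proc/self/maps') as f:
    for line in f:
        counts[line.split()[1]] += 1
for perms, n in counts.most_common():
    print(perms, n)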
The jitter should somehow be more intelligent about grouping together regions with the same protection. This may, however, be in tension with other requirements, such as the one described in this comment from IncrementalJit.cpp:
// A memory manager for Cling that reserves memory for code and data sections
// to keep them contiguous for the emission of one module. This is required
// for working exception handling support since one .eh_frame section will
// refer to many separate .text sections. However, stack unwinding in libgcc
// assumes that two unwinding objects (for example coming from two modules)
// are non-overlapping, which is hard to guarantee with separate allocations
// for the individual code sections.
In the original ATLAS job that prompted this report, the problem was fixed by removing the behavior that was triggering auto-parsing, so this is not a priority for us. But I wanted to go ahead and submit it in order to document the issue.
Reproducer
import ROOT
ROOT.gInterpreter.ProcessLine ('void aaa() {}')
def once():
    ROOT.gInterpreter.ProcessLine ('aaa();')
    return
def printstats (i):
    vmsize = int (open('/proc/self/statm').read().split()[0])
    nvma = len (open('/proc/self/maps').readlines())
    print (f'{i:6} vmsize {vmsize*4/1024/1024:6.2}G nvma {nvma:6}')
    return
for i in range(30000):
    if i%100 == 0: printstats (i)
    once()
ROOT version
6.28.08
Installation method
LCG_104b_ATLAS_3
Operating system
lxplus9
Additional context
No response
Thanks a lot, Scott!
I hope it's not adding noise, but I do want to mention that it's possible to hit this issue in completely unexpected (and, from the user perspective, hard to avoid) ways. For example, TASImage::FromPad has the line
gVirtualPS = (TVirtualPS*)gROOT->ProcessLineFast("new TImageDump()");
and so the following will crash on lxplus:
import ROOT
ROOT.gROOT.SetBatch()
h = ROOT.TH1F()
img = ROOT.TImage.Create()
pad = ROOT.TCanvas()
h.Draw()
for i in range(30000):
    img.FromPad(pad)
without an obvious workaround. This winds up being a blocker for ATLAS data quality.
I acknowledge this issue needs a detailed discussion within the ROOT team. Now, focusing on ATLAS Data Quality (Monitoring?), would fixing TASImage::FromPad put that part of the processing in a safe place? Needless to say, this would only be a mitigation.
Hi @dpiparo, indeed fixing that one particular method will solve the problem for us, as far as we know. We have actually been trying in parallel to implement something like your #14960, so this should be very helpful indeed.
Thanks for the clarification. Since the mitigation is effective for ATLAS, I would like to understand for which releases the backport is required, if at all. Do you need the backport for 6.28? And 6.30?
Closing. Please do not hesitate to re-open in case this is still an issue for ATLAS.