
Multiprocessing: Ubuntu VM out of memory and kills processes

Open RomanRietsche opened this issue 3 years ago • 1 comments

Dear pyNetLogo Developers

First, thank you very much for your great work!

I have a problem when running pyNetLogo with multiprocessing. I ran the code below on a Linux server, and the issue is that the server runs out of memory and starts to kill processes. Below you will find the Linux environment, the error message, and the Python code.

The processes fairly quickly reach around 800 MB and then, after a couple of minutes, hit 1 GB. At that point the OS kills the process, the pool respawns a worker, and the cycle repeats itself.

We assume the problem is that the Java VM takes too much RAM.

Does anyone have an idea what we could do?

Thank you very much.

Environment Linux

  • Openstack VM
  • OS: Ubuntu
  • 16 VCPUs
  • 16 GB RAM

Error Message on Linux

Apr 25 19:05:01 netlogo01 kernel: [ 7097.004932] GC Thread#9 invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Apr 25 19:05:01 netlogo01 kernel: [ 7097.004954] CPU: 2 PID: 25671 Comm: GC Thread#9 Not tainted 5.4.0-109-generic #123-Ubuntu
Apr 25 19:05:01 netlogo01 kernel: [ 7097.004955] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1.1 04/01/2014
Apr 25 19:05:01 netlogo01 kernel: [ 7097.004960] Call Trace:
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005000]  dump_stack+0x6d/0x8b
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005004]  dump_header+0x4f/0x1eb
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005006]  oom_kill_process.cold+0xb/0x10
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005011]  out_of_memory+0x1cf/0x4d0
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005017]  __alloc_pages_slowpath+0xd5e/0xe50
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005020]  __alloc_pages_nodemask+0x2d0/0x320
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005023]  alloc_pages_current+0x87/0xe0
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005028]  __page_cache_alloc+0x72/0x90
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005030]  pagecache_get_page+0xbf/0x300
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005032]  filemap_fault+0x6b2/0xa50
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005036]  ? unlock_page_memcg+0x12/0x20
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005038]  ? page_add_file_rmap+0xff/0x1a0
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005041]  ? filemap_map_pages+0x24c/0x380
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005045]  ext4_filemap_fault+0x32/0x50
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005048]  __do_fault+0x3c/0x130
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005050]  do_fault+0x24b/0x640
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005051]  __handle_mm_fault+0x4c5/0x7a0
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005053]  handle_mm_fault+0xca/0x200
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005057]  do_user_addr_fault+0x1f9/0x450
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005059]  __do_page_fault+0x58/0x90
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005061]  ? schedule+0x42/0xb0
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005063]  do_page_fault+0x2c/0xe0
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005067]  do_async_page_fault+0x39/0x70
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005071]  async_page_fault+0x34/0x40
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005083] RIP: 0033:0x7fa66c0b0160
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005097] Code: Bad RIP value.
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005098] RSP: 002b:00007fa5965f4a18 EFLAGS: 00010246
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005100] RAX: 00007fa622eaccc9 RBX: 000000000445b000 RCX: 0000000000000007
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005101] RDX: 00007fa622eacb80 RSI: 00007fa622eaccc9 RDI: 00007fa5965f4a50
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005102] RBP: 00007fa5965f5cd0 R08: 00007fa622eacb90 R09: 00007fa622eac000
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005102] R10: 00007fa676372000 R11: 000000000000000b R12: 00007fa66cc15568
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005103] R13: 00007fa5965f4a50 R14: 00007fa5965f4a20 R15: 00007fa66b779918
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005109] Mem-Info:
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005119] active_anon:3989999 inactive_anon:46 isolated_anon:0
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005119]  active_file:85 inactive_file:41 isolated_file:0
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005119]  unevictable:4619 dirty:7 writeback:0 unstable:0
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005119]  slab_reclaimable:12889 slab_unreclaimable:33002
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005119]  mapped:2107 shmem:269 pagetables:10297 bounce:0
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005119]  free:33828 free_pcp:0 free_cma:0
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005122] Node 0 active_anon:15959996kB inactive_anon:184kB active_file:340kB inactive_file:164kB unevictable:18476kB isolated(anon):0kB isolated(file):0kB mapped:8428kB dirty:28kB writeback:0kB shmem:1076kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005126] Node 0 DMA free:15908kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005129] lowmem_reserve[]: 0 2911 15902 15902 15902
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005132] Node 0 DMA32 free:64272kB min:12360kB low:15448kB high:18536kB active_anon:2987768kB inactive_anon:0kB active_file:4kB inactive_file:0kB unevictable:0kB writepending:8kB present:3129196kB managed:3063660kB mlocked:0kB kernel_stack:448kB pagetables:6108kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005135] lowmem_reserve[]: 0 0 12991 12991 12991
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005137] Node 0 Normal free:55132kB min:55156kB low:68944kB high:82732kB active_anon:12972228kB inactive_anon:184kB active_file:416kB inactive_file:0kB unevictable:18476kB writepending:20kB present:13631488kB managed:13310972kB mlocked:18476kB kernel_stack:9008kB pagetables:35080kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005141] lowmem_reserve[]: 0 0 0 0 0
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005143] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005149] Node 0 DMA32: 138*4kB (UME) 119*8kB (UME) 60*16kB (UME) 112*32kB (UME) 49*64kB (UME) 19*128kB (UME) 4*256kB (U) 1*512kB (U) 3*1024kB (UME) 0*2048kB 12*4096kB (M) = 65376kB
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005157] Node 0 Normal: 1403*4kB (UME) 589*8kB (UME) 536*16kB (UME) 218*32kB (UME) 118*64kB (UME) 47*128kB (UME) 7*256kB (ME) 3*512kB (ME) 1*1024kB (M) 1*2048kB (U) 3*4096kB (M) = 58132kB
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005172] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005173] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005176] 2811 total pagecache pages
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005184] 0 pages in swap cache
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005185] Swap cache stats: add 0, delete 0, find 0/0
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005185] Free swap  = 0kB
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005186] Total swap = 0kB
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005187] 4194169 pages RAM
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005187] 0 pages HighMem/MovableOnly
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005189] 96534 pages reserved
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005190] 0 pages cma reserved
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005190] 0 pages hwpoisoned
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005191] Tasks state (memory values in pages):
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005193] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005215] [    489]     0   489    16956     1002   110592        0          -250 systemd-journal
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005219] [    520]     0   520     5412      928    73728        0         -1000 systemd-udevd
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005232] [    635]     0   635    70035     4488    98304        0         -1000 multipathd
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005235] [    683]   102   683    22547      723    81920        0             0 systemd-timesyn
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005239] [    734]   100   734     6643      874    73728        0             0 systemd-network
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005241] [    743]   101   743     5998     1338    86016        0             0 systemd-resolve
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005244] [    793]     0   793    59571      586    98304        0             0 accounts-daemon
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005246] [    803]     0   803     2134      556    53248        0             0 cron
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005248] [    805]   103   805     1876      834    57344        0          -900 dbus-daemon
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005251] [    812]     0   812    20489      532    65536        0             0 irqbalance
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005253] [    814]     0   814     7319     2701    98304        0             0 networkd-dispat
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005262] [    815]   104   815    56086      528    81920        0             0 rsyslogd
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005264] [    818]     0   818   495509     3975   376832        0          -900 snapd
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005266] [    831]     0   831     4196      578    73728        0             0 systemd-logind
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005268] [    835]     0   835      948      490    45056        0             0 atd
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005270] [    845]     0   845     1838      445    53248        0             0 agetty
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005272] [    863]     0   863     1457      386    45056        0             0 agetty
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005274] [    868]     0   868    27026     2692   106496        0             0 unattended-upgr
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005276] [    872]     0   872     3043      890    65536        0         -1000 sshd
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005278] [    914]     0   914    58178      396    81920        0             0 polkitd
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005282] [   1066]  1000  1066     4596     1137    73728        0             0 systemd
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005285] [   1067]  1000  1067    42155      815   102400        0             0 (sd-pam)
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005301] [   1564]  1000  1564     1776      718    53248        0             0 dbus-daemon
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005303] [  23476]  1000 23476   359438    19492   495616        0             0 python
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005305] [  23509]  1000 23509  2427262   397571  4079616        0             0 python
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005307] [  23510]  1000 23510  2378715   389342  4112384        0             0 python
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005310] [  23511]  1000 23511  2393942   401891  4206592        0             0 python
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005313] [  23512]  1000 23512  2412028   472853  4644864        0             0 python
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005316] [  23515]  1000 23515  2394432   433659  4300800        0             0 python
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005318] [  23517]  1000 23517  2475568   494803  4919296        0             0 python
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005320] [  23518]  1000 23518  2381472   553995  5267456        0             0 python
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005321] [  23519]  1000 23519  2414080   425594  4411392        0             0 python
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005323] [  23522]  1000 23522  2396526   471481  4628480        0             0 python
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005325] [  26579]     0 26579     3449     1024    65536        0             0 sshd
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005327] [  26685]  1000 26685     3483      663    65536        0             0 sshd
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005331] [  26686]  1000 26686     2512      889    61440        0             0 bash
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005334] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-1.scope,task=python,pid=23518,uid=1000
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005475] Out of memory: Killed process 23518 (python) total-vm:9525888kB, anon-rss:2215972kB, file-rss:0kB, shmem-rss:8kB, UID:1000 pgtables:5144kB oom_score_adj:0
Apr 25 19:05:02 netlogo01 kernel: [ 7097.110971] oom_reaper: reaped process 23518 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:8kB
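
As a rough reading of the task table in the log (an editorial sketch, not part of the original report): the `rss` column is given in pages. Assuming 4 KiB pages, which is consistent with the kill line (553995 pages × 4 kB is close to the reported anon-rss of 2215972 kB), the listed `python` processes together hold about 15.5 GiB, and the nine large ones sit at roughly 1.5 to 2.1 GiB each. The dictionary and helper names below are made up for illustration; the numbers are copied from the log.

```python
PAGE_KIB = 4  # assumed page size (the usual x86-64 Linux default)

# pid -> rss in pages, copied from the "python" rows of the task table
python_rss_pages = {
    23476: 19492,
    23509: 397571,
    23510: 389342,
    23511: 401891,
    23512: 472853,
    23515: 433659,
    23517: 494803,
    23518: 553995,
    23519: 425594,
    23522: 471481,
}


def pages_to_gib(pages, page_kib=PAGE_KIB):
    """Convert a page count to GiB."""
    return pages * page_kib / 1024 ** 2


total_gib = pages_to_gib(sum(python_rss_pages.values()))
largest_pid = max(python_rss_pages, key=python_rss_pages.get)

print("python processes: {}".format(len(python_rss_pages)))
print("total RSS: {:.1f} GiB".format(total_gib))
print("largest: pid {} at {:.1f} GiB".format(
    largest_pid, pages_to_gib(python_rss_pages[largest_pid])))
```

With no swap configured (`Total swap = 0kB`), this leaves essentially no headroom on a 16 GB machine, which matches the kernel picking the largest process, pid 23518, as the victim.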

Code:

The idea is to perform the sensitivity analysis for different settings of the base-model.

`
  import json
  import os
  import shutil
  from datetime import datetime
  import numpy as np
  import pandas as pd
  import multiprocess as multiprocessing
  from multiprocessing import set_start_method
  set_start_method("spawn")
  import dill
  import sys
  import pyNetLogo
  from SALib.sample import saltelli
  import time

  def initializer(modelfile, variable, setting):
      '''initialize a subprocess
  
      Parameters
      ----------
      modelfile : str
  
      '''
  
      p = multiprocessing.current_process()
      print("Initializing Process : {}".format(p.name))
  
      if hpc:
          jvm_home = '/data/opt/tools/openjdk-16.0.0/bin/java'
          netlogo_version = '6.2'
          netlogo_home = 'netlogo_6_2_2'
  
      # we need to set the instantiated netlogo
      # link as a global so run_simulation can
      # use it
      global netlogo
  
      if hpc:
          netlogo = pyNetLogo.NetLogoLink(gui=False, netlogo_home=netlogo_home,
                                          netlogo_version=netlogo_version, jvm_home=jvm_home)
      else:
          netlogo = pyNetLogo.NetLogoLink(gui=False)
  
      netlogo.load_model(modelfile)
      netlogo.command('set {0} {1}'.format(variable, setting))
  
  def run_simulation(experiments):
  
      '''run a netlogo model
  
      Parameters
      ----------
      experiments : dict
      '''
  
      #print("Start experiment {}".format(experiments))
  
      # Set the input parameters
      for key, value in experiments.items():
          if key == 'random-seed':
              # The NetLogo random seed requires a different syntax
              netlogo.command('random-seed {}'.format(value))
          else:
              # Otherwise, assume the input parameters are global variables
              netlogo.command('set {0} {1}'.format(key, value))
  
      netlogo.command('setup')
  
      counts = netlogo.repeat_report([dv], ticks, include_t0=False)
  
      results = pd.Series([counts[dv].values.max()], index=[dv])
      
      return results
  
  
  current_module = sys.modules[__name__]
  work = [current_module]
  dill.detect.trace(True)
  dill.pickles(work)
  
  if __name__ == '__main__':
  
  ....
      for variable, settings in dct_model_settings.items():
          for setting in settings:
  
              path_model_results = os.path.join(path_results, '{}_{}.csv'.format(variable, setting))
  
              if not os.path.exists(path_model_results):
                  os.mkdir(path_model_results)
  
              t0 = time.time()
          
              p = 2 * len(dct_sa_settings.keys()) + 2
  
              print("Start experiments:{} -- v: {} -- set:{}".format(exp_runs * p, variable, setting))
          
              ### Create experiments
          
              experiments = pd.DataFrame(param_values, columns=dct_problem['names'])
          
              ### Multiprocessing
          
              map_result = multiprocessing.Pool(None,
                                                initializer=initializer,
                                                initargs=(modelfile, variable, setting,)).map(run_simulation,
                                                                           experiments.to_dict('records'))
          
              print(map_result)
          
              results = pd.DataFrame(map_result)

`

RomanRietsche avatar Apr 28 '22 08:04 RomanRietsche

A couple of thoughts:

  • You create a new pool several times. I would try to move the pool out of the for loop.
  • What is the idea of the dill-related code?
  • How much RAM does the NetLogo model require?
  • You can limit the number of processes in the pool in light of the available RAM.
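
The first and last suggestions can be combined. The following is a minimal sketch under stated assumptions, not the poster's actual code: `pool_size`, `run_one`, and `run_all` are hypothetical names, and `run_one` is a stand-in for `run_simulation` that does no NetLogo work. Because one pool is now shared across all settings, the per-setting `set variable setting` command that the issue's `initializer` runs would have to move into the task itself, which is why each task carries the `(variable, setting)` pair.

```python
import multiprocessing
import os


def pool_size(total_ram_gb, per_worker_gb, reserve_gb=2.0, cpu_count=None):
    """Largest worker count that fits in RAM, capped at the CPU count."""
    cpus = cpu_count or os.cpu_count() or 1
    by_ram = int((total_ram_gb - reserve_gb) // per_worker_gb)
    return max(1, min(cpus, by_ram))


def run_one(task):
    """Stand-in for run_simulation: a real worker would apply the
    (variable, setting) pair and the experiment to its NetLogo link."""
    variable, setting, experiment = task
    return (variable, setting, sum(experiment.values()))


def run_all(model_settings, experiments, processes):
    """Create the pool once and reuse it for every (variable, setting) pair."""
    results = {}
    with multiprocessing.Pool(processes=processes) as pool:
        for variable, settings in model_settings.items():
            for setting in settings:
                tasks = [(variable, setting, e) for e in experiments]
                results[(variable, setting)] = pool.map(run_one, tasks)
    return results


if __name__ == "__main__":
    # 16 GB VM; 2.2 GB per worker is taken from the largest worker in the
    # OOM log as an illustrative upper bound, not a measurement of the model.
    n = pool_size(total_ram_gb=16, per_worker_gb=2.2)
    out = run_all({"a": [1, 2]}, [{"x": 1, "y": 2}], processes=min(n, 2))
    print(n, out)
```

`Pool(None)` starts one worker per CPU, so on the 16-vCPU machine described in the issue it would start 16 JVM-backed workers; sizing the pool from a per-worker memory estimate trades some parallelism for staying under the RAM limit.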

quaquel avatar Apr 28 '22 11:04 quaquel
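
On the issue's own assumption that the Java VM takes too much RAM, one further option is to cap each worker's Java heap when the link is created. This is a sketch under assumptions: it relies on `NetLogoLink` accepting a `jvmargs` list of extra JVM arguments, which should be checked against the installed pyNetLogo version, and `heap_flag` and `make_link` are hypothetical helpers. `-Xmx` is the standard JVM option for the maximum heap size; it limits only the Java heap, so each process will still use somewhat more than the cap.

```python
def heap_flag(total_ram_gb, workers, reserve_gb=2.0):
    """Build a -Xmx flag that splits the RAM left after `reserve_gb`
    evenly across `workers` JVMs (heap only, not total process memory)."""
    per_worker_mb = int((total_ram_gb - reserve_gb) * 1024 / workers)
    return "-Xmx{}m".format(per_worker_mb)


def make_link(jvmargs):
    """Create a headless link with extra JVM arguments.

    Assumes pyNetLogo is installed and that NetLogoLink takes `jvmargs`;
    the import is local so heap_flag can be used without NetLogo present.
    """
    import pyNetLogo
    return pyNetLogo.NetLogoLink(gui=False, jvmargs=jvmargs)


if __name__ == "__main__":
    # Example: 16 GB machine, 8 workers, 2 GB left for the OS and Python.
    print(heap_flag(total_ram_gb=16, workers=8))
```

In the issue's code, the resulting flag would be passed where `initializer` constructs the link, so that every worker process starts its JVM with the same cap.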