Multiprocessing: Ubuntu VM runs out of memory and kills processes
Dear pyNetLogo developers,
First, thank you very much for your great work!
I have a problem when running pyNetLogo with multiprocessing. I ran the code below on a Linux server, and the issue is that the server runs out of memory and starts to kill processes. Below you will find the Linux environment, the error message, and the Python code.
The processes fairly quickly reach around 800 MB and then, after a couple of minutes, hit 1 GB. When that happens, the OS kills the process, the pool respawns a process, and the routine repeats itself.
We assume the problem is that the Java VM takes too much RAM.
Does anyone have an idea what we could do?
Thank you very much.
Environment Linux
- Openstack VM
- OS: Ubuntu
- 16 VCPUs
- 16 GB RAM
Error Message on Linux
Apr 25 19:05:01 netlogo01 kernel: [ 7097.004932] GC Thread#9 invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 Apr 25 19:05:01 netlogo01 kernel: [ 7097.004954] CPU: 2 PID: 25671 Comm: GC Thread#9 Not tainted 5.4.0-109-generic #123-Ubuntu Apr 25 19:05:01 netlogo01 kernel: [ 7097.004955] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1.1 04/01/2014 Apr 25 19:05:01 netlogo01 kernel: [ 7097.004960] Call Trace: Apr 25 19:05:01 netlogo01 kernel: [ 7097.005000] dump_stack+0x6d/0x8b Apr 25 19:05:01 netlogo01 kernel: [ 7097.005004] dump_header+0x4f/0x1eb Apr 25 19:05:01 netlogo01 kernel: [ 7097.005006] oom_kill_process.cold+0xb/0x10 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005011] out_of_memory+0x1cf/0x4d0 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005017] __alloc_pages_slowpath+0xd5e/0xe50 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005020] __alloc_pages_nodemask+0x2d0/0x320 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005023] alloc_pages_current+0x87/0xe0 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005028] __page_cache_alloc+0x72/0x90 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005030] pagecache_get_page+0xbf/0x300 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005032] filemap_fault+0x6b2/0xa50 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005036] ? unlock_page_memcg+0x12/0x20 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005038] ? page_add_file_rmap+0xff/0x1a0 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005041] ? 
filemap_map_pages+0x24c/0x380 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005045] ext4_filemap_fault+0x32/0x50 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005048] __do_fault+0x3c/0x130 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005050] do_fault+0x24b/0x640 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005051] __handle_mm_fault+0x4c5/0x7a0 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005053] handle_mm_fault+0xca/0x200 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005057] do_user_addr_fault+0x1f9/0x450 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005059] __do_page_fault+0x58/0x90 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005061] ? schedule+0x42/0xb0 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005063] do_page_fault+0x2c/0xe0 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005067] do_async_page_fault+0x39/0x70 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005071] async_page_fault+0x34/0x40 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005083] RIP: 0033:0x7fa66c0b0160 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005097] Code: Bad RIP value. Apr 25 19:05:01 netlogo01 kernel: [ 7097.005098] RSP: 002b:00007fa5965f4a18 EFLAGS: 00010246 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005100] RAX: 00007fa622eaccc9 RBX: 000000000445b000 RCX: 0000000000000007 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005101] RDX: 00007fa622eacb80 RSI: 00007fa622eaccc9 RDI: 00007fa5965f4a50 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005102] RBP: 00007fa5965f5cd0 R08: 00007fa622eacb90 R09: 00007fa622eac000 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005102] R10: 00007fa676372000 R11: 000000000000000b R12: 00007fa66cc15568 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005103] R13: 00007fa5965f4a50 R14: 00007fa5965f4a20 R15: 00007fa66b779918 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005109] Mem-Info: Apr 25 19:05:01 netlogo01 kernel: [ 7097.005119] active_anon:3989999 inactive_anon:46 isolated_anon:0 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005119] active_file:85 inactive_file:41 isolated_file:0 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005119] unevictable:4619 
dirty:7 writeback:0 unstable:0 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005119] slab_reclaimable:12889 slab_unreclaimable:33002 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005119] mapped:2107 shmem:269 pagetables:10297 bounce:0 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005119] free:33828 free_pcp:0 free_cma:0 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005122] Node 0 active_anon:15959996kB inactive_anon:184kB active_file:340kB inactive_file:164kB unevictable:18476kB isolated(anon):0kB isolated(file):0kB mapped:8428kB dirty:28kB writeback:0kB shmem:1076kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no Apr 25 19:05:01 netlogo01 kernel: [ 7097.005126] Node 0 DMA free:15908kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB Apr 25 19:05:01 netlogo01 kernel: [ 7097.005129] lowmem_reserve[]: 0 2911 15902 15902 15902 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005132] Node 0 DMA32 free:64272kB min:12360kB low:15448kB high:18536kB active_anon:2987768kB inactive_anon:0kB active_file:4kB inactive_file:0kB unevictable:0kB writepending:8kB present:3129196kB managed:3063660kB mlocked:0kB kernel_stack:448kB pagetables:6108kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB Apr 25 19:05:01 netlogo01 kernel: [ 7097.005135] lowmem_reserve[]: 0 0 12991 12991 12991 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005137] Node 0 Normal free:55132kB min:55156kB low:68944kB high:82732kB active_anon:12972228kB inactive_anon:184kB active_file:416kB inactive_file:0kB unevictable:18476kB writepending:20kB present:13631488kB managed:13310972kB mlocked:18476kB kernel_stack:9008kB pagetables:35080kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB Apr 25 19:05:01 netlogo01 kernel: [ 7097.005141] lowmem_reserve[]: 0 0 0 0 0 Apr 25 19:05:01 netlogo01 
kernel: [ 7097.005143] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB Apr 25 19:05:01 netlogo01 kernel: [ 7097.005149] Node 0 DMA32: 138*4kB (UME) 119*8kB (UME) 60*16kB (UME) 112*32kB (UME) 49*64kB (UME) 19*128kB (UME) 4*256kB (U) 1*512kB (U) 3*1024kB (UME) 0*2048kB 12*4096kB (M) = 65376kB Apr 25 19:05:01 netlogo01 kernel: [ 7097.005157] Node 0 Normal: 1403*4kB (UME) 589*8kB (UME) 536*16kB (UME) 218*32kB (UME) 118*64kB (UME) 47*128kB (UME) 7*256kB (ME) 3*512kB (ME) 1*1024kB (M) 1*2048kB (U) 3*4096kB (M) = 58132kB Apr 25 19:05:01 netlogo01 kernel: [ 7097.005172] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB Apr 25 19:05:01 netlogo01 kernel: [ 7097.005173] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB Apr 25 19:05:01 netlogo01 kernel: [ 7097.005176] 2811 total pagecache pages Apr 25 19:05:01 netlogo01 kernel: [ 7097.005184] 0 pages in swap cache Apr 25 19:05:01 netlogo01 kernel: [ 7097.005185] Swap cache stats: add 0, delete 0, find 0/0 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005185] Free swap = 0kB Apr 25 19:05:01 netlogo01 kernel: [ 7097.005186] Total swap = 0kB Apr 25 19:05:01 netlogo01 kernel: [ 7097.005187] 4194169 pages RAM Apr 25 19:05:01 netlogo01 kernel: [ 7097.005187] 0 pages HighMem/MovableOnly Apr 25 19:05:01 netlogo01 kernel: [ 7097.005189] 96534 pages reserved Apr 25 19:05:01 netlogo01 kernel: [ 7097.005190] 0 pages cma reserved Apr 25 19:05:01 netlogo01 kernel: [ 7097.005190] 0 pages hwpoisoned Apr 25 19:05:01 netlogo01 kernel: [ 7097.005191] Tasks state (memory values in pages):
Apr 25 19:05:01 netlogo01 kernel: [ 7097.005193] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name Apr 25 19:05:01 netlogo01 kernel: [ 7097.005215] [ 489] 0 489 16956 1002 110592 0 -250 systemd-journal Apr 25 19:05:01 netlogo01 kernel: [ 7097.005219] [ 520] 0 520 5412 928 73728 0 -1000 systemd-udevd Apr 25 19:05:01 netlogo01 kernel: [ 7097.005232] [ 635] 0 635 70035 4488 98304 0 -1000 multipathd Apr 25 19:05:01 netlogo01 kernel: [ 7097.005235] [ 683] 102 683 22547 723 81920 0 0 systemd-timesyn Apr 25 19:05:01 netlogo01 kernel: [ 7097.005239] [ 734] 100 734 6643 874 73728 0 0 systemd-network Apr 25 19:05:01 netlogo01 kernel: [ 7097.005241] [ 743] 101 743 5998 1338 86016 0 0 systemd-resolve Apr 25 19:05:01 netlogo01 kernel: [ 7097.005244] [ 793] 0 793 59571 586 98304 0 0 accounts-daemon Apr 25 19:05:01 netlogo01 kernel: [ 7097.005246] [ 803] 0 803 2134 556 53248 0 0 cron Apr 25 19:05:01 netlogo01 kernel: [ 7097.005248] [ 805] 103 805 1876 834 57344 0 -900 dbus-daemon Apr 25 19:05:01 netlogo01 kernel: [ 7097.005251] [ 812] 0 812 20489 532 65536 0 0 irqbalance Apr 25 19:05:01 netlogo01 kernel: [ 7097.005253] [ 814] 0 814 7319 2701 98304 0 0 networkd-dispat Apr 25 19:05:01 netlogo01 kernel: [ 7097.005262] [ 815] 104 815 56086 528 81920 0 0 rsyslogd Apr 25 19:05:01 netlogo01 kernel: [ 7097.005264] [ 818] 0 818 495509 3975 376832 0 -900 snapd Apr 25 19:05:01 netlogo01 kernel: [ 7097.005266] [ 831] 0 831 4196 578 73728 0 0 systemd-logind Apr 25 19:05:01 netlogo01 kernel: [ 7097.005268] [ 835] 0 835 948 490 45056 0 0 atd Apr 25 19:05:01 netlogo01 kernel: [ 7097.005270] [ 845] 0 845 1838 445 53248 0 0 agetty Apr 25 19:05:01 netlogo01 kernel: [ 7097.005272] [ 863] 0 863 1457 386 45056 0 0 agetty Apr 25 19:05:01 netlogo01 kernel: [ 7097.005274] [ 868] 0 868 27026 2692 106496 0 0 unattended-upgr Apr 25 19:05:01 netlogo01 kernel: [ 7097.005276] [ 872] 0 872 3043 890 65536 0 -1000 sshd Apr 25 19:05:01 netlogo01 kernel: [ 7097.005278] [ 914] 0 914 58178 396 
81920 0 0 polkitd Apr 25 19:05:01 netlogo01 kernel: [ 7097.005282] [ 1066] 1000 1066 4596 1137 73728 0 0 systemd Apr 25 19:05:01 netlogo01 kernel: [ 7097.005285] [ 1067] 1000 1067 42155 815 102400 0 0 (sd-pam) Apr 25 19:05:01 netlogo01 kernel: [ 7097.005301] [ 1564] 1000 1564 1776 718 53248 0 0 dbus-daemon Apr 25 19:05:01 netlogo01 kernel: [ 7097.005303] [ 23476] 1000 23476 359438 19492 495616 0 0 python Apr 25 19:05:01 netlogo01 kernel: [ 7097.005305] [ 23509] 1000 23509 2427262 397571 4079616 0 0 python Apr 25 19:05:01 netlogo01 kernel: [ 7097.005307] [ 23510] 1000 23510 2378715 389342 4112384 0 0 python Apr 25 19:05:01 netlogo01 kernel: [ 7097.005310] [ 23511] 1000 23511 2393942 401891 4206592 0 0 python Apr 25 19:05:01 netlogo01 kernel: [ 7097.005313] [ 23512] 1000 23512 2412028 472853 4644864 0 0 python Apr 25 19:05:01 netlogo01 kernel: [ 7097.005316] [ 23515] 1000 23515 2394432 433659 4300800 0 0 python Apr 25 19:05:01 netlogo01 kernel: [ 7097.005318] [ 23517] 1000 23517 2475568 494803 4919296 0 0 python Apr 25 19:05:01 netlogo01 kernel: [ 7097.005320] [ 23518] 1000 23518 2381472 553995 5267456 0 0 python Apr 25 19:05:01 netlogo01 kernel: [ 7097.005321] [ 23519] 1000 23519 2414080 425594 4411392 0 0 python Apr 25 19:05:01 netlogo01 kernel: [ 7097.005323] [ 23522] 1000 23522 2396526 471481 4628480 0 0 python Apr 25 19:05:01 netlogo01 kernel: [ 7097.005325] [ 26579] 0 26579 3449 1024 65536 0 0 sshd Apr 25 19:05:01 netlogo01 kernel: [ 7097.005327] [ 26685] 1000 26685 3483 663 65536 0 0 sshd Apr 25 19:05:01 netlogo01 kernel: [ 7097.005331] [ 26686] 1000 26686 2512 889 61440 0 0 bash Apr 25 19:05:01 netlogo01 kernel: [ 7097.005334] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-1.scope,task=python,pid=23518,uid=1000 Apr 25 19:05:01 netlogo01 kernel: [ 7097.005475] Out of memory: Killed process 23518 (python) total-vm:9525888kB, anon-rss:2215972kB, file-rss:0kB, shmem-rss:8kB, 
UID:1000 pgtables:5144kB oom_score_adj:0 Apr 25 19:05:02 netlogo01 kernel: [ 7097.110971] oom_reaper: reaped process 23518 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:8kB
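For reference, the task table in the log reports memory in pages, so the per-process figures are easier to read after conversion. The sketch below copies the `rss` values of the nine pool workers (PIDs 23509 to 23522) from the log and assumes 4 kB pages, which matches the log's own numbers (`active_anon:3989999` pages against `active_anon:15959996kB`). Note that `rss` counts shared pages once per process, so the sum may overstate the real total somewhat.

```python
# rss values (in pages) of the nine pool workers, copied from the task table
WORKER_RSS_PAGES = [
    397571, 389342, 401891, 472853, 433659,
    494803, 553995, 425594, 471481,
]

# Assumption: 4 kB pages, consistent with the pages/kB pairs in the log
PAGE_SIZE_BYTES = 4096

# "4194169 pages RAM" from the log
TOTAL_RAM_PAGES = 4194169


def pages_to_gib(pages, page_size=PAGE_SIZE_BYTES):
    """Convert a page count to GiB."""
    return pages * page_size / 2**30


total_pages = sum(WORKER_RSS_PAGES)
largest_pages = max(WORKER_RSS_PAGES)

print("workers total : {:.2f} GiB".format(pages_to_gib(total_pages)))
print("largest worker: {:.2f} GiB".format(pages_to_gib(largest_pages)))
print("machine RAM   : {:.2f} GiB".format(pages_to_gib(TOTAL_RAM_PAGES)))
```

Under these assumptions the workers account for nearly all of the machine's RAM at the moment the OOM killer fires, and the killed worker (PID 23518) is the largest one.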
Code:
The idea is to perform the sensitivity analysis for different settings of the base-model.
```python
import json
import os
import shutil
from datetime import datetime
import numpy as np
import pandas as pd
import multiprocess as multiprocessing
from multiprocessing import set_start_method
set_start_method("spawn")
import dill
import sys
import pyNetLogo
from SALib.sample import saltelli
import time


def initializer(modelfile, variable, setting):
    '''initialize a subprocess

    Parameters
    ----------
    modelfile : str
    '''
    p = multiprocessing.current_process()
    print("Initializing Process : {}".format(p.name))
    if hpc:
        jvm_home = '/data/opt/tools/openjdk-16.0.0/bin/java'
        netlogo_version = '6.2'
        netlogo_home = 'netlogo_6_2_2'
    # we need to set the instantiated netlogo
    # link as a global so run_simulation can
    # use it
    global netlogo
    if hpc:
        netlogo = pyNetLogo.NetLogoLink(gui=False, netlogo_home=netlogo_home,
                                        netlogo_version=netlogo_version, jvm_home=jvm_home)
    else:
        netlogo = pyNetLogo.NetLogoLink(gui=False)
    netlogo.load_model(modelfile)
    netlogo.command('set {0} {1}'.format(variable, setting))


def run_simulation(experiments):
    '''run a netlogo model

    Parameters
    ----------
    experiments : dict
    '''
    #print("Start experiment {}".format(experiments))
    # Set the input parameters
    for key, value in experiments.items():
        if key == 'random-seed':
            # The NetLogo random seed requires a different syntax
            netlogo.command('random-seed {}'.format(value))
        else:
            # Otherwise, assume the input parameters are global variables
            netlogo.command('set {0} {1}'.format(key, value))
    netlogo.command('setup')
    counts = netlogo.repeat_report([dv], ticks, include_t0=False)
    results = pd.Series([counts[dv].values.max()], index=[dv])
    return results


current_module = sys.modules[__name__]
work = [current_module]
dill.detect.trace(True)
dill.pickles(work)

if __name__ == '__main__':
    ....
    for variable, settings in dct_model_settings.items():
        for setting in settings:
            path_model_results = os.path.join(path_results, '{}_{}.csv'.format(variable, setting))
            if not os.path.exists(path_model_results):
                os.mkdir(path_model_results)
            t0 = time.time()
            p = 2 * len(dct_sa_settings.keys()) + 2
            print("Start experiments:{} -- v: {} -- set:{}".format(exp_runs * p, variable, setting))
            ### Create experiments
            experiments = pd.DataFrame(param_values, columns=dct_problem['names'])
            ### Multiprocessing
            map_result = multiprocessing.Pool(None,
                                              initializer=initializer,
                                              initargs=(modelfile, variable, setting,)).map(run_simulation,
                                                                                            experiments.to_dict('records'))
            print(map_result)
            results = pd.DataFrame(map_result)
```
A couple of thoughts:
- You create a new pool on every pass through the loop. I would try moving the pool out of the for loop.
- What is the idea behind the dill-related code?
- How much RAM does the NetLogo model require?
- You can limit the number of processes in the pool in light of the available RAM.
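To make the first and last points concrete, here is a minimal sketch, not a drop-in fix. It uses the standard library `multiprocessing` and stand-in worker functions rather than pyNetLogo, so `workers_for_ram`, `build_tasks`, `init_worker`, `run_task`, and `run_all` are hypothetical names, and the per-worker memory figure has to come from your own measurements. The idea is to create one pool for all settings, pass the (variable, setting) pair along with each task instead of through `initargs`, and size the pool from the available RAM.

```python
import multiprocessing
import os


def workers_for_ram(total_ram_gb, per_worker_gb, reserve_gb=2.0, cpu_count=None):
    """Pick a pool size that fits in RAM, capped by the number of CPUs.

    reserve_gb is memory left for the OS and the parent process (an assumption).
    """
    if per_worker_gb <= 0:
        raise ValueError("per_worker_gb must be positive")
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    usable_gb = max(total_ram_gb - reserve_gb, 0.0)
    by_ram = int(usable_gb // per_worker_gb)
    return max(1, min(cpu_count, by_ram))


def build_tasks(model_settings, experiments):
    """Flatten {variable: [settings]} x experiments into one list of tasks."""
    tasks = []
    for variable, settings in model_settings.items():
        for setting in settings:
            for experiment in experiments:
                tasks.append((variable, setting, experiment))
    return tasks


def init_worker(modelfile):
    # Stand-in: in the real script this is where the NetLogo link would be
    # created and the model loaded, once per worker process.
    global _modelfile
    _modelfile = modelfile


def run_task(task):
    variable, setting, experiment = task
    # Stand-in: the real worker would set `variable` to `setting`, apply the
    # experiment parameters, run the model, and return the result.
    return {"variable": variable, "setting": setting, "n_params": len(experiment)}


def run_all(modelfile, model_settings, experiments, processes):
    """Run every (variable, setting, experiment) task on a single pool."""
    tasks = build_tasks(model_settings, experiments)
    # The with-block shuts the pool down once map() has returned, so no
    # worker processes outlive the run.
    with multiprocessing.Pool(processes,
                              initializer=init_worker,
                              initargs=(modelfile,)) as pool:
        return pool.map(run_task, tasks)
```

For example, on a 16 GB machine with 16 CPUs, `workers_for_ram(16, 2.2, cpu_count=16)` returns 6; the 2.2 GB per worker is only a guess based on the size of the killed process in the log. `run_all(...)` would then be called once, under the `if __name__ == '__main__':` guard, with `processes` set to that value.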