Memory error after reinitializing and stopping a two-LIF-process network multiple times
Objective of issue: Allow the network to be re-instantiated and run an arbitrary number of times without memory issues, so that Lava can be used for offline training.
Lava version:
- [x] 0.4.0
- [x] 0.3.0
I'm submitting a ...
- [x] bug report
Current behavior:
- After reinitializing and running a network about 2000 times, I get a memory error. If I try to recreate the network with new weights during training, I cannot train for a sufficient number of iterations.
Expected behavior:
- I'd expect that, in theory, we should be able to reinitialize and run the network an arbitrary number of times without seeing a memory error.
Steps to reproduce:
- Run the minimal example below.
Related code:
```python
import numpy as np

from lava.magma.core.run_conditions import RunSteps
from lava.proc.lif.process import LIF
from lava.proc.dense.process import Dense
from lava.magma.core.run_configs import Loihi1SimCfg

num_steps = 5000
du = 10
dv = 100
vth = 4900

if __name__ == "__main__":
    for k in range(num_steps):
        # Create processes
        lif1 = LIF(shape=(3, ),
                   vth=vth,
                   dv=dv,
                   du=du,
                   bias_mant=(1, 3, 2),
                   name="lif1")
        dense = Dense(weights=np.random.rand(2, 3), name='dense')
        lif2 = LIF(shape=(2, ),
                   vth=vth,
                   dv=dv,
                   du=du,
                   bias_mant=0,
                   name='lif2')

        lif1.s_out.connect(dense.s_in)
        dense.a_out.connect(lif2.a_in)

        lif2.run(condition=RunSteps(num_steps=10),
                 run_cfg=Loihi1SimCfg(select_tag="fixed_pt"))
        lif2.stop()
        print("k = " + str(k))
```
Other information:
- I see an error at around the same point for both 0.3.0 and 0.4.0, although the error output differs slightly.
- Inside `stop()` in `runtime.py`, `self.join()` doesn't appear to work properly, and the runtime services don't get killed (when I run in debug mode). However, if I move `self.join()` to immediately below `if self._is_started:`, the runtime service threads terminate appropriately. I'm not sure why this is.
- The fix from point 2 allows a few hundred more iterations of the code before hitting a memory issue, but I still see the memory error eventually.
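To illustrate the shutdown pattern described above in isolation, here is a minimal, hypothetical sketch (the `MiniRuntime` class is invented for illustration and is not Lava's actual `Runtime` code): a `stop()` that joins its worker threads as soon as it knows the runtime was started, so the threads are actually reaped instead of lingering.

```python
import threading
import time


class MiniRuntime:
    """Toy stand-in for a runtime that owns worker threads."""

    def __init__(self):
        self._is_started = False
        self._stop_event = threading.Event()
        self._workers = []

    def start(self):
        self._is_started = True
        for _ in range(2):
            t = threading.Thread(target=self._work)
            t.start()
            self._workers.append(t)

    def _work(self):
        # Workers poll the stop flag instead of blocking forever.
        while not self._stop_event.is_set():
            time.sleep(0.01)

    def stop(self):
        if self._is_started:
            # Signal and join immediately under the started check,
            # so every worker thread is actually terminated here.
            self._stop_event.set()
            for t in self._workers:
                t.join()
            self._workers.clear()
            self._is_started = False


runtime = MiniRuntime()
runtime.start()
runtime.stop()
print("threads left:", threading.active_count())
```

If `join()` is skipped (or unreachable on some code path), each start/stop cycle leaves threads behind, which matches the growing thread count reported later in this thread.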
The output error I'm seeing (before the fix from point 2) with 0.4.0 is:
```
k = 1883
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/jashkanazy/dev/lava_tests/simple_thread_test.py", line 34, in <module>
    lif2.run(condition=RunSteps(num_steps=10),
  File "/home/jashkanazy/dev/lava_tests/lava/src/lava/magma/core/process/process.py", line 343, in run
    self._runtime.initialize()
  File "/home/jashkanazy/dev/lava_tests/lava/src/lava/magma/runtime/runtime.py", line 144, in initialize
    self._start_ports()
  File "/home/jashkanazy/dev/lava_tests/lava/src/lava/magma/runtime/runtime.py", line 154, in _start_ports
    port.start()
  File "/home/jashkanazy/dev/lava_tests/lava/src/lava/magma/compiler/channels/pypychannel.py", line 240, in start
    self.thread.start()
  File "/usr/lib/python3.8/threading.py", line 852, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
```
The output error I'm seeing (before the fix from point 2) with 0.3.0 is:
```
k = 1881
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/resource_tracker.py", line 201, in main
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    cache[rtype].remove(name)
KeyError: '/psm_49e392f1'
    exec(code, run_globals)
  File "/home/jashkanazy-local/dev/sllml/SLLML/src/snn-algos/examples/compare_loihi_to_lava/simple_thread/simple_thread.py", line 78, in <module>
  File "/home/jashkanazy-local/dev/sllml/sllml_venv/lib/python3.8/site-packages/lava/magma/core/process/process.py", line 422, in run
    self._runtime.initialize()
  File "/home/jashkanazy-local/dev/sllml/sllml_venv/lib/python3.8/site-packages/lava/magma/runtime/runtime.py", line 138, in initialize
    self._build_sync_channels()
  File "/home/jashkanazy-local/dev/sllml/sllml_venv/lib/python3.8/site-packages/lava/magma/runtime/runtime.py", line 199, in _build_sync_channels
    channel: Channel = sync_channel_builder.build(
  File "/home/jashkanazy-local/dev/sllml/sllml_venv/lib/python3.8/site-packages/lava/magma/compiler/builders/builder.py", line 747, in build
    return channel_class(
  File "/home/jashkanazy-local/dev/sllml/sllml_venv/lib/python3.8/site-packages/lava/magma/compiler/channels/pypychannel.py", line 337, in __init__
    shm = smm.SharedMemory(int(nbytes * size))
  File "/usr/lib/python3.8/multiprocessing/managers.py", line 1386, in SharedMemory
    sms = shared_memory.SharedMemory(None, create=True, size=size)
  File "/usr/lib/python3.8/multiprocessing/shared_memory.py", line 113, in __init__
    self._mmap = mmap.mmap(self._fd, size)
OSError: [Errno 12] Cannot allocate memory
```
And here is the error with 0.4.0 after the fix in `runtime.py`:
```
k = 2134
Process SystemProcess-10677:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 308, in _bootstrap
    util._run_after_forkers()
  File "/usr/lib/python3.8/multiprocessing/util.py", line 163, in _run_after_forkers
    items = list(_afterfork_registry.items())
MemoryError
Process SystemProcess-10678:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 308, in _bootstrap
    util._run_after_forkers()
  File "/usr/lib/python3.8/multiprocessing/util.py", line 163, in _run_after_forkers
    items = list(_afterfork_registry.items())
MemoryError
Process SystemProcess-10680:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 308, in _bootstrap
    util._run_after_forkers()
  File "/usr/lib/python3.8/multiprocessing/util.py", line 163, in _run_after_forkers
    items = list(_afterfork_registry.items())
MemoryError
Process SystemProcess-10679:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 308, in _bootstrap
    util._run_after_forkers()
  File "/usr/lib/python3.8/multiprocessing/util.py", line 163, in _run_after_forkers
    items = list(_afterfork_registry.items())
MemoryError
```
Thank you for reporting this, we need to look into it.
We are currently also working on replacing the multiprocessing library with a more sophisticated solution, which might also help with this problem.
I took a look, and it's difficult for me to reproduce this behavior. The simulation gets slower and slower, and after only a couple hundred iterations it's unbearably slow. So I can at least confirm that something is going wrong, which could be a memory issue.
From the description of the problem, it looks like the threads are not getting properly closed. I added `print("active threads", threading.active_count())` in the loop of the test script above and saw that the number of active threads is indeed increasing:
```
active threads 1
k = 0
active threads 3
k = 1
active threads 5
k = 2
active threads 7
k = 3
active threads 9
k = 4
active threads 11
k = 5
```
I found that adding `self.send(np.zeros(self._shape))` under `self._done = True` in line 134 of `pypychannel.py`, and `self.recv()` in line 282, helps in the sense that the number of active threads no longer increases. Unfortunately, the speed of execution still decreases over iterations.
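The fix above, sending a dummy message so a receiver blocked on a channel can wake up, notice the done flag, and exit, is a common shutdown pattern for threads blocked on a queue. A minimal stdlib sketch of the idea (the names `drain` and `_SENTINEL` are invented here; this is not Lava's `pypychannel` code):

```python
import queue
import threading

_SENTINEL = object()  # dummy "wake-up" message


def drain(q, received):
    # Blocks on q.get(); without a final wake-up item this thread
    # would sit in get() forever and join() would never return.
    while True:
        item = q.get()
        if item is _SENTINEL:
            break
        received.append(item)


q = queue.Queue()
received = []
t = threading.Thread(target=drain, args=(q, received))
t.start()

for i in range(3):
    q.put(i)

# Push the sentinel so the blocked get() returns, the loop exits,
# and join() can complete -- analogous to the dummy send() above.
q.put(_SENTINEL)
t.join()
print(received)  # [0, 1, 2]
```

Without the sentinel, each start/stop cycle strands one blocked thread, which matches the monotonically growing `active_count()` shown above.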
This is a known issue with Python's shared-memory implementation leaking file descriptors until the OS eventually throws this error. We are working on a C++-based shared-memory implementation and an overall redesign of the message-passing architecture that keeps the Channel APIs intact (no user code changes). That should fix this issue. I don't currently have a date for when it will be merged.
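For reference, the resource lifecycle at issue can be seen directly in the stdlib: each `multiprocessing.shared_memory.SharedMemory` segment holds an OS-level file descriptor (and, on Linux, an entry under `/dev/shm`) until it is explicitly released. A small sketch of correct cleanup, assuming Python 3.8+:

```python
from multiprocessing import shared_memory

# create=True allocates an OS-backed shared-memory segment; doing
# this repeatedly without close()/unlink() leaks descriptors until
# the OS refuses new allocations (e.g. "Cannot allocate memory").
shm = shared_memory.SharedMemory(create=True, size=1024)
name = shm.name

shm.buf[:4] = b"lava"
data = bytes(shm.buf[:4])

# Release this process's mapping, then remove the segment itself.
shm.close()
shm.unlink()

print(data)  # b'lava'
```

After `unlink()`, attaching to the segment by name raises `FileNotFoundError`, confirming the OS resource is gone.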
Based on the discussion yesterday, it sounds like the merge will be done in about three weeks, and we hope to make it part of the next release.
May I add that there has been development regarding setting weights during runtime. You can now initialize your network once and set the weights in each iteration instead of re-creating the complete network every time. This speeds up execution drastically and also avoids the Python memory issue.
The script above would look like this:
```python
import numpy as np

from lava.magma.core.run_conditions import RunSteps
from lava.proc.lif.process import LIF
from lava.proc.dense.process import Dense
from lava.magma.core.run_configs import Loihi1SimCfg

num_steps = 5000
du = 10
dv = 100
vth = 4900

if __name__ == "__main__":
    # Create processes
    lif1 = LIF(shape=(3, ),
               vth=vth,
               dv=dv,
               du=du,
               bias_mant=(1, 3, 2),
               name="lif1")
    dense = Dense(weights=np.random.randint(1, 10, (2, 3)), name='dense')
    lif2 = LIF(shape=(2, ),
               vth=vth,
               dv=dv,
               du=du,
               bias_mant=0,
               name='lif2')

    lif1.s_out.connect(dense.s_in)
    dense.a_out.connect(lif2.a_in)

    for k in range(num_steps):
        if k > 0:
            # Update the weights in place instead of rebuilding the network
            dense.weights.set(np.random.randint(1, 10, (2, 3)))
        lif2.run(condition=RunSteps(num_steps=10),
                 run_cfg=Loihi1SimCfg(select_tag="fixed_pt"))
        print("k = " + str(k), dense.weights.get())

    lif2.stop()
```
Oh nice! Great news! Is this part of main?
Yes.
The only other change needed to make this script truly equivalent to the one you posted initially is to reset the states (`u` and `v`) of the LIF neurons in each iteration.
Wow, that is a lot faster! For `LearningDense`, is there (or will there be) a similar way to update `dw`?
As a future nice-to-have, it would be great if the built-in Lava processes had a `reset()` function that automatically resets all internal states to their default values.