LoadBalancedView bloats memory - bug or wrong settings?
This issue may be related to https://github.com/ipython/ipyparallel/issues/207, which is also not marked as solved yet. I also posted this problem on Stack Overflow (https://stackoverflow.com/questions/45781545/ipyparallels-loadbalancedview-bloats-memory-how-can-i-avoid-that).
I want to execute multiple tasks in parallel with Python and ipyparallel in a Jupyter notebook, using 4 local engines started with ipcluster start in a local console.
Although a DirectView would also work, I use a LoadBalancedView to map a set of tasks. Each task takes around 0.2 seconds (though this can vary) and runs a MySQL query that loads some data and then processes it.
With ~45000 tasks this works fine, but memory usage grows very high. That is a problem because I want to run another experiment with over 660000 tasks, which I can no longer run: it exceeds my 16 GB memory limit and the system starts swapping to my local drive. With a DirectView, by contrast, memory usage stays relatively small and never fills up. But I actually need the LoadBalancedView.
This happens even with a minimal working example that does no database query at all (see below).
I am not perfectly familiar with the ipyparallel library, but I have read something about logs and caches kept by the ipcontroller that may cause this. I am still not sure whether this is a bug or whether I can change some settings to avoid the problem.
Running a MWE
For my Python 3.5.3 environment running on Windows 10 I use the following (recent) packages:
- ipython 6.1.0
- ipython_genutils 6.1.0
- ipyparallel 6.0.2
- jupyter 1.0.0
- jupyter_client 4.4.0
- jupyter_console 5.0.0
- jupyter_core 4.2.0
I would like the following example to work for LoadBalancedView without the immense memory growth (if possible at all):
- Run ipcluster start in a console.
- Run a Jupyter notebook with the following three cells:

1st cell:
import ipyparallel as ipp
rc = ipp.Client()
lview = rc.load_balanced_view()

2nd cell:
%%px --local
import time

3rd cell:
def sleep_here(i):
    time.sleep(0.2)
    return 42

amr = lview.map_async(sleep_here, range(660000))
amr.wait_interactive()
Can anyone confirm this? IMO it is not a huge example to reproduce (unless I am missing something).
Yes, I can confirm this. I've been investigating the cause, but haven't nailed it down, yet.
The load-balanced view also seems to struggle, performance-wise, with large input sequences:
import ipyparallel
client = ipyparallel.Client()
lview = client.load_balanced_view()
dview = client[:]
def execute(view, n):
    # submit the map and return as soon as the first result arrives;
    # submission of all n tasks happens inside view.map() itself
    for item in view.map(lambda x: x, range(n), block=False):
        return item
%timeit execute(dview, 10) # 17 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit execute(dview, 100) # 14.3 ms ± 322 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit execute(dview, 1000) # 14.8 ms ± 381 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit execute(lview, 10) # 45.1 ms ± 7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit execute(lview, 100) # 358 ms ± 19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit execute(lview, 1000) # 3.34 s ± 375 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I came across this as well with very long async queues, running on macOS 10.12.5 with Python 2.7.14.
I've been investigating this again and still haven't identified the exact cause of the memory growth, but one mechanism to mitigate it is the chunksize argument. I believe the memory growth is related to the number of tasks (messages sent through the scheduler), which doesn't have to match the number of items in the mapped sequence. Setting chunksize=10 means that each message includes 10 elements of the map. This changes your task duration from 200ms to 2s and reduces the task count from 660k to 66k. The larger the chunksize, the lower the overhead; at the same time, coarser tasks mean that making it too big results in less smooth load balancing.
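For the MWE above, that would look roughly like this (a minimal sketch: chunksize is the map/map_async keyword discussed here, and the value 10 is just the example from this comment; time is imported inside the function so the sketch stands alone without the %%px cell):

import ipyparallel as ipp

rc = ipp.Client()
lview = rc.load_balanced_view()

def sleep_here(i):
    import time  # imported locally so each engine has it without %%px --local
    time.sleep(0.2)
    return 42

# chunksize=10 packs 10 map elements into each task message,
# so ~660000 elements become ~66000 scheduler messages.
amr = lview.map_async(sleep_here, range(660000), chunksize=10)
amr.wait_interactive()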
The new LoadBalancedView.imap (added in 7.0a2), which limits the number of outstanding tasks, should also greatly improve memory usage, since not all messages will be in flight at once.
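A rough sketch of how that could look (assuming ipyparallel >= 7.0a2; the max_outstanding value shown here is an assumed tuning knob of imap and may be left at its default):

import ipyparallel as ipp

rc = ipp.Client()
lview = rc.load_balanced_view()

def sleep_here(i):
    import time  # imported locally so the sketch is self-contained
    time.sleep(0.2)
    return 42

# imap submits tasks lazily and caps how many are outstanding at once,
# so the full 660000 messages never sit in the scheduler simultaneously.
# max_outstanding=100 is an assumption; the default may already be fine.
for result in lview.imap(sleep_here, range(660000), max_outstanding=100):
    pass  # consume results as they arrive instead of collecting them all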