
BUG: Pool not closing pipes after close and join

iGeophysix opened this issue 6 years ago · 16 comments

I'm using billiard.Pool to run parallel computations. I have a continuous service (celery) doing this, and after a while I get an OSError 24 ("Too many open files"). While debugging, I found that a Pool creates a number of pipes in pool.py during _setup_queues():

self._inqueue = self._ctx.SimpleQueue() # creates 2 pipes
self._outqueue = self._ctx.SimpleQueue() # creates 2 pipes

and then 2 for each process in the Pool.
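
A quick way to see those descriptors, assuming billiard exposes SimpleQueue the way multiprocessing does and keeps the two Connection ends on the private _reader/_writer attributes:

import billiard as mp

q = mp.SimpleQueue()
# one Pipe() pair per SimpleQueue: two file descriptors
print(q._reader.fileno(), q._writer.fileno())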

When joining, it closes the pipes for each of the processes, but not the 4 for _inqueue and _outqueue. Those stay open until the main process exits. Code to reproduce:

import billiard as mp
import os

def f(a):
    return a + 1


if __name__ == '__main__':
    pid = os.getpid()
    get_number_of_conns = os.popen(f'ls -l /proc/{pid}/fd | wc -l').read()
    print(f'At the beginning we only have {get_number_of_conns.strip()} connections')
    for i in range(10):
        # creating a pool
        pool = mp.Pool(mp.cpu_count() - 1)
        # running a job
        result = pool.map(f, range(5))
        # closing the pool and joining
        pool.close()
        pool.join()
        # getting number of open connections
        get_number_of_conns = os.popen(f'ls -l /proc/{pid}/fd | wc -l').read()
        print(f'Open connections: {get_number_of_conns.strip()}')
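
A hypothetical addition (not in the original report) that closes the leaked ends by hand, placed inside the loop right after pool.join(); it relies on private attributes (_inqueue/_outqueue on the pool, _reader/_writer on each queue) assumed to match CPython's multiprocessing layout:

# may break across versions, since these attributes are private
for q in (pool._inqueue, pool._outqueue):
    q._reader.close()
    q._writer.close()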

iGeophysix avatar Jul 24 '19 08:07 iGeophysix

Any movement on this? Seems like an easy fix

kiansheik avatar Nov 12 '19 23:11 kiansheik

I don't think there was. We have a lot on our plate and don't get to address every issue.

If you have a solution, please provide us with a PR.

thedrow avatar Nov 13 '19 12:11 thedrow

We are also experiencing this problem, which prevents us from using the provided multiprocessing implementation and forces us to use a third-party one (multiprocess). Could you please investigate this behaviour?

clanzett avatar Nov 19 '19 06:11 clanzett

I found a POSSIBLE solution: use the maxtasksperchild= argument, as suggested in https://stackoverflow.com/questions/21485319/high-memory-usage-using-python-multiprocessing

Can somebody improve it quickly? Or can I?
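
A minimal sketch of that workaround, assuming billiard.Pool accepts the same maxtasksperchild argument as multiprocessing.Pool. Note that it only recycles workers (and their pipes) after N tasks; it does not release the 4 _inqueue/_outqueue pipes described above:

import billiard as mp

def f(a):
    return a + 1

if __name__ == '__main__':
    # each worker process is replaced after 100 tasks,
    # which closes that worker's pipes
    pool = mp.Pool(processes=4, maxtasksperchild=100)
    result = pool.map(f, range(500))
    pool.close()
    pool.join()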

lukaspistelak avatar Nov 30 '19 12:11 lukaspistelak

We're not using multiprocessing at all.

thedrow avatar Dec 01 '19 12:12 thedrow

I took a quick look and it does seem that the Pipe object should be closed if __del__ is ever called. See 👇 https://github.com/celery/billiard/blob/0391a4bfe121345f2961b2475e198399aaebceee/billiard/connection.py#L155-L159

Unfortunately, we seem to have a cyclic reference somewhere which prevents __del__ from being called.
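
A toy demonstration (not billiard code) of how a cycle delays __del__: the refcount never reaches zero, so the object is only finalized once the cyclic collector runs:

import gc

class Conn:
    def __del__(self):
        print('closed')

c = Conn()
c.self_ref = c  # reference cycle: c points to itself
del c           # refcount stays above zero, so __del__ does not run here
gc.collect()    # the cyclic collector breaks the cycle and prints 'closed'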

thedrow avatar Dec 01 '19 13:12 thedrow

Oh, this gets worse because terminating the pool requires the outqueue. The best I can do here is to close those pipes when terminating.

With the fix I currently have you'll have to do the following:

import billiard as mp
import os

def f(a):
    return a + 1


if __name__ == '__main__':
    pid = os.getpid()
    get_number_of_conns = os.popen(f'ls -l /proc/{pid}/fd | wc -l').read()
    print(f'At the beginning we only have {get_number_of_conns.strip()} connections')
    for i in range(10):
        # creating a pool
        pool = mp.Pool(mp.cpu_count() - 1)
        # running a job
        result = pool.map(f, range(5))
        # closing the pool and joining, then terminating;
        # with the fix, terminate() also closes the _inqueue/_outqueue pipes
        pool.close()
        pool.join()
        pool.terminate()
        # getting number of open connections
        get_number_of_conns = os.popen(f'ls -l /proc/{pid}/fd | wc -l').read()
        print(f'Open connections: {get_number_of_conns.strip()}')

    get_number_of_conns = os.popen(f'ls -l /proc/{pid}/fd | wc -l').read()
    print(f'At the end we only have {get_number_of_conns.strip()} connections')

thedrow avatar Dec 01 '19 13:12 thedrow

I pushed that fix.

thedrow avatar Dec 01 '19 14:12 thedrow

@clanzett Can you check my partial fix?

thedrow avatar Dec 05 '19 09:12 thedrow

I am on it. Unfortunately, this could take a while because the error only happens after approx. 1 hour of runtime in our jobs. I will keep you posted. Anyway, thanks for the fix!!

clanzett avatar Dec 05 '19 10:12 clanzett

@thedrow: OK, your change seems to fix the problem. Great job and many thanks!! Is there an ETA for when master will find its way into a new Python package?

clanzett avatar Dec 05 '19 14:12 clanzett

I'm working on a Celery release, so very soon.

thedrow avatar Dec 05 '19 15:12 thedrow

Is there any progress on this one? I'm stuck with the issue too. Thanks!

UPD: I mixed it up a bit; I'm actually stuck with #217, a similar one.

xenohunter avatar Jan 29 '20 11:01 xenohunter

I have the same issue. Performance was very good when I was using the pool with the with keyword, but when I switched to terminating the pool explicitly as suggested, performance went down.
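
For comparison, the context-manager form mentioned above; in CPython's multiprocessing, Pool.__exit__ calls terminate(), and billiard is assumed to behave the same, so the two styles are not equivalent:

import billiard as mp

def f(a):
    return a + 1

if __name__ == '__main__':
    with mp.Pool(4) as pool:  # __exit__ calls terminate()
        result = pool.map(f, range(5))
    pool.join()  # join is still allowed after terminate()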

RahulMudupuri avatar Feb 24 '21 13:02 RahulMudupuri

I currently don't have a better fix for this problem. Feel free to suggest one.

thedrow avatar Feb 28 '21 10:02 thedrow

@celery/core-developers This is a big problem for us. If anyone has the time to investigate, please do.

thedrow avatar Feb 28 '21 10:02 thedrow