
joblib opens too many files while running.

Open esse-byte opened this issue 1 year ago • 0 comments

I have also asked this question on Stack Overflow, with slight differences.

A simplified case:

## lsof.py, Python 3.11, joblib 1.4 (also tested with 1.4.2)
from joblib import Parallel, delayed
import time
import sys
import pandas as pd


class Tasker:
    def __init__(self):
        self.data = pd.Series([])

    def run(self):
        time.sleep(10)
        return 1.0

def get_num_of_opened_files() -> tuple[int, int]:
    from subprocess import run
    return int(run('lsof | wc -l', shell=True, capture_output=True, text=True).stdout.strip()), \
           int(run('lsof | grep \\.so$ | wc -l', shell=True, capture_output=True, text=True).stdout.strip())


tasker = Tasker()
f0, s0 = get_num_of_opened_files()
xs = Parallel(n_jobs=32, return_as='generator')(delayed(tasker.run)() for _ in range(32))
time.sleep(2)
f1, s1 = get_num_of_opened_files()

print(f'Opened files: before {f0}, after {f1}, delta all {f1 - f0}, delta so: {s1 - s0}', flush=True)
print(sum(xs))

Running the script above prints something like:

>> python lsof.py
>> Opened files: before 13924, after 77428, delta all 63504, delta so: 40012

joblib opened about 60,000 files! And if I run 10 programs like this, joblib warns: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.

**Or it even raises an error (my server has 2 TB of free memory):** A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

esse-byte, May 06 '24 06:05

Thanks for the snippet. Unfortunately, I cannot reproduce this. I ran it 10 times and I don't see any increase in .so files:

Opened files: before 1055247, after 1056537, delta all 1290, delta so: 0
32.0
Opened files: before 1058787, after 1064277, delta all 5490, delta so: 0
32.0
Opened files: before 1072965, after 1069416, delta all -3549, delta so: 0
32.0
Opened files: before 1061122, after 1064641, delta all 3519, delta so: 0
32.0
Opened files: before 1091746, after 1093073, delta all 1327, delta so: 0
32.0
Opened files: before 1099325, after 1103940, delta all 4615, delta so: 0
32.0
Opened files: before 1098851, after 1112009, delta all 13158, delta so: 0
32.0
Opened files: before 1096286, after 1089980, delta all -6306, delta so: 0
32.0
Opened files: before 1086131, after 1081671, delta all -4460, delta so: 0
32.0
Opened files: before 1065784, after 1066550, delta all 766, delta so: 0
32.0

I would suggest you try to figure out what these additional .so files from lsof are and whether they are related to joblib in any way ...
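
One possible way to dig into this, as a rough sketch: `lsof` has a field-output mode (`-F`), and it reports memory-mapped files such as loaded shared libraries with `mem` in the FD field; those entries do not consume a file descriptor. Assuming `lsof -F pcfn` is available on the machine, the helpers below group the `.so` entries by process, which might show whether they belong to the joblib worker processes and whether they are mapped libraries or real descriptors. The names `count_shared_objects` and `lsof_field_output` are made up for this illustration and are not part of joblib.

```python
from collections import Counter
import subprocess


def count_shared_objects(field_output: str) -> Counter:
    """Count .so entries per (pid, command, kind) in `lsof -F pcfn` output.

    kind is "mapped" for memory-mapped files (lsof FD field "mem") and
    "descriptor" for every other FD value. Like the grep in the original
    script, this only matches names ending in ".so" (not "libfoo.so.1").
    """
    counts: Counter = Counter()
    pid, command, fd = "", "", ""
    for line in field_output.splitlines():
        if not line:
            continue
        tag, value = line[0], line[1:]
        if tag == "p":  # a "p" line starts a new process set
            pid, command, fd = value, "", ""
        elif tag == "c":  # command name of the current process
            command = value
        elif tag == "f":  # FD field of the file entry that follows
            fd = value
        elif tag == "n" and value.endswith(".so"):
            kind = "mapped" if fd == "mem" else "descriptor"
            counts[(pid, command, kind)] += 1
    return counts


def lsof_field_output() -> str:
    """Run lsof in field-output mode (assumes lsof is installed and on PATH)."""
    result = subprocess.run(
        ["lsof", "-F", "pcfn"], capture_output=True, text=True, check=False
    )
    return result.stdout
```

Calling `count_shared_objects(lsof_field_output()).most_common(20)` before and after starting the `Parallel` call would list the processes that account for most of the difference.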

Without more info, I am going to wild-guess that this is somehow caused by a combination of different factors like hardware/OS/python environment ... figuring out the combination of factors would probably take time and then the question would be whether joblib can do something about it.
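
To help narrow down that combination of factors, a small report of the environment could be attached to the issue. The sketch below is only a suggestion: `describe_environment` is a hypothetical helper name, the package list is an assumption based on the imports in the snippet above, and the file descriptor limit is read only where the standard `resource` module exists (Unix-like systems).

```python
import os
import platform
import sys
from importlib import metadata


def describe_environment(packages=("joblib", "pandas", "numpy")) -> dict:
    """Collect basic facts about the machine and Python environment."""
    info = {
        "platform": platform.platform(),
        "python": sys.version.split()[0],
        "cpu_count": os.cpu_count(),
    }
    # Installed versions of the packages involved (None if not installed).
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None
    # Soft and hard limits on open file descriptors for this process.
    try:
        import resource

        info["nofile_limit"] = resource.getrlimit(resource.RLIMIT_NOFILE)
    except ImportError:
        info["nofile_limit"] = None
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```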

I am going to close this one, but feel free to reopen if you have more insights into the situation and think joblib could do something about it.

lesteve, Feb 19 '25 08:02