joblib opens too many files while running.
I have also asked a slightly different version of this question on Stack Overflow.
A simplified case:
# lsof.py, Python 3.11, joblib 1.4 (also tested with 1.4.2)
from joblib import Parallel, delayed
import time
import sys
import pandas as pd


class Tasker:
    def __init__(self):
        self.data = pd.Series([])

    def run(self):
        time.sleep(10)
        return 1.0


def get_num_of_opened_files() -> tuple[int, int]:
    # Returns (all lsof entries, lsof entries ending in "so"), counted system-wide.
    from subprocess import run
    return int(run('lsof | wc -l', shell=True, capture_output=True, text=True).stdout.strip()), \
        int(run('lsof | grep \\.so$ | wc -l', shell=True, capture_output=True, text=True).stdout.strip())


tasker = Tasker()
f0, s0 = get_num_of_opened_files()
xs = Parallel(n_jobs=32, return_as='generator')(delayed(tasker.run)() for _ in range(32))
time.sleep(2)  # give the workers time to start before the second count
f1, s1 = get_num_of_opened_files()
print(f'Opened files: before {f0}, after {f1}, delta all {f1 - f0}, delta so: {s1 - s0}', flush=True)
print(sum(xs))
Running the script above prints something like:
>> python lsof.py
>> Opened files: before 13924, after 77428, delta all 63504, delta so: 40012
joblib opened about 60,000 files!
And if I run 10 programs like this, joblib warns:
UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
**Or it even raises an error (my server has 2 TB of free memory):**
A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
Thanks for the snippet. Unfortunately, I cannot reproduce the problem: I ran it 10 times and did not see any increase in .so files:
Opened files: before 1055247, after 1056537, delta all 1290, delta so: 0
32.0
Opened files: before 1058787, after 1064277, delta all 5490, delta so: 0
32.0
Opened files: before 1072965, after 1069416, delta all -3549, delta so: 0
32.0
Opened files: before 1061122, after 1064641, delta all 3519, delta so: 0
32.0
Opened files: before 1091746, after 1093073, delta all 1327, delta so: 0
32.0
Opened files: before 1099325, after 1103940, delta all 4615, delta so: 0
32.0
Opened files: before 1098851, after 1112009, delta all 13158, delta so: 0
32.0
Opened files: before 1096286, after 1089980, delta all -6306, delta so: 0
32.0
Opened files: before 1086131, after 1081671, delta all -4460, delta so: 0
32.0
Opened files: before 1065784, after 1066550, delta all 766, delta so: 0
32.0
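Because `lsof | wc -l` counts entries for every process on the machine, the totals above also move with unrelated activity (several deltas are negative). A narrower check is to look at one process at a time. The sketch below is not part of joblib. It assumes Linux with a readable `/proc/<pid>/maps`, and `parse_mapped_libraries` and `mapped_libraries` are hypothetical helper names. It lists the distinct `.so` files mapped into a single process.

```python
import os
from typing import Optional


def parse_mapped_libraries(maps_text: str) -> set[str]:
    """Extract distinct shared-library paths from /proc/<pid>/maps text."""
    libraries = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5].strip()
        name = os.path.basename(path)
        # Match both "libfoo.so" and versioned names such as "libfoo.so.6".
        if name.endswith(".so") or ".so." in name:
            libraries.add(path)
    return libraries


def mapped_libraries(pid: int) -> Optional[set[str]]:
    """Return the libraries mapped into `pid`, or None if /proc is unavailable."""
    try:
        with open(f"/proc/{pid}/maps") as handle:
            return parse_mapped_libraries(handle.read())
    except OSError:
        # Assumption: non-Linux system, exited process, or missing permissions.
        return None


libs = mapped_libraries(os.getpid())
print("unavailable" if libs is None else f"{len(libs)} distinct .so files mapped")
```

Calling `mapped_libraries` with the PID of each worker process (for example, the children reported by `pgrep -P <parent pid>`) would show whether every worker maps the same set of libraries, which a system-wide count cannot distinguish from genuinely new files.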
I would suggest you try to figure out what these additional .so files from lsof are and whether they are related to joblib in any way ...
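One way to start on that, sketched here with hypothetical helper names (`so_files_by_process`, `collect_field_output`), is to group the `.so` entries by path and see how many processes hold each one. The sketch assumes `lsof` is on `PATH` and uses its `-F pn` field output, where a line starting with `p` carries a PID and a line starting with `n` carries a file name.

```python
import subprocess
from collections import defaultdict


def so_files_by_process(field_output: str) -> dict[str, set[int]]:
    """Map each path ending in ".so" to the set of PIDs that have it open.

    Expects text in the format produced by `lsof -F pn`.
    """
    owners: dict[str, set[int]] = defaultdict(set)
    pid = None
    for line in field_output.splitlines():
        if not line:
            continue
        tag, value = line[0], line[1:]
        if tag == "p":
            pid = int(value)  # start of a new process section
        elif tag == "n" and pid is not None and value.endswith(".so"):
            owners[value].add(pid)
    return dict(owners)


def collect_field_output() -> str:
    """Run `lsof -F pn` and return its stdout (requires lsof on PATH)."""
    return subprocess.run(["lsof", "-F", "pn"], capture_output=True, text=True).stdout


# Small hand-written sample in the same format, so the sketch runs without lsof.
sample = "p100\nf3\nn/usr/lib/libfoo.so\np200\nf5\nn/usr/lib/libfoo.so\nf6\nn/tmp/data.txt\n"
for path, pids in so_files_by_process(sample).items():
    print(f"{len(pids):4d}  {path}")
```

Replacing `sample` with `collect_field_output()` and sorting by the number of PIDs would show which libraries account for most entries, and whether they belong to the Python environment or to something else on the machine.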
Without more info, I am going to wild-guess that this is somehow caused by a combination of different factors like hardware/OS/python environment ... figuring out the combination of factors would probably take time and then the question would be whether joblib can do something about it.
I am going to close this one, but feel free to reopen if you have more insights into the situation and think joblib could do something about it.