sklearn-compiledtrees
OSError: [Errno 24] Too many open files when RandomForestRegressor has 140 estimators
Here's a loop that fits and compiles trees, stepping up the number of estimators each time:
from sklearn import datasets, ensemble
import compiledtrees

data = datasets.load_boston()
X, y = data.data, data.target

for i in range(20, 250, 20):
    print(i)
    model = ensemble.RandomForestRegressor(n_jobs=4, n_estimators=i)
    model.fit(X, y)
    model = compiledtrees.CompiledRegressionPredictor(model)
    h = model.predict(X)
It crashes on 140:
$ python test_script.py
20
40
60
80
100
120
140
Traceback (most recent call last):
  File "/Users/ollieglass/code/test_compiled_trees/lib/python3.5/site-packages/joblib/_parallel_backends.py", line 344, in __call__
    return self.func(*args, **kwargs)
  File "/Users/ollieglass/code/test_compiled_trees/lib/python3.5/site-packages/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/Users/ollieglass/code/test_compiled_trees/lib/python3.5/site-packages/joblib/parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/Users/ollieglass/code/test_compiled_trees/lib/python3.5/site-packages/compiledtrees/code_gen.py", line 173, in _compile
    _call([CXX_COMPILER, cpp_f, "-c", "-fPIC", "-o", o_f.name, "-O3", "-pipe"])
  File "/Users/ollieglass/code/test_compiled_trees/lib/python3.5/site-packages/compiledtrees/code_gen.py", line 179, in _call
    shell=True, stdout=DEVNULL, stderr=DEVNULL)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 576, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 557, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 947, in __init__
    restore_signals, start_new_session)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 1454, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files
This is on macOS.
I haven't looked into workarounds - perhaps I can increase the number of files that can be open at once. But if there's a way to limit the open files in the library, that would probably be better.
I had a look at code_gen.py. Perhaps the CodeGenerator class could build a string instead of opening and writing to a file. When the .file method is called, it could write to a file, close it and return the name.
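To make the suggestion concrete, here is a rough sketch of what I mean. It is not the library's actual class: the name BufferedCodeGenerator and its write method are made up for illustration, and only the idea (buffer the source in memory, then write and close a file when the file method is called) comes from the paragraph above.

```python
import io
import os
import tempfile


class BufferedCodeGenerator(object):
    """Hypothetical sketch: keep generated source in memory, touch disk only on demand."""

    def __init__(self):
        self._buffer = io.StringIO()
        self._indent = 0

    def write(self, line):
        # Append one line of generated code at the current indentation level.
        self._buffer.write(u"    " * self._indent + line + u"\n")

    def file(self):
        """Write the buffered source to a temporary .cpp file, close it, return its path."""
        fd, path = tempfile.mkstemp(prefix="compiledtrees_", suffix=".cpp")
        with os.fdopen(fd, "w") as f:
            f.write(self._buffer.getvalue())
        # The file is closed here, so it no longer counts against the open-file limit.
        return path  # the caller is responsible for deleting the file
```

With this shape, no file descriptor is held while the code is being generated; the cost is that the whole source sits in memory until file() is called.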
On Linux and macOS you simply have to issue ulimit -n 2048. By design, compiling trees consumes 2 * n_trees + 2 open files.
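By that formula, 140 estimators need 282 open files and 120 need 242, which would explain the crash at 140 if the soft limit was 256 (a common macOS default, though I have not checked it for this machine). As an alternative to ulimit, the soft limit can also be raised from inside the script on Unix-like systems. The following is only a sketch using the standard resource module; the helper names and the headroom parameter are made up for illustration and are not part of compiledtrees.

```python
import resource  # standard library, Unix-only (not available on Windows)


def required_open_files(n_trees):
    """Open files needed to compile a forest, per the 2 * n_trees + 2 figure."""
    return 2 * n_trees + 2


def ensure_open_file_limit(n_trees, headroom=64):
    """Raise the soft RLIMIT_NOFILE if it is too low; return the resulting soft limit.

    headroom is an assumed allowance for files the process already has open.
    """
    needed = required_open_files(n_trees) + headroom
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY and soft < needed:
        # The soft limit can never exceed the hard limit, so cap the request.
        target = needed if hard == resource.RLIM_INFINITY else min(needed, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        soft = target
    return soft
```

Calling ensure_open_file_limit(model.n_estimators) before compiling would then play the role of the ulimit command, but only for the current process. Very large requests may still be rejected by the operating system.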
On Windows there is no way to raise the limit globally, but there is an internal solution, which you have to include in your script:
import platform

if platform.system() == 'Windows':
    import win32file
    win32file._setmaxstdio(2048)
I used to write one cpp file, but it didn't work for large forests, especially if you have lots of data and allow for full growth. For my example this translates to 500 .cpp files of over 100MB each (50GB+ of RAM). Keeping all those files in StringIO objects would probably work, although the .o files would still be there, so we would go down to n_trees + 2 open files (assuming we successfully close/delete the files after compiling them to .o).
To sum up: I don't regard this as an issue, and overcoming it would probably cost a lot of RAM in return, which is ultimately a deal-breaker (at least for me).
I see what you mean. I've fixed the problem for myself; like you say, it isn't hard.
I am concerned that users could be put off by this. How about an informative error for them, like this?
import sys
import tempfile

class CodeGenerator(object):
    def __init__(self):
        try:
            self._file = tempfile.NamedTemporaryFile(prefix='compiledtrees_', suffix='.cpp', delete=True)
        except OSError as e:
            if e.errno == 24:  # EMFILE: too many open files
                print("Too many open files. Increase limit to 2 * n_trees + 2 "
                      "(unix / mac: ulimit -n [limit], windows: http://bit.ly/2fAKnz0)", file=sys.stderr)
            raise  # re-raise the original exception, whatever its errno
        self._indent = 0
edit: added if
That might be a good solution if e.errno == 24 holds across platforms. If I remember correctly, on Windows I got some kind of "Permission Denied" errors, which were terrible to debug...
Although I fear we will catch some false positives.
Also, a unit test for that would be useful (see hints on changing limits on all platforms).
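One possible starting point for such a test on Unix-like systems is sketched below. It does not touch compiledtrees at all: it only shows how a test could temporarily lower the open-file limit and observe which errno comes back. The helper name exhaust_file_descriptors is made up, the resource module does not exist on Windows, and the "Permission Denied" behaviour mentioned above would need a separate check there.

```python
import errno
import os
import resource  # standard library, Unix-only


def exhaust_file_descriptors(limit=64):
    """Lower the soft open-file limit, open pipes until one fails, return the errno.

    Returns None if no failure occurred. The original limit is always restored.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (limit, hard))
    fds = []
    try:
        # Each pipe uses two descriptors, so this loop must exceed the limit.
        for _ in range(limit):
            fds.extend(os.pipe())
        return None
    except OSError as e:
        return e.errno
    finally:
        for fd in fds:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
```

A test could then assert that exhaust_file_descriptors() returns errno.EMFILE, and a real test for the library would wrap the compile step in the same lower-then-restore pattern.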