
AttributeError: 'NoneType' object has no attribute 'fileno'

Open pseudotensor opened this issue 7 years ago • 20 comments

Tried switching our code from concurrent.futures to loky and can't get past the error below.

Even if I do what this says:

https://stackoverflow.com/questions/45126368/nonetype-object-has-no-attribute-fileno

it doesn't help.

2018-11-01 13:56:28,766 A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
2018-11-01 13:56:28,769 loky.process_executor._RemoteTraceback: 
'''
Traceback (most recent call last):
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/loky/process_executor.py", line 383, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/jon/h2oai/h2oaicore/systemutils.py", line 41, in <module>
    import datatable as dt
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/datatable/__init__.py", line 8, in <module>
    from .dt_append import rbind, cbind
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/datatable/dt_append.py", line 8, in <module>
    from datatable.utils.misc import plural_form as plural
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/datatable/utils/misc.py", line 8, in <module>
    from .typechecks import TImportError
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/datatable/utils/typechecks.py", line 14, in <module>
    from datatable.utils.terminal import term
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/datatable/utils/terminal.py", line 62, in <module>
    term = MyTerminal()
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/datatable/utils/terminal.py", line 20, in __init__
    super().__init__()
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/blessed/terminal.py", line 171, in __init__
    self._keyboard_fd = sys.__stdin__.fileno()
AttributeError: 'NoneType' object has no attribute 'fileno'
'''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jon/h2oai/h2oaicore/systemutils.py", line 2001, in call_subprocess_onetask
    kwargs=kwargs, out=ret_list, justcount=False, proctitle=proctitle)
  File "/home/jon/h2oai/h2oaicore/systemutils.py", line 1513, in submit_tryget
    self.initpool(proctitle=proctitle, proctitle_sub=proctitle_sub)
  File "/home/jon/h2oai/h2oaicore/systemutils.py", line 1147, in initpool
    self.submit_dummy(None, dummy_function, (), {}, proctitle=proctitle, proctitle_sub=proctitle_sub)
  File "/home/jon/h2oai/h2oaicore/systemutils.py", line 1408, in submit_dummy
    result.result()
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

2018-11-01 13:56:28,769 Failed call_subprocesss_onetask again for func=<function get_have_lightgbm_subprocess at 0x7fc495a98840> after exception=A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
_________________________________________________________________________________________________________ ERROR collecting tests/test_system_alt/test_gpu_lock_check.py __________________________________________________________________________________________________________
loky.process_executor._RemoteTraceback: 
'''
Traceback (most recent call last):
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/loky/process_executor.py", line 383, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/jon/h2oai/h2oaicore/systemutils.py", line 41, in <module>
    import datatable as dt
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/datatable/__init__.py", line 8, in <module>
    from .dt_append import rbind, cbind
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/datatable/dt_append.py", line 8, in <module>
    from datatable.utils.misc import plural_form as plural
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/datatable/utils/misc.py", line 8, in <module>
    from .typechecks import TImportError
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/datatable/utils/typechecks.py", line 14, in <module>
    from datatable.utils.terminal import term
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/datatable/utils/terminal.py", line 62, in <module>
    term = MyTerminal()
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/datatable/utils/terminal.py", line 20, in __init__
    super().__init__()
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/blessed/terminal.py", line 171, in __init__
    self._keyboard_fd = sys.__stdin__.fileno()
AttributeError: 'NoneType' object has no attribute 'fileno'

A simple attempt to repro:

import datatable as dt

import os
from time import sleep
from loky import get_reusable_executor


def say_hello(k):
    pid = os.getpid()
    print("Hello from {} with arg {}".format(pid, k))
    sleep(.01)
    return pid


# Create an executor with 4 worker processes, that will
# automatically shutdown after idling for 2s
executor = get_reusable_executor(max_workers=4, timeout=2)

res = executor.submit(say_hello, 1)
print("Got results:", res.result())

results = executor.map(say_hello, range(50))
n_workers = len(set(results))
print("Number of used processes:", n_workers)
assert n_workers == 4

Doesn't fail. So it's unclear what is going on.

pseudotensor avatar Nov 01 '18 21:11 pseudotensor

This is the only change I made to my code:


use_loky = True

if use_loky:
    from concurrent.futures import ThreadPoolExecutor as pool_thread
    #from loky import get_reusable_executor as pool_fork
    from loky import ProcessPoolExecutor as pool_fork
    from loky import TimeoutError as pool_timeout
    from loky import BrokenProcessPool as pool_broken
    from concurrent.futures import as_completed as pool_as_completed
    sys.__stdin__ = sys.stdin
    sys.__stdout__ = sys.stdout
else:
    from concurrent.futures import ThreadPoolExecutor as pool_thread
    from concurrent.futures import ProcessPoolExecutor as pool_fork
    from concurrent.futures import TimeoutError as pool_timeout
    from concurrent.futures.process import BrokenProcessPool as pool_broken
    from concurrent.futures import as_completed as pool_as_completed

pseudotensor avatar Nov 01 '18 21:11 pseudotensor

I can reproduce this with loky and the following script:

from loky import get_reusable_executor


def import_dt(k):
    import datatable as dt


# Create an executor with 2 worker processes
executor = get_reusable_executor(max_workers=2)
executor.submit(import_dt, 1).result()

It was not failing in your previous example because the function say_hello did not use the module datatable, so it was not imported in the worker.
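
The traceback in the original report shows the import being triggered while the task is un-serialized (`_ForkingPickler.loads` leads into `systemutils.py`, which imports datatable). The sketch below illustrates that general mechanism with the standard library only: a function from an importable module is pickled by reference, so loading it in a fresh interpreter imports that module. The `colorsys` module and the child script are stand-ins chosen for illustration; they are not part of the issue.

```python
import colorsys
import pickle
import subprocess
import sys

# Functions defined in an importable module are pickled by reference:
# the payload holds the module and function names, not the code.
payload = pickle.dumps(colorsys.rgb_to_hsv)

# A fresh interpreter stands in for a worker process. Unpickling there
# imports the module, which runs all of its top-level code.
child_code = (
    "import pickle, sys\n"
    "data = sys.stdin.buffer.read()\n"
    "before = 'colorsys' in sys.modules\n"
    "pickle.loads(data)\n"
    "after = 'colorsys' in sys.modules\n"
    "print(before, after)\n"
)
result = subprocess.run(
    [sys.executable, "-c", child_code],
    input=payload,
    stdout=subprocess.PIPE,
    check=True,
)
before, after = result.stdout.decode().split()
print("imported before unpickling:", before, "after:", after)
```

Any module-level side effect of the imported module, such as creating a terminal object, therefore runs inside the worker as soon as the task is loaded.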

The issue here seems to be linked to the MyTerminal object, instantiated when importing datatable, which expects stdin to implement a fileno method. However, when the new process is spawned, we set stdin to None. It should be fixed on datatable master (see here).
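
As an illustration of the failure mode, library code that needs a file descriptor can guard against a missing stream. This is a minimal sketch under that assumption; `safe_fileno` is a hypothetical helper and is not necessarily how datatable or blessed addressed the problem.

```python
import io
import sys


def safe_fileno(stream):
    """Return stream.fileno(), or None when the stream is missing or has no descriptor."""
    if stream is None:
        return None
    try:
        return stream.fileno()
    except (AttributeError, io.UnsupportedOperation, ValueError):
        # In-memory or closed streams have no usable descriptor.
        return None


# In a worker whose stdin was set to None, this yields None instead of
# raising AttributeError: 'NoneType' object has no attribute 'fileno'.
keyboard_fd = safe_fileno(sys.__stdin__)
print(keyboard_fd)
```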

tomMoral avatar Nov 02 '18 11:11 tomMoral

Can I ask why you do something different than concurrent.futures in this regard? It seems to break compatibility. Thanks!

pseudotensor avatar Nov 02 '18 20:11 pseudotensor

Now that datatable has fixed that blessed use of stdin/stdout, I still get:

Exception in thread QueueManagerThread:
Traceback (most recent call last):
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/loky/process_executor.py", line 635, in _queue_management_worker
    thread_wakeup.clear()
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/loky/process_executor.py", line 157, in clear
    while self._reader.poll():
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/multiprocessing/connection.py", line 255, in poll
    self._check_closed()
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/multiprocessing/connection.py", line 136, in _check_closed
    raise OSError("handle is closed")
OSError: handle is closed


pseudotensor avatar Nov 02 '18 21:11 pseudotensor

Also get:

Exception in thread QueueManagerThread:
Traceback (most recent call last):
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/loky/process_executor.py", line 635, in _queue_management_worker
    thread_wakeup.clear()
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/loky/process_executor.py", line 158, in clear
    self._reader.recv_bytes()
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
OSError: [Errno 9] Bad file descriptor


pseudotensor avatar Nov 02 '18 21:11 pseudotensor

The discrepancy with concurrent.futures is because we do not start the processes with fork by default. Starting the processes with fork breaks the POSIX standard and can lead to unexpected behavior, in particular with third-party libraries such as openmp.
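
For comparison, the standard library can also start workers without fork. The sketch below uses the stdlib "spawn" start method as the nearest standard analogue; it is an assumption made for illustration and is not loky's own context. `worker_pid` is a hypothetical helper, and `mp_context` requires Python 3.7 or later.

```python
import multiprocessing
import os
from concurrent.futures import ProcessPoolExecutor


def worker_pid(start_method):
    """Run os.getpid in one worker started with the given start method."""
    context = multiprocessing.get_context(start_method)
    with ProcessPoolExecutor(max_workers=1, mp_context=context) as executor:
        return executor.submit(os.getpid).result()


if __name__ == "__main__":
    # "spawn" starts a fresh interpreter instead of copying the parent with fork,
    # so the worker does not inherit the parent's in-memory state.
    print("parent:", os.getpid(), "worker:", worker_pid("spawn"))
```

The `if __name__ == "__main__":` guard matters with non-fork start methods, because the worker may re-import the main module.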

Your OSError is really weird. These are standard multiprocessing objects, so they should be working. Could you share the code that led to these exceptions? Also, do you know when in your code this error happens? Is it during the initialization of the Executor or during the shutdown?

tomMoral avatar Nov 03 '18 15:11 tomMoral

After a while, I was able to narrow it down to this:

self.p.shutdown(wait=self.shutdown_wait)

I was calling this with self.p assigned to your pool, just as I would with concurrent.futures. However, for some reason this fails in the way shown above for your pool.
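
A reproducer for this could start from a skeleton like the one below. It is only a sketch: ThreadPoolExecutor is used as a stand-in so the snippet runs with the standard library alone, on the assumption that the executor under test accepts the same submit/result/shutdown calls. `run_then_shutdown` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor


def run_then_shutdown(executor, wait=True):
    """Submit one task, collect its result, then shut the executor down."""
    future = executor.submit(pow, 2, 10)
    value = future.result()  # make sure the task finished before shutdown
    executor.shutdown(wait=wait)
    return value


executor = ThreadPoolExecutor(max_workers=1)
print(run_then_shutdown(executor))

try:
    executor.submit(pow, 2, 10)
except RuntimeError as exc:
    # The standard executors reject new work after shutdown.
    print("rejected:", exc)
```

Swapping the stand-in for the pool that fails, and varying `wait`, would show whether the error comes from the shutdown call itself or from something done to the workers afterwards.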

pseudotensor avatar Nov 04 '18 06:11 pseudotensor

The other thing I notice is using your pool leads to very poor performance if I use:

from loky import ProcessPoolExecutor as pool_fork

If I just make that change above, and run a test that does the fork etc. over and over 100 times, your pool takes about 10X longer per fork. Is this expected?

But if I use your reusable version it seems ok:

from loky import get_reusable_executor as pool_fork

But this scares me. Why is the loky version of ProcessPoolExecutor so extremely slow?
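
To put numbers on this kind of comparison, a small timing harness can separate the cost of creating an executor per task from the cost of reusing one. This is a sketch only: the two timing functions and `make_executor` are hypothetical, and ThreadPoolExecutor is a stand-in factory to be replaced by the executor class under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_fresh_executors(make_executor, n_rounds):
    """Create, use, and shut down a new executor for every task."""
    start = time.perf_counter()
    for i in range(n_rounds):
        executor = make_executor()
        executor.submit(abs, -i).result()
        executor.shutdown(wait=True)
    return time.perf_counter() - start


def time_single_executor(make_executor, n_rounds):
    """Create one executor and reuse it for every task."""
    start = time.perf_counter()
    executor = make_executor()
    for i in range(n_rounds):
        executor.submit(abs, -i).result()
    executor.shutdown(wait=True)
    return time.perf_counter() - start


def make_executor():
    # Stand-in factory; swap in the executor class under test.
    return ThreadPoolExecutor(max_workers=1)


fresh = time_fresh_executors(make_executor, 20)
single = time_single_executor(make_executor, 20)
print("fresh executors: {:.4f}s, single executor: {:.4f}s".format(fresh, single))
```

If most of the gap disappears in the single-executor case, the slowdown comes from process start-up rather than from running the tasks.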

pseudotensor avatar Nov 04 '18 06:11 pseudotensor

Also, if I try to use your reusable pool class for my normal full code, then I get yet another problem:

  File "/home/jon/h2oai/h2oaicore/systemutils.py", line 1410, in submit_dummy
    result.result()
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGTERM(-15)}

This is just for submitting a dummy do-nothing function so the pool workers get initialized and I know their pids. concurrent.futures has no issues.

Your ProcessPoolExecutor doesn't have this failure, but as I said that one is way too slow.

pseudotensor avatar Nov 04 '18 07:11 pseudotensor

Ok, I narrowed that down to me sending a SIGTERM and then SIGKILL once the task is done after the shutdown in order to ensure the workers really terminate.
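
The exit code in the error message matches this explanation. On POSIX systems a process killed by a signal reports the negated signal number, which the sketch below shows with a plain subprocess standing in for a worker (an assumption for illustration; it does not involve loky).

```python
import signal
import subprocess
import sys

# Start a child that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Terminate it with SIGTERM, then collect the exit status.
child.send_signal(signal.SIGTERM)
returncode = child.wait()

# On POSIX a negative return code -N means "killed by signal N".
print("killed by SIGTERM:", returncode == -signal.SIGTERM)
```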

So this all seems related to the reusable pool not being able to be fully shut down. How do I do that?

pseudotensor avatar Nov 04 '18 07:11 pseudotensor

10358 jon       20   0   96656  22308   7748 R  14.9  0.1   0:00.45 /home/jon/.pyenv/versions/3.6.4/bin/python -m loky.backend.popen_loky_posix --process-name LokyProcess-3 --pipe 17 --semaphore 9

The other thing you seem to do is mess with proctitles and command lines. This makes loky break compatibility with anything that relies upon proctitles or command lines. I use setproctitle package to change the process title/command line and I rely upon that heavily.

pseudotensor avatar Nov 04 '18 07:11 pseudotensor

I get yet another problem when trying to again use the full code I have. I get the below for numerous tests.

Yet some other problem that never occurs with concurrent.futures.

[gw13] FAILED tests/test_models/test_stacking.py::test_rulefit_regression 
___________________________________________________________________________________________________________________________________________________________________ test_rulefit_regression ____________________________________________________________________________________________________________________________________________________________________
[gw13] linux -- Python 3.6.4 /home/jon/.pyenv/versions/3.6.4/bin/python
loky.process_executor._RemoteTraceback: 
"""
loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/loky/backend/queues.py", line 150, in _feed
    obj_ = dumps(obj, reducers=reducers)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/loky/backend/reduction.py", line 230, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/loky/backend/reduction.py", line 223, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 284, in dump
    return Pickler.dump(self, obj)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 409, in dump
    self.save(obj)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 634, in save_reduce
    save(state)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 736, in save_tuple
    save(element)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 703, in save_instancemethod
    self.save_reduce(types.MethodType, (obj.__func__, obj.__self__), obj=obj)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 610, in save_reduce
    save(args)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 736, in save_tuple
    save(element)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 419, in save_function
    self.save_function_tuple(obj)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 579, in save_function_tuple
    save(state)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/pickle.py", line 496, in save
    rv = reduce(self.proto)
TypeError: 'NoneType' object is not callable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/loky/process_executor.py", line 410, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/jon/h2oai/tests/test_models/test_stacking.py", line 915, in run_subprocess
    model.fit(X=train_X.copy(), y=train_y)
  File "/home/jon/h2oai/h2oaicore/models.py", line 6549, in fit
    early_stopping_limit=early_stopping_limit, verbose=verbose, **kwargs)
  File "/home/jon/h2oai/h2oaicore/models.py", line 2556, in fit
    kwargs=mykwargs, out=res)
  File "/home/jon/h2oai/h2oaicore/systemutils.py", line 1528, in submit_tryget
    overloadcore_factor=self.overloadcore_factor)
  File "/home/jon/h2oai/h2oaicore/systemutils.py", line 1217, in try_get_internal
    res = future[wfut].result(timeout=sleeptouse)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
_pickle.PicklingError: Could not pickle the task to send it to the workers.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jon/h2oai/tests/test_models/test_stacking.py", line 325, in test_rulefit_regression
    num_classes=1, random_state=1234), "Pressure9am", 1, test_name=kwargs.pop('test_name', None))
  File "/home/jon/h2oai/tests/test_models/test_stacking.py", line 884, in run
    call_subprocess_onetask(run_subprocess, (model, target, num_classes, test_name), {})
  File "/home/jon/h2oai/h2oaicore/systemutils.py", line 2007, in call_subprocess_onetask
    p.finish()
  File "/home/jon/h2oai/h2oaicore/systemutils.py", line 1585, in finish
    timeout=timeouttouse)
  File "/home/jon/h2oai/h2oaicore/systemutils.py", line 1217, in try_get_internal
    res = future[wfut].result(timeout=sleeptouse)
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/home/jon/.pyenv/versions/3.6.4/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
_pickle.PicklingError: Could not pickle the task to send it to the workers.

pseudotensor avatar Nov 04 '18 09:11 pseudotensor

Even the reusable pool is very slow compared to concurrent.futures for many tests that work.

pseudotensor avatar Nov 04 '18 09:11 pseudotensor

I also get these kinds of errors:


platform linux -- Python 3.6.4
pytest==3.5.1
py==1.6.0
pluggy==0.6.0
rootdir: /home/jon/h2oai
inifile: pytest.ini
plugins: xdist-1.22.2, tldr-0.1.5, timeout-1.2.1, repeat-0.7.0, instafail-0.4.0, forked-0.2, cov-2.5.1
cachedir: .pytest_cache
pydev debugger: process 31859 is connecting

/home/jon/.pyenv/versions/3.6.4/lib/python3.6/site-packages/loky/process_executor.py:698: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/pydevd.py", line 1307, in stoptrace
    get_frame(), also_add_to_passed_frame=True, overwrite_prev_trace=True, dispatch_func=lambda *args:None)
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/pydevd.py", line 1038, in exiting
    sys.stdout.flush()
ValueError: I/O operation on closed file.
pydev debugger: process 32119 is connecting

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/pydevd.py", line 1307, in stoptrace
    get_frame(), also_add_to_passed_frame=True, overwrite_prev_trace=True, dispatch_func=lambda *args:None)
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/pydevd.py", line 1038, in exiting
    sys.stdout.flush()
ValueError: I/O operation on closed file.

pseudotensor avatar Nov 04 '18 09:11 pseudotensor

With the pickling problem, it seems cloudpickle is unable to pickle methods while pickle can. This is despite cloudpickle claiming improvements to pickle.
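
For reference, this is the baseline behaviour of the standard pickle module with bound methods on Python 3. The sketch uses a `Counter` from the standard library as a stand-in object; it does not reproduce the cloudpickle failure reported here.

```python
import pickle
from collections import Counter

counts = Counter("aab")
bound_method = counts.most_common  # a bound method of a picklable instance

# The standard pickle module serializes the instance plus the method name.
restored = pickle.loads(pickle.dumps(bound_method))
print(restored(1))
```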

Pausing in my code before the failure, I see:

image

So maybe your choice to use cloudpickle was not a good one?

pseudotensor avatar Nov 04 '18 09:11 pseudotensor

Overall, so far, still these issues:

  1. normal shutdown call causes loky to fail
  2. loky is much slower, especially its version of ProcessPoolExecutor
  3. "ValueError: I/O operation on closed file" errors
  4. pickling failures by cloudpickle.
  5. Unexpected Proctitle/command line modifications

#1 may not be a deal breaker, but #2-#5 are definitely.

pseudotensor avatar Nov 04 '18 09:11 pseudotensor

First of all, what is your goal in using loky? It is unclear, and you seem to be confused about the purpose of loky.

The goal of loky is to have a cross-platform, cross-version implementation of the ProcessPoolExecutor, which works in most cases without needing to tweak anything. We made some design choices which can lead to decreased performance in some cases, but it depends on your needs.

Normal shutdown call causes loky to fail

In all our tests, normal shutdown works properly. So it is probably not a normal shutdown... Could you provide a minimal reproducing example (MCVE) so we can investigate this failure?

loky is much slower, especially its version of ProcessPoolExecutor

Yes indeed. The implementation of ProcessPoolExecutor in concurrent.futures starts its processes with fork, which can break the POSIX convention and lead to bad interactions with third-party libraries (such as openmp). The default in loky is to start processes with the loky context, which is typically slower. To mitigate this issue, we provide get_reusable_executor, which avoids restarting the processes over and over. So it is supposed to be slower if you use ProcessPoolExecutor, as you are restarting processes every time, and starting processes can be up to 100x slower with our implementation than with fork. Note that you should get similar performance with ProcessPoolExecutor by setting the context='fork' argument, but this is not advised as it can lead to frozen processes with third-party libraries.

Also, the reusable executor might be slower because of the serialization. As we rely on cloudpickle, it could be slower than pickle. It depends on your use case and I cannot be more specific without a proper MCVE.

"ValueError: I/O operation on closed file" errors

This is caused by a bad interaction with PyCharm, which seems to try to get some traces from the workers. It seems to work with concurrent.futures, but only with fork. Not sure we can do anything about it, as it seems to be caused by an uncaught error in PyCharm at exit of the workers.

pickling failures by cloudpickle.

You can use set_loky_pickler('pickle') to use pickle in loky if you don't need cloudpickle functionality such as pickling interactively defined functions and you find that pickle can serialize more objects. Also, you could open an issue on the cloudpickle repo for this use case. But I suspect that, once again, you don't see this error with concurrent.futures because it relies on fork, and using pickle will also break in this case. This is probably due to the fact that you use xgboost, which has some internal C objects that are not serializable.

Unexpected Proctitle/command line modifications

The new processes are started with a different command line, so we cannot change it without messing with OS-specific implementation details. Keeping the same proctitle is a "feature" that is only a side effect of the fork implementation and is not guaranteed, for instance on Windows. As our goal is to be cross-platform, we do not change the way the processes are named.


Overall, you seem to rely heavily on the implementation details of fork, which are not guaranteed in loky. The goal of loky is to move away from fork, since it causes several critical errors with scientific computation libraries and it is not cross-platform.

Also, if you want a more specific answer, please provide an MCVE. The errors you report just include the traceback and are often linked to some dependencies, so it is hard to parse without having an actual reproducing script.

tomMoral avatar Nov 04 '18 19:11 tomMoral

No, the xgboost stuff is just one example, and I showed how it is fully picklable already by the image above

pseudotensor avatar Nov 04 '18 19:11 pseudotensor

I am not relying upon fork, just the behavior of concurrent.futures. You give the impression that loky only improves upon it, but I see you have broken compatibility for various reasons or have bugs that require more testing. I could help you debug, but I don't have much time, as the payoff may be low if it never really works properly.

Definitely the goal of a more reliable concurrent.futures with proper handling of threaded apps that rely upon OpenMP is what I want. XGBoost, lightgbm, and datatable all use OpenMP. So no, I am not confused.

pseudotensor avatar Nov 04 '18 19:11 pseudotensor

So maybe your choice to use cloudpickle was not a good one?

Please report a minimal example on the cloudpickle issue tracker so that we can fix it (if it's picklable using the pickle implementation from the standard library).
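
One way to narrow a failing object down to such a minimal example is to try serializing its attributes one at a time. The sketch below does this with the standard pickle module; `unpicklable_attributes` is a hypothetical helper, and the `SimpleNamespace` with a lock is a stand-in for a real model object. The same loop could call another serializer's `dumps` instead.

```python
import pickle
import threading
from types import SimpleNamespace


def unpicklable_attributes(obj):
    """Return {attribute name: error} for attributes that pickle rejects."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as exc:  # pickle raises several exception types
            failures[name] = exc
    return failures


# Hypothetical stand-in for a model object; the lock cannot be pickled.
model = SimpleNamespace(n_estimators=10, lock=threading.Lock())
print(sorted(unpicklable_attributes(model)))
```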

ogrisel avatar Nov 06 '18 08:11 ogrisel