
Incompatibility with joblib

Open hadim opened this issue 4 years ago • 11 comments

joblib with its loky backend is widely used for distributed computing. Without going into the details, the loky backend has many advantages over the multiprocessing backend.

Loguru does not seem to work with joblib/loky because of a pickling issue. I tried to apply the hints from the doc without success.

I was wondering whether it's possible to make loguru compatible with joblib:

import sys
from joblib import Parallel, delayed
from loguru import logger


def func_async():
    logger.info("Hello")

# logger.remove()
# logger.add(sys.stderr, enqueue=True)
    
args = [delayed(func_async)() for _ in range(100)]

p = Parallel(n_jobs=16, backend="loky")
results = p(args)

The error:

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/hadim/local/conda/envs/circus/lib/python3.8/site-packages/joblib/externals/loky/backend/queues.py", line 153, in _feed
    obj_ = dumps(obj, reducers=reducers)
  File "/home/hadim/local/conda/envs/circus/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 271, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "/home/hadim/local/conda/envs/circus/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 264, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "/home/hadim/local/conda/envs/circus/lib/python3.8/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 563, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.lock' object
"""

The above exception was the direct cause of the following exception:

PicklingError                             Traceback (most recent call last)
/tmp/ipykernel_2094805/2576140850.py in <module>
     13 
     14 p = Parallel(n_jobs=16, backend="loky")
---> 15 results = p(args)

~/local/conda/envs/circus/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
   1052 
   1053             with self._backend.retrieval_context():
-> 1054                 self.retrieve()
   1055             # Make sure that we get a last message telling us we are done
   1056             elapsed_time = time.time() - self._start_time

~/local/conda/envs/circus/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
    931             try:
    932                 if getattr(self._backend, 'supports_timeout', False):
--> 933                     self._output.extend(job.get(timeout=self.timeout))
    934                 else:
    935                     self._output.extend(job.get())

~/local/conda/envs/circus/lib/python3.8/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

~/local/conda/envs/circus/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
    442                     raise CancelledError()
    443                 elif self._state == FINISHED:
--> 444                     return self.__get_result()
    445                 else:
    446                     raise TimeoutError()

~/local/conda/envs/circus/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
    387         if self._exception:
    388             try:
--> 389                 raise self._exception
    390             finally:
    391                 # Break a reference cycle with the exception in self._exception

PicklingError: Could not pickle the task to send it to the workers.

hadim avatar Sep 03 '21 03:09 hadim

Hi @hadim.

Which version of Loguru are you using?

I tested it on my computer and it works fine (Linux, Python 3.9.2, joblib 1.0.1, loguru 0.5.3).

Delgan avatar Sep 03 '21 10:09 Delgan

My setup is the same as yours except I have Python 3.8.10.

I think I found the issue. I was testing within a Jupyter notebook, while executing the snippet from a script works as expected.

That being said, I am a bit surprised because I still see the error in my production scripts, which are all executed in a regular terminal (no notebook). Those are too complex to reproduce, which is why I created the small snippet.

Can you think of any corner case that would make it fail with joblib/loky?

hadim avatar Sep 03 '21 12:09 hadim

So, I was able to reproduce the issue using a Jupyter notebook.

I can't tell if this is related to the problem you're facing in production.

I guess this happens due to an internal lock used by sys.stderr. Here is a reproducible example that does not involve Loguru:

import sys
from joblib import Parallel, delayed

output = sys.stderr

def func_async():
    output.write("Test")
    
args = [delayed(func_async)() for _ in range(10)]

p = Parallel(n_jobs=16, backend="loky")
results = p(args)

The thing is that the logger is configured with the sys.stderr handler by default, which can't be pickled.
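To make the root cause concrete, here is a minimal stdlib-only sketch (no joblib or Loguru involved) of the same failure: the thread lock that typically guards a stream handler cannot go through pickle at all.

```python
import pickle
import threading

# A stream handler internally holds a lock to serialize writes; a bare
# lock is enough to reproduce the pickling error (illustrative only).
lock = threading.Lock()

try:
    pickle.dumps(lock)
except TypeError as exc:
    print(exc)  # e.g. cannot pickle '_thread.lock' object
```

Any object graph that reaches such a lock (like a handler wrapping sys.stderr) fails the same way when loky tries to ship it to a worker.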

I see three possible workarounds:

  • Call a setup function once the worker has been started, so that you can add the sys.stderr handler before running the job and thus avoid the need to pickle it. In this case, each worker will have its own handler, but shared resources won't be protected, which is not ideal.
  • Replace the default handler with a picklable one. Wrapping the output with logger.add(lambda m: sys.stderr.write(m)) after calling logger.remove() seems to do the trick. Again, at worker initialization, the handler will be deep-copied during pickling, so sys.stderr won't be protected from parallel access.
  • Find a way to pass the logger and its handlers by making the workers inherit it instead of pickling it (just like it's done with the args of Process).

I don't know the Joblib API; maybe you can find a clean way to pass the handler by inheritance, without pickling?
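The second workaround relies on how serialization treats callables versus streams: a named, module-level function pickles by reference (essentially just its module and qualified name), while the stream object itself cannot be serialized at all. Here is a stdlib-only sketch of that difference, assuming plain pickle semantics (cloudpickle, which loky uses, additionally serializes lambdas by value):

```python
import pickle
import sys

def write_stderr(message):
    # sys.stderr is looked up at call time, so the function object
    # itself carries no unpicklable state.
    sys.stderr.write(message)

# The wrapper function pickles by reference...
payload = pickle.dumps(write_stderr)
print(type(payload))  # <class 'bytes'>

# ...whereas the stream it writes to does not.
try:
    pickle.dumps(sys.stderr)
except TypeError as exc:
    print(exc)  # e.g. cannot pickle '_io.TextIOWrapper' object
```

This is why a callable sink survives the trip to the worker while the default stderr handler does not; the trade-off, as noted above, is that each worker ends up with its own copy writing to sys.stderr unsynchronized.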

Delgan avatar Sep 05 '21 16:09 Delgan

Thanks for looking into it. This is very insightful. I don't see any clean way to fix it for now. Passing it as an argument will probably trigger pickling too.

I'll post here if I find a workaround.

hadim avatar Sep 05 '21 17:09 hadim

Though, it's very curious that this only happens with Jupyter. Maybe it's worth asking one of the developers.

Delgan avatar Sep 05 '21 17:09 Delgan

> Though, it's very curious that this only happens with Jupyter. Maybe it's worth asking one of the developers.

Yup, I feel like my prod scripts have the same issue but so far I haven't been able to reproduce it.

hadim avatar Sep 05 '21 17:09 hadim

@hadim have you been able to discern what in Jupyter makes this fail? (Or to reproduce it in your scripts?)

ludaavics avatar Mar 03 '22 14:03 ludaavics

Not yet but if you find something please let me know.

hadim avatar Mar 03 '22 14:03 hadim

I'm having the same issue in scripts (not Jupyter).

jmrichardson avatar Apr 08 '22 16:04 jmrichardson