Incompatibility with joblib
joblib with its loky backend is widely used for distributed computing. Without going into the details, the loky backend has a number of advantages over the multiprocessing backend.
Loguru does not seem to work with joblib/loky because of a pickling issue. I tried to apply the hints from the documentation without success.
I was wondering whether it's possible to make loguru compatible with joblib:
import sys
from joblib import Parallel, delayed
from loguru import logger
def func_async():
    logger.info("Hello")
# logger.remove()
# logger.add(sys.stderr, enqueue=True)
args = [delayed(func_async)() for _ in range(100)]
p = Parallel(n_jobs=16, backend="loky")
results = p(args)
The error:
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/hadim/local/conda/envs/circus/lib/python3.8/site-packages/joblib/externals/loky/backend/queues.py", line 153, in _feed
obj_ = dumps(obj, reducers=reducers)
File "/home/hadim/local/conda/envs/circus/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 271, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "/home/hadim/local/conda/envs/circus/lib/python3.8/site-packages/joblib/externals/loky/backend/reduction.py", line 264, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "/home/hadim/local/conda/envs/circus/lib/python3.8/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 563, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.lock' object
"""
The above exception was the direct cause of the following exception:
PicklingError Traceback (most recent call last)
/tmp/ipykernel_2094805/2576140850.py in <module>
13
14 p = Parallel(n_jobs=16, backend="loky")
---> 15 results = p(args)
~/local/conda/envs/circus/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
1052
1053 with self._backend.retrieval_context():
-> 1054 self.retrieve()
1055 # Make sure that we get a last message telling us we are done
1056 elapsed_time = time.time() - self._start_time
~/local/conda/envs/circus/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
931 try:
932 if getattr(self._backend, 'supports_timeout', False):
--> 933 self._output.extend(job.get(timeout=self.timeout))
934 else:
935 self._output.extend(job.get())
~/local/conda/envs/circus/lib/python3.8/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
540 AsyncResults.get from multiprocessing."""
541 try:
--> 542 return future.result(timeout=timeout)
543 except CfTimeoutError as e:
544 raise TimeoutError from e
~/local/conda/envs/circus/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
442 raise CancelledError()
443 elif self._state == FINISHED:
--> 444 return self.__get_result()
445 else:
446 raise TimeoutError()
~/local/conda/envs/circus/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
387 if self._exception:
388 try:
--> 389 raise self._exception
390 finally:
391 # Break a reference cycle with the exception in self._exception
PicklingError: Could not pickle the task to send it to the workers.
Hi @hadim.
Which version of Loguru are you using?
I tested it on my computer and it works fine (Linux, Python 3.9.2, joblib 1.0.1, loguru 0.5.3).
My setup is the same as yours except I have Python 3.8.10.
I think I found the issue. I was testing within a Jupyter notebook, while executing the snippet from a script works as expected.
That being said, I am a bit surprised, because I still see the error in my production scripts, which are all executed in a regular terminal (no notebook). Those are too complex to reduce to a reproducible case, which is why I created the small snippet above.
Can you think of any corner case that would make it fail with joblib/loky?
So, I was able to reproduce the issue using a Jupyter notebook.
I can't tell if this is related to the problem you're facing in production.
This happens due to an internal Lock used by sys.stderr, I guess. Here is a reproducible example without involving Loguru:
import sys
from joblib import Parallel, delayed
output = sys.stderr
def func_async():
    output.write("Test")
args = [delayed(func_async)() for _ in range(10)]
p = Parallel(n_jobs=16, backend="loky")
results = p(args)
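As an aside, the _thread.lock named at the bottom of the traceback is unpicklable on its own, which is why any stream object that carries one (as the notebook's sys.stderr replacement apparently does) fails to serialize. A minimal illustration, independent of joblib:

import pickle
import threading

# A bare thread lock cannot be pickled; any object holding one fails the same way.
try:
    pickle.dumps(threading.Lock())
except TypeError as exc:
    print(exc)  # cannot pickle '_thread.lock' object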
The thing is that the logger is configured with a sys.stderr handler by default, which can't be pickled.
I see three possible workarounds:
- Call a setup function once the worker has been started, so that you can add the sys.stderr handler before running the job and thus avoid the need to pickle it. In this case, each worker will have its own handler, but shared resources won't be protected, which is not ideal. (A rough sketch of this appears a bit further below.)
- Replace the default handler with a picklable one. Wrapping the output with logger.add(lambda m: sys.stderr.write(m)) after calling logger.remove() seems to do the trick (see the sketch right after this list). Again, at worker initialization the handler will be deep-copied during pickling, so sys.stderr won't be protected from parallel access.
- Find a way to pass the logger and its handlers by making the workers inherit from it instead of pickling it (just like it's done with the args of Process).
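For reference, here is a minimal sketch of the second workaround; the task function and its arguments are illustrative, only the logger.remove() / logger.add(lambda ...) part comes from the suggestion above:

import sys
from joblib import Parallel, delayed
from loguru import logger

# Swap the default sys.stderr handler for a picklable lambda sink.
# Each worker ends up with its own copy of the sink, so writes to
# sys.stderr are not synchronized across processes.
logger.remove()
logger.add(lambda msg: sys.stderr.write(msg))

def func_async(i):
    logger.info("Hello from task {}", i)

results = Parallel(n_jobs=16, backend="loky")(
    delayed(func_async)(i) for i in range(100)
)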
I don't know the joblib API; maybe you can find a clean way to pass the handler by inheritance without pickling it?
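For comparison, a rough sketch of the first workaround (per-worker setup). The structure is illustrative, and whether a handler-less logger always pickles cleanly from a notebook is an assumption to verify:

import sys
from joblib import Parallel, delayed
from loguru import logger

# Parent process: drop the default sys.stderr sink so nothing unpicklable
# travels with the pickled tasks.
logger.remove()

def func_async(i):
    # Worker side: recreate the sys.stderr sink locally instead of pickling it.
    # Calling remove() first keeps a reused worker from stacking duplicate sinks;
    # a real setup would run this once per worker rather than once per task.
    logger.remove()
    logger.add(sys.stderr)
    logger.info("Hello from task {}", i)

results = Parallel(n_jobs=4, backend="loky")(
    delayed(func_async)(i) for i in range(20)
)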
Thanks for looking into it. This is very insightful. I don't see any clean way to fix it for now. Passing it as an argument will probably go through pickling too.
I'll post here if I find a workaround.
Though, it's very curious that this only happens with Jupyter. Maybe it's worth asking one of the developers.
Yup, I feel like my prod scripts have the same issue but so far I haven't been able to reproduce it.
@hadim have you been able to discern what in Jupyter makes this fail? (Or to reproduce it in your scripts?)
Not yet, but if you find something, please let me know.
I'm having the same issue in scripts (not Jupyter).