qutip
qutip copied to clipboard
parallel map raise subprocess error.
Description
parallel_map
did not propagate error of sub-procceses:
def f(i):
raise Exception
return i
out = qutip.solver.parallel.parallel_map(f, range(5))
would print:
ERROR:concurrent.futures:exception calling callback for <Future at 0x7fcdfb9b8e50 state=finished raised Exception>
concurrent.futures.process._RemoteTraceback:
...
raise Exception
but it would return normally with out = [None, None, None, None, None]
.
This mean we can't use a try
block to catch error coming from sub-process and only when the output is used, an error is risen.
This change parallel_map
so it raise the first error it encounter.
Coverage decreased (-0.02%) to 71.564% when pulling c472ccfc34eeeb07628b2ec7e7e849220fe7ac51 on Ericgig:parallel.raise into 50be470d9d23981c7cbfed0b04a702a23d440eb8 on qutip:dev.major.
I made a custom exception class that depend on all the raised exceptions classes so we can catch it.
It feel somewhat hacky, ExceptionGroup
would work a lot better, but we won't be able to use it for a few years.
I was imagining something much simpler for the exception class. There is a backport of the exceptiongroup stuff to older Pythons that we could use if we wanted to: https://pypi.org/project/exceptiongroup/.
It's another dependency, but it will mean that on Python 3.11 people could start using ExceptionGroup and except*
immediately if they wanted to. Users on Python 3.10 and below would have to catch ExceptionGroup or use the backport's ugly with catch(...)
mechanism.
Then, many years from now when Python 3.10 is obsolete, we could drop the backport entirely.
Personally I'm in favour of the simpler ParallelMapException
that doesn't inherit from all the other exceptions. Then when 3.11 is out we can make ParallelMapException
inherit from ExceptionGroup
and 3.11 users will be able to use except*
. People supporting 3.10 and below will have to use except ParallelMapException
, but that will work just as well on 3.11.
I don't want the way to catch error depend on the map, serial_map
and loky_map
both return the error of the first failing iteration and stop there. Having parallel_map
failing differently will mean that every time we want to catch an error in mcsolve
, etc. we will have to catch both the expected error and the custom ParallelMapException
. It also hide the kind of error and prevent us to act on it.
I still think that stopping the map and returning the first error is the best options. It will ensure that all map functions fail the same way. In our case, the raised exceptions should be consistent so we should not need run all iteration to see every ways it can fail. And we don't waste computing time to compute result that will not be returned.
Another options would to not raise any error and return the finished iteration. For mcsolve, we could have a result returning 99% of the asked trajectories with the note that 1% failed because of overwork from the solver. Trowing away all the work done for one error feels bad. However I can see this not being safe, so it would be a solver options for the user to set.
Hmm. It would be nice to get the results that succeeded. What if we made a single exception class that was used by all the functions and which contained all the errors and all the results? Then everyone can catch the same exception class and decide what to do with all the results and errors?
I will try something in that direction, but add it as an user option, I can see case were failing trajectories are those getting a specific collapse or other ways that skews the results, so failing fast would be better. Maybe having the progress-bar show failed jobs would be good if we add this, (not in this PR).
Sounds good. Perhaps we can just have fail_fast=True
default option on all the functions.