qutip icon indicating copy to clipboard operation
qutip copied to clipboard

parallel map raise subprocess error.

Open Ericgig opened this issue 2 years ago • 7 comments

Description parallel_map did not propagate error of sub-procceses:

def f(i):
    raise Exception
    return i

out = qutip.solver.parallel.parallel_map(f, range(5))

would print:

ERROR:concurrent.futures:exception calling callback for <Future at 0x7fcdfb9b8e50 state=finished raised Exception>
concurrent.futures.process._RemoteTraceback: 
...
raise Exception

but it would return normally with out = [None, None, None, None, None].

This mean we can't use a try block to catch error coming from sub-process and only when the output is used, an error is risen.

This change parallel_map so it raise the first error it encounter.

Ericgig avatar Aug 05 '22 18:08 Ericgig

Coverage Status

Coverage decreased (-0.02%) to 71.564% when pulling c472ccfc34eeeb07628b2ec7e7e849220fe7ac51 on Ericgig:parallel.raise into 50be470d9d23981c7cbfed0b04a702a23d440eb8 on qutip:dev.major.

coveralls avatar Aug 05 '22 19:08 coveralls

I made a custom exception class that depend on all the raised exceptions classes so we can catch it. It feel somewhat hacky, ExceptionGroup would work a lot better, but we won't be able to use it for a few years.

Ericgig avatar Aug 08 '22 20:08 Ericgig

I was imagining something much simpler for the exception class. There is a backport of the exceptiongroup stuff to older Pythons that we could use if we wanted to: https://pypi.org/project/exceptiongroup/.

It's another dependency, but it will mean that on Python 3.11 people could start using ExceptionGroup and except* immediately if they wanted to. Users on Python 3.10 and below would have to catch ExceptionGroup or use the backport's ugly with catch(...) mechanism.

Then, many years from now when Python 3.10 is obsolete, we could drop the backport entirely.

Personally I'm in favour of the simpler ParallelMapException that doesn't inherit from all the other exceptions. Then when 3.11 is out we can make ParallelMapException inherit from ExceptionGroup and 3.11 users will be able to use except*. People supporting 3.10 and below will have to use except ParallelMapException, but that will work just as well on 3.11.

hodgestar avatar Aug 09 '22 13:08 hodgestar

I don't want the way to catch error depend on the map, serial_map and loky_map both return the error of the first failing iteration and stop there. Having parallel_map failing differently will mean that every time we want to catch an error in mcsolve, etc. we will have to catch both the expected error and the custom ParallelMapException. It also hide the kind of error and prevent us to act on it.

I still think that stopping the map and returning the first error is the best options. It will ensure that all map functions fail the same way. In our case, the raised exceptions should be consistent so we should not need run all iteration to see every ways it can fail. And we don't waste computing time to compute result that will not be returned.

Another options would to not raise any error and return the finished iteration. For mcsolve, we could have a result returning 99% of the asked trajectories with the note that 1% failed because of overwork from the solver. Trowing away all the work done for one error feels bad. However I can see this not being safe, so it would be a solver options for the user to set.

Ericgig avatar Aug 09 '22 13:08 Ericgig

Hmm. It would be nice to get the results that succeeded. What if we made a single exception class that was used by all the functions and which contained all the errors and all the results? Then everyone can catch the same exception class and decide what to do with all the results and errors?

hodgestar avatar Aug 09 '22 13:08 hodgestar

I will try something in that direction, but add it as an user option, I can see case were failing trajectories are those getting a specific collapse or other ways that skews the results, so failing fast would be better. Maybe having the progress-bar show failed jobs would be good if we add this, (not in this PR).

Ericgig avatar Aug 09 '22 14:08 Ericgig

Sounds good. Perhaps we can just have fail_fast=True default option on all the functions.

hodgestar avatar Aug 09 '22 14:08 hodgestar