qiskit-ibm-runtime Give the option to auto-retry failed jobs

What is the expected feature or enhancement?

Jobs sometimes fail because of temporary server or networking errors, and a retry would succeed. Unfortunately, for many iterative algorithms this means the user has to restart the entire workload from the beginning. For this reason QuantumInstance has built-in retry. Since we are encouraging people to use Qiskit Runtime / primitives instead of QuantumInstance, it'd be nice to have the same capability.

Acceptance criteria

Give users an option to specify the maximum number of retries (0 meaning no retry) when running a QRT job.
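To make the requested behavior concrete, here is a minimal sketch of a client-side retry wrapper. It is not part of qiskit-ibm-runtime: `run_with_retries`, `submit`, and `delay` are hypothetical names chosen for illustration, and `submit` stands for any zero-argument callable that submits a job and returns its result.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(submit: Callable[[], T], max_retries: int = 0, delay: float = 0.0) -> T:
    """Call submit() and retry up to max_retries times if it raises.

    max_retries=0 means a single attempt with no retry.
    All names here are hypothetical, not a qiskit-ibm-runtime API.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    attempt = 0
    while True:
        try:
            return submit()
        except Exception:
            # Out of retries: re-raise the last error to the caller.
            if attempt >= max_retries:
                raise
            attempt += 1
            time.sleep(delay)  # optional pause before the next attempt
```

With `max_retries=2`, a callable that fails twice and then succeeds returns normally on the third attempt; with `max_retries=0`, the first failure propagates unchanged.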

Jan 18 '23 01:01 jyu00

I am curious as to why this is a runtime issue and not an application issue. A well-written iterative algorithm would save its state between calls, so that if one job failed the routine could be restarted in place.
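The application-level approach described here could look roughly like the sketch below. It assumes the algorithm state is JSON-serializable and that each iteration is a single call to a `step` function; `run_iterations` and `checkpoint_path` are hypothetical names, not part of any Qiskit package.

```python
import json
import os


def run_iterations(step, initial_state, num_iterations, checkpoint_path):
    """Run step() repeatedly, saving state after every iteration.

    If a checkpoint file already exists, resume from it instead of
    starting over. Hypothetical helper for illustration only.
    """
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            saved = json.load(f)
        start, state = saved["iteration"], saved["state"]
    else:
        start, state = 0, initial_state

    for i in range(start, num_iterations):
        state = step(state)  # may raise if the underlying job fails
        # Persist progress only after the iteration succeeded.
        with open(checkpoint_path, "w") as f:
            json.dump({"iteration": i + 1, "state": state}, f)
    return state
```

If `step` raises partway through, calling `run_iterations` again with the same `checkpoint_path` continues from the last completed iteration rather than from the beginning.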

Jan 18 '23 15:01 nonhermitian

Two reasons. One is that it makes things easier for algorithm developers, so they don't each need to implement their own thing (unless they choose to), and we want algorithm developers to use our service. The other is that many of these failures are caused by the current instability of the runtime service. Auto-retry is just an attempt to mitigate that a bit.

Jan 18 '23 23:01 jyu00

@nonhermitian, I agree with @jyu00. We already implement several flavors of "retry-primitives" to run our algorithms; you can check out this issue for reference (@luciacuervovalor keeps it updated with the latest retry features). It would be totally fine to rely on these kinds of enhancements if runtime failures were exceptional, but in our current situation we need them in almost every experiment we run. At least until the service is stabilized, I think this would be a great feature for users and algorithm developers.

Jan 23 '23 09:01 ElePT

I'm interested in this as well. I have had some issues with VQE calculations returning "Internal server errors", and it would be good to have the job retry a few times to check whether the error was just a fluke before the Session deactivates.

Mar 04 '23 15:03 MarcoBarroca

I've been messing around trying to see if I can solve this.

I tried creating a custom Estimator that is a child class of the runtime Estimator and inserts retries in the _run() method. This doesn't seem to work: as soon as a job fails, all other tries immediately fail.

It seems like it would make more sense to add a max_job_retries option to Session(). I tried doing that with the same approach as above, creating a child CustomSession() class.

Unfortunately I get the following error:

Backend <__main__.CustomSession object at 0x29adc3c70> cannot be found in any hub/group/project for this account. <__main__.CustomSession object at 0x29adc3c70> is of type <class '__main__.CustomSession'> but should instead be initialized through the <QiskitRuntimeService>."

Are we not allowed to build custom Sessions? If not, then I guess we need to wait for this to be implemented.

Mar 07 '23 00:03 MarcoBarroca

Hey @MarcoBarroca! Please look at @ElePT's message above. There's a link to the code we are using for this, and hopefully it fixes your problem.

Mar 07 '23 08:03 luciacuervovalor

Didn't see you had an implementation already! Will try it out!

Mar 07 '23 13:03 MarcoBarroca

Can we increase the priority of this? I continue to see people posting problems where a failed job has stopped everything. It affects many algorithms, and it affects users in general when they use the primitives directly for a task with several or many steps. In general the primitive has the best knowledge of whether an error is recoverable. While it might be possible to signal that in the failure, this would require every primitive user to deal with it. Having recovery in one place makes the primitives more usable and consumable, and if the logic needs updating because of new sources of failure, it only has to be done in one place. That's why QuantumInstance had the equivalent function: it lived in one place, and individual algorithms and users did not need to handle it.
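As a rough illustration of keeping the "is this error recoverable?" decision in one place, the sketch below retries only errors that match a central list of transient markers. Everything in it is an assumption made for illustration: the marker strings, `is_recoverable`, and `run_with_selective_retry` are hypothetical and do not reflect how the runtime service actually reports errors.

```python
# Hypothetical list of substrings that mark an error as transient.
# Real services would need their own, carefully chosen criteria.
TRANSIENT_MARKERS = ("internal server error", "timeout", "connection")


def is_recoverable(error: Exception) -> bool:
    """Single place that decides whether an error is worth retrying."""
    message = str(error).lower()
    return any(marker in message for marker in TRANSIENT_MARKERS)


def run_with_selective_retry(submit, max_retries=3):
    """Retry submit() only when the failure looks transient."""
    for attempt in range(max_retries + 1):
        try:
            return submit()
        except Exception as err:
            # Give up on the last attempt or on non-recoverable errors.
            if attempt == max_retries or not is_recoverable(err):
                raise
```

When a new kind of transient failure appears, only `is_recoverable` needs to change; callers of `run_with_selective_retry` stay the same.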

Mar 24 '23 13:03 woodsp-ibm
