
`BaseRestartWorkChain`: Set different `max_iterations` for each `processhandler`

Open mbercx opened this issue 4 years ago • 5 comments

Is your feature request related to a problem? Please describe

Currently, the `BaseRestartWorkChain` lets you set the maximum number of iterations, i.e. the maximum number of times the work chain will run the underlying calculation job, via the `max_iterations` input. However, for some errors it would be useful to increase the number of times they are corrected (see for example this issue), while for others you might only want to attempt a correction once.

Describe the solution you'd like

It would be useful if you could set the max_iterations separately for each process_handler. The defaults could be set by providing a value to the process_handler decorator, e.g.:

    @process_handler(priority=580, max_iterations=10, exit_codes=[
        PwCalculation.exit_codes.ERROR_OUT_OF_WALLTIME,
    ])
    def handle_out_of_walltime(self, calculation):
        """Handle `ERROR_OUT_OF_WALLTIME` exit code: calculation shut down neatly and we can simply restart."""

The `max_iterations` input could then be either an `Int` (the current behaviour, overriding all defaults) or a `Dict` that maps each process handler to the maximum number of iterations you want for it. I suppose the keys here could be either the process handler method names or the exit codes that are being fixed.
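As a rough illustration of the bookkeeping this `Dict` form implies, here is a minimal sketch in plain Python. None of the names (`may_restart`, `handler_counts`, the handler names) exist in aiida-core; they are only placeholders for the proposed behaviour of a per-handler retry budget with a fallback default.

```python
# Hypothetical sketch of the proposed Dict-valued max_iterations input:
# each process handler name maps to its own retry budget.
max_iterations = {
    'handle_out_of_walltime': 10,
    'handle_electronic_convergence': 3,
}

handler_counts = {}  # how often each handler has been triggered so far


def may_restart(handler_name, default=5):
    """Count a trigger of `handler_name` and return whether another restart is allowed."""
    handler_counts[handler_name] = handler_counts.get(handler_name, 0) + 1
    return handler_counts[handler_name] <= max_iterations.get(handler_name, default)
```

A handler absent from the dictionary would simply fall back to the work chain's global default.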

Describe alternatives you've considered

I've also considered an idea inspired by the way custodian does things: make the process handlers classes that can be passed to the `BaseRestartWorkChain` as a `List` input. This way the error handling can be customised even further by the user: they can simply write a `ProcessHandler` class that is instantiated with `max_iterations`, and potentially other settings such as thresholds, as constructor arguments.
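To make the custodian-inspired alternative concrete, a sketch of what such handler classes might look like. Neither `ProcessHandler` nor `OutOfWalltimeHandler` exists in aiida-core; the interface and the string exit code are purely illustrative.

```python
# Hypothetical sketch of handlers as configurable classes (custodian-style).
class ProcessHandler:
    def __init__(self, max_iterations=5, **settings):
        self.max_iterations = max_iterations
        self.settings = settings  # e.g. user-chosen thresholds
        self.iterations = 0

    def handle(self, exit_code):
        raise NotImplementedError


class OutOfWalltimeHandler(ProcessHandler):
    def handle(self, exit_code):
        """Return True if this handler wants the calculation restarted."""
        if exit_code != 'ERROR_OUT_OF_WALLTIME':
            return False
        self.iterations += 1
        return self.iterations <= self.max_iterations
```

A work chain could then receive a `List` of such instances and consult each one in priority order after every failed calculation.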

mbercx avatar Feb 19 '21 15:02 mbercx

Pinging @sphuber and @giovannipizzi for comments. 🙃

mbercx avatar Feb 24 '21 19:02 mbercx

I like the idea of customising the number of max iterations; in fact, this might even solve one current issue (I think) where there is just a single total iteration count, independent of which handler has been triggered. However, how would this work? I.e. if I trigger handler 1 twice, then handler 2, then handler 1 again: is handler 1's counter back at 0 (because a different handler was called), or already at 2? Also, would it make sense for the counter to be resettable in the code (so it cannot live only in the decorator if one wants to allow setting its value, or at least it should be possible to reset it)? E.g. you might decide only at runtime whether to reset the counter to zero, depending on what other handlers have been called.

giovannipizzi avatar Feb 24 '21 21:02 giovannipizzi

I have the same reservations as @giovannipizzi voices. I don't think it makes sense to have a single max_iterations if it can be different depending on the case.

> Also, would it make sense for the counter to be resettable in the code (so it cannot live only in the decorator if one wants to allow setting its value, or at least it should be possible to reset it)? E.g. you might decide only at runtime whether to reset the counter to zero, depending on what other handlers have been called.

This is already possible. The current iteration is simply stored in the context under `self.ctx.iteration`, so it can be manipulated or reset in a process handler, since handlers have access to the context. I would rather "recommend" users do this than implement official support for multiple max iterations, since I simply don't see how that would be well defined.
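A minimal model of the mechanism described above: the counter lives in the mutable work chain context, so a handler can simply set it back to zero. Here `ctx` is a stand-in for the real AiiDA context object and `handle_recoverable_error` is a hypothetical handler body, not aiida-core code.

```python
from types import SimpleNamespace

# Stand-in for the work chain context: BaseRestartWorkChain stores the
# current iteration under self.ctx.iteration.
ctx = SimpleNamespace(iteration=4)


def handle_recoverable_error(ctx):
    """Hypothetical handler body: grant the work chain a fresh budget of restarts."""
    ctx.iteration = 0  # reset the shared counter stored in the context
```

In a real process handler, the same effect is achieved by assigning to `self.ctx.iteration` before returning.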

> I've also considered an idea inspired by the way custodian does things: make the process handlers classes that can be passed to the `BaseRestartWorkChain` as a `List` input. This way the error handling can be customised even further by the user: they can simply write a `ProcessHandler` class that is instantiated with `max_iterations`, and potentially other settings such as thresholds, as constructor arguments.

Just for some historical background: when moving the `BaseRestartWorkChain` from `aiida-quantumespresso` to `aiida-core` (a few months before you joined the group), we discussed extensively the possibility of "dynamically" attaching handlers through an input, which we ultimately, after lots of discussion, rejected. The reason is that the handlers form an important part of the logic of a workflow, and therefore of its provenance. This is currently captured to a certain extent because each `WorkflowNode` registers the version number of the plugin providing the `WorkChain` class, and with it the full definition can in principle be retrieved. If we were to allow arbitrary code to be injected by making process handlers dynamically attachable, we would destroy this provenance. This is why we made it compulsory to define process handlers as `WorkChain` methods, so that they are guaranteed to be part of the class. As a compromise, we added the possibility to disable handlers by default and to toggle their activity on the fly through an input, which guarantees that "full provenance" is maintained.

sphuber avatar Feb 24 '21 22:02 sphuber

This issue was also raised in a discussion with @superstar54 today, to be revisited!

mbercx avatar Apr 23 '25 12:04 mbercx

I also like this idea. Indeed, max_iterations can vary depending on the nature of the failure.

> If we start to allow arbitrary code injection by making process handlers dynamically injectable, we undermine provenance. That's why we've made it mandatory for handlers to be defined as `WorkChain` methods, to ensure they are part of the class definition.

Regarding provenance, I don’t think this poses a problem. If handlers are passed as inputs, we can still store the necessary metadata—such as the module path and version—to ensure reproducibility, just as we do with the WorkChain class itself.
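A sketch of the metadata that could be recorded for a handler passed as input. The `handler_metadata` helper is hypothetical; it only shows that a callable's module path, qualified name, and (when available) package version are straightforward to capture.

```python
import importlib


def handler_metadata(handler):
    """Hypothetical helper: record enough metadata to locate a handler's definition later."""
    module = importlib.import_module(handler.__module__)
    return {
        'module': handler.__module__,
        'name': handler.__qualname__,
        'version': getattr(module, '__version__', None),  # None if the module declares no version
    }
```

This mirrors what is already stored for the `WorkChain` class itself: the information needed to retrieve the exact code that ran.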

> However, how will this work? For instance, if handler 1 is triggered twice, then handler 2, then handler 1 again: does the counter for handler 1 reset (because a different handler was called), or is it already at 2?

A simple solution would be to introduce explicit dependencies between handlers. For example:

    {
      "name": "handler 2",
      "max_iterations": 10,
      "dependencies": ["handler 1"]
    }

In this setup, if handler 1 is triggered, the counter for handler 2 would be reset. This allows us to encode logic for resetting iteration counts based on handler relationships.
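The dependency rule above can be sketched in a few lines of plain Python. The `trigger` function and the handler registry are illustrative only, not proposed aiida-core API: triggering a handler increments its own counter and resets the counter of every handler that lists it as a dependency.

```python
# Hypothetical sketch of dependency-based counter resets between handlers.
handlers = {
    'handler 1': {'max_iterations': 5, 'dependencies': []},
    'handler 2': {'max_iterations': 10, 'dependencies': ['handler 1']},
}

counts = {name: 0 for name in handlers}


def trigger(name):
    """Count a trigger of `name`, resetting any handler that depends on it."""
    counts[name] += 1
    for other, spec in handlers.items():
        if name in spec['dependencies']:
            counts[other] = 0  # a dependency fired, so this handler starts fresh
    return counts[name] <= handlers[name]['max_iterations']
```

With this rule, two triggers of handler 2 followed by one of handler 1 leave handler 2's counter back at zero.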

superstar54 avatar Apr 23 '25 15:04 superstar54