MONAI
Add interrupt/terminate option to Trainers & Evaluators
Is your feature request related to a problem? Please describe. It should be possible to abort/finalize a running Trainer by calling an API (rather than Ctrl+C). This would be helpful when the Trainer needs to be executed remotely, such as in federated learning (FL) scenarios.
Describe the solution you'd like Add abort() and finalize() functions to the Trainer class (or potentially its base class). Note that finalize() should terminate the training completely, while abort() should allow training to later continue from where it was aborted by calling run() again.
For example, an Ignite-based Trainer supporting abort() and finalize() calls could be implemented as follows (currently used in MONAI-FL's MonaiAlgo class; private repo - contact me if you need access):
```python
def abort(self):
    self.trainer.terminate()
    # save the current iteration for the next round
    setattr(self.trainer.state, "dataloader_iter", self.trainer._dataloader_iter)
    if self.trainer.state.iteration % self.trainer.state.epoch_length == 0:
        # if the current iteration ends an epoch, manually trigger the EPOCH_COMPLETED event
        self.trainer._fire_event(Events.EPOCH_COMPLETED)

def finalize(self):
    self.trainer.terminate()
```
Describe alternatives you've considered: n/a
Additional context: n/a
This functionality should also be supported by evaluation/inference workflows.
Consider calling them interrupt() and terminate() to better align with the Ignite naming convention.
FYI, interrupt() is necessary to support iteration-based aggregation workflows in FL.
Hi @vfdev-5 ,
I think this feature request is similar to the pause / resume logic in the Ignite engine.
What do you think?
Thanks.
Thanks @holgerroth and @Nic-Ma for bringing this topic for the discussion.
This is one of the desired features we are waiting for.
The interrupt calls should be supported by Ignite trainer for MONAI apps.
Considering complete training or evaluation cycles on remote machines in FL environments, the following interrupt calls could be handy:
- trainer_pause --> to pause the current training iteration
- trainer_resume --> to resume from the last paused training iteration
- trainer_abort --> to terminate the current iteration and discard any updates to the model_state_dict
- evaluator_abort --> to terminate the current evaluation and discard any computed metrics
NVIDIA, MONAI, and Ignite developers can suggest better and more compliant naming conventions for these functionalities.
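The proposed calls above can be sketched with a small, purely illustrative toy class (all names here, e.g. `ToyTrainer`, are hypothetical and not part of the MONAI or Ignite API): `pause()` keeps all state so a later `run()` resumes, while `abort()` restores a snapshot of the weights taken when the run started, discarding this run's updates.

```python
import copy

class ToyTrainer:
    """Hypothetical trainer sketching the proposed pause/resume/abort semantics.

    All names are illustrative, not MONAI or Ignite API.
    """

    def __init__(self, weights):
        self.weights = dict(weights)   # stands in for model_state_dict
        self._snapshot = None          # weights captured when a fresh run starts
        self._next_iter = 0            # iteration a paused run will resume from
        self._pause_requested = False

    def pause(self):
        # trainer_pause: stop after the current iteration, keep all state
        self._pause_requested = True

    def abort(self):
        # trainer_abort: stop and discard any updates made during this run
        self._pause_requested = True
        if self._snapshot is not None:
            self.weights = self._snapshot
        self._next_iter = 0

    def run(self, data, on_iteration=None):
        # trainer_resume is simply run() again: it picks up at self._next_iter
        self._pause_requested = False
        if self._next_iter == 0:
            self._snapshot = copy.deepcopy(self.weights)
        for i in range(self._next_iter, len(data)):
            self.weights["w"] += data[i]  # stand-in for one training step
            self._next_iter = i + 1
            if on_iteration is not None:
                on_iteration(self, i)     # lets the demo request pause/abort
            if self._pause_requested:
                return "paused"
        self._next_iter = 0
        return "completed"

# pause mid-run, then resume from the same iteration
t = ToyTrainer({"w": 0})
t.run([1, 2, 3, 4], on_iteration=lambda tr, i: tr.pause() if i == 1 else None)
assert t.weights == {"w": 3}   # two steps applied, state kept
t.run([1, 2, 3, 4])            # resumes at iteration 2
assert t.weights == {"w": 10}

# abort mid-run: updates of this run are discarded
t2 = ToyTrainer({"w": 0})
t2.run([5, 7], on_iteration=lambda tr, i: tr.abort() if i == 0 else None)
assert t2.weights == {"w": 0}
```

In a real remote/FL setting, `pause()` and `abort()` would be invoked from another thread or an RPC handler rather than a per-iteration callback; the callback here only keeps the demo single-threaded.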
Hi everyone , thanks for pinging.
Yes, an interrupt feature may be nice to have in Ignite. Today, it is possible to resume training from a stopped trainer (under certain assumptions); I'll give an example below.
As for pausing and resuming the current training iteration, this can be a bit tricky regarding the input dataloader. Basically, we have to store the iterator and make sure that resuming continues iterating over new batches and not from the beginning...
Let me prioritize this feature on the Ignite side.
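The trickiness around the dataloader can be illustrated with a plain Python iterator (a toy sketch; `run_steps` is a hypothetical helper, not Ignite internals): if the live iterator is kept across the interruption, resuming continues with new batches, whereas recreating the iterator restarts from the beginning.

```python
def run_steps(batch_iter, n_steps):
    """Consume up to n_steps batches from a live iterator (stand-in for training)."""
    seen = []
    for _ in range(n_steps):
        try:
            seen.append(next(batch_iter))
        except StopIteration:
            break
    return seen

data = [10, 20, 30, 40, 50]

# keep the live iterator across the interruption: resuming sees NEW batches
it = iter(data)
assert run_steps(it, 2) == [10, 20]      # interrupted mid-epoch
assert run_steps(it, 3) == [30, 40, 50]  # resumes where it left off

# recreating the iterator restarts from the beginning -- the behavior to avoid
assert run_steps(iter(data), 2) == [10, 20]
```

This is why the MonaiAlgo workaround stashes `_dataloader_iter` on the state: the iterator object itself is what carries the position within the epoch.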
Run/Resume Ignite Engine logic:
```python
# (re)start from 0 to 5
engine.run(data, max_epochs=5)  # -> Engine run starting with max_epochs=5 => state.epoch=5
# continue from 5 to 7
engine.run(data, max_epochs=7)  # -> Engine run resuming from iteration 50, epoch 5 until 7 epochs => state.epoch=7
# error
engine.run(data, max_epochs=4)  # -> ValueError: Argument max_epochs should be larger than the start epoch
# restart from 0 to 7 (as state.epoch == max_epochs(=7), this behaves like the usual
# evaluator.run(data) without any other instructions)
engine.run(data, max_epochs=7)  # -> Engine run starting with max_epochs=7 => state.epoch=7
# forced restart from 0 to 5
engine.state.max_epochs = None
engine.run(data, max_epochs=5)  # -> Engine run starting with max_epochs=5 => state.epoch=5
# forced restart from 0 to 9, instead of continuing from state.epoch=7
engine.state.max_epochs = None
engine.run(data, max_epochs=9)  # -> Engine run starting with max_epochs=9 => state.epoch=9
```
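These rules can be condensed into a runnable toy sketch (illustrative only; the real decision logic lives inside Ignite's `Engine.run`, and `ToyEngine` here is a hypothetical stand-in that "trains" by jumping straight to the target epoch):

```python
class ToyEngine:
    """Toy model of Engine.run's start-vs-resume decision (not Ignite code)."""

    def __init__(self, epoch_length=10):
        self.epoch = 0
        self.iteration = 0
        self.max_epochs = None
        self.epoch_length = epoch_length

    def run(self, max_epochs):
        if self.max_epochs is not None and max_epochs < self.epoch:
            raise ValueError("Argument max_epochs should be larger than the start epoch")
        # restart when max_epochs was reset to None, or the previous run already
        # reached max_epochs (like re-running an evaluator without extra instructions)
        if self.max_epochs is None or self.iteration == self.epoch_length * max_epochs:
            self.epoch, self.iteration = 0, 0
            mode = f"starting with max_epochs={max_epochs}"
        else:
            mode = f"resuming from iteration {self.iteration}, epoch {self.epoch}"
        self.max_epochs = max_epochs
        self.epoch = max_epochs                          # pretend the remaining
        self.iteration = max_epochs * self.epoch_length  # epochs were trained
        return mode

engine = ToyEngine()
assert engine.run(5) == "starting with max_epochs=5"           # (re)start from 0 to 5
assert engine.run(7) == "resuming from iteration 50, epoch 5"  # continue from 5 to 7
try:
    engine.run(4)                                              # error
except ValueError:
    pass
assert engine.run(7) == "starting with max_epochs=7"  # restart: state.epoch == 7
engine.max_epochs = None
assert engine.run(5) == "starting with max_epochs=5"  # forced restart
```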
Hi @vfdev-5 ,
Cool, glad to see your help here.
Currently, we use a workaround to support iteration pause / resume:
```python
setattr(self.trainer.state, "dataloader_iter", self.trainer._dataloader_iter)
```
If you can support from ignite natively, that would be great!
Thanks in advance.
Hi @vfdev-5 ,
Do you have any development plan for this feature now?
Thanks in advance.
Hi @Nic-Ma , this feature has been already in the plan for Engine refactoring. We have to push forward a bit more on that.
Hi @vfdev-5 ,
Cool! We are planning for the next release of MONAI, do you have any more concrete release plan for this feature in ignite?
Thanks.
Unfortunately, the Engine refactoring is too complicated to release quickly. It has been in progress for some time already, and I feel we have to change our strategy for adopting such large changes... Maybe we could add the feature you would like first and do the refactoring a bit later. I'll let you know once we have a concrete schedule.
Hi @Nic-Ma , sorry for the delay. When do you plan your next release? Perhaps we could build and release the feature before then...
Hi @vfdev-5, thanks for the consideration! In mid-September we plan to have a major milestone, and ideally we could include this feature from pytorch-ignite.
Hi @vfdev-5 ,
Do you have any update from ignite side?
We are working on the FL module for the next release in early Sep, which depends on the interrupt / terminate logic.
Thanks in advance.
@Nic-Ma let us try to include this feature in our next release.
Hi @vfdev-5 ,
Cool, may I know your next release date?
Thanks.
Hi @vfdev-5 , thanks for your help here. We plan to code-freeze in early Sep for MONAI 1.0. May I know your next release date?
Thanks in advance.
Sorry for the delayed reply @Nic-Ma , our plan is to release around the end of August.
Is the interrupt / resume feature already in the dev branch of ignite?
I think maybe @holgerroth can help test it before you release it.
Thanks.
Is the interrupt / resume feature already in the dev branch of ignite?
not yet, but soon :)
I think maybe @holgerroth can help test it before you release it.
sounds good 👍
Hi @vfdev-5 ,
When you have the first RC release, please let @holgerroth and me know and test ASAP.
Thanks in advance.
@holgerroth I have a few questions about Engine.interrupt (== abort from MonaiAlgo).

- While resuming from an interrupted state (e.g. in the middle of an epoch), do we expect to see Events.STARTED and Events.EPOCH_STARTED again?
  More details on this question: when we first call Engine.run, then before interrupting the engine we see the following events:
  ```
  STARTED, EPOCH_STARTED (1), ..., ITERATION_STARTED (1), ITERATION_COMPLETED (1), ..., EPOCH_COMPLETED (1), ...
  ```
  where the numbers (1) indicate epoch and iteration indices.
  When we call Engine.interrupt() from a handler, it would trigger an INTERRUPT event and exit the Engine.run method while properly storing dataloader_iter for resuming. For an interruption in the middle of an epoch, it would be:
  ```
  EPOCH_STARTED (12), ..., ITERATION_STARTED (123), ITERATION_COMPLETED (123), INTERRUPT
  ```
  When we resume the run by calling Engine.run again, should we see the events STARTED and EPOCH_STARTED once more?
  ```
  STARTED, EPOCH_STARTED (12), ITERATION_STARTED (124), ITERATION_COMPLETED (124), ...
  ```
  I would say that we may want to skip these events: STARTED, EPOCH_STARTED (12).
- Any particular reason to store dataloader_iter in state when the engine is interrupted?
- About the code: https://github.com/Project-MONAI/monai-fl/blob/ab28be402d48687ed9d42f6c8afa1c0cda7e70b2/monaifl/monai_algo.py#L175-L177
  ```python
  if self.trainer.state.iteration % self.trainer.state.epoch_length == 0:
      # if current iteration is end of 1 epoch, manually trigger epoch completed event
      self.trainer._fire_event(Events.EPOCH_COMPLETED)
  ```
  Can't we instead call engine.interrupt() directly on Events.EPOCH_COMPLETED, so that all necessary handlers have already been executed?

What do you think?
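The suggestion in the last point can be sketched with a toy event loop (hypothetical names like `MiniEngine` and `on_epoch_completed`, not the Ignite API): when the interrupting handler is itself registered on epoch-completed, every other epoch-completed handler has already fired by the time the engine stops, so no manual `_fire_event` call is needed.

```python
class MiniEngine:
    """Toy event loop: interrupt() is requested from an epoch-completed handler."""

    def __init__(self):
        self.epoch_completed_handlers = []
        self.epoch = 0
        self._interrupted = False

    def on_epoch_completed(self, fn):
        self.epoch_completed_handlers.append(fn)

    def interrupt(self):
        self._interrupted = True

    def run(self, max_epochs):
        self._interrupted = False
        while self.epoch < max_epochs:
            self.epoch += 1  # one epoch of training would happen here
            for fn in self.epoch_completed_handlers:
                fn(self)     # all handlers fire, including the interrupter
            if self._interrupted:
                return "interrupted"
        return "completed"

log = []
engine = MiniEngine()
engine.on_epoch_completed(lambda e: log.append(f"ckpt@{e.epoch}"))  # e.g. checkpointing
engine.on_epoch_completed(lambda e: e.interrupt() if e.epoch == 2 else None)

assert engine.run(max_epochs=5) == "interrupted"
assert log == ["ckpt@1", "ckpt@2"]              # checkpoint handler already ran
assert engine.run(max_epochs=5) == "completed"  # resumes from epoch 3
```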
Hi @vfdev-5, thanks for following up on this.
- I agree, we probably don't need to fire these events as the epoch has already started earlier.
- The intention was to save the dataloader state such that training could resume from that state.
- Yes, we want to call Ignite's API directly instead of using that code in MonaiAlgo.
Btw, is it possible to ensure that a terminated engine releases the GPU memory completely?
If we adopt my terms, engine.terminate and engine.interrupt (new feature), then for a terminated engine it should almost be the case. I'm saying almost because engine.state, after exiting engine.run, still contains the last batch, last output, and dataloader as attributes. The batch and output could be on a CUDA device and thus won't be released.
As for engine.interrupt and releasing GPU memory, this could probably be complicated if we still keep the dataloader iterator... We have to experiment with that.
Yes, for interrupt() the memory release shouldn't be necessary, but terminate() should release it if possible, I feel.
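What that release amounts to can be sketched with a toy state object (hypothetical `ToyState`/`FakeTensor`; the real attributes discussed above are engine.state.batch, engine.state.output, and the dataloader): once the state drops its references, the allocator can reclaim the backing (GPU) memory, which weakref can demonstrate without a GPU.

```python
import weakref

class FakeTensor:
    """Stand-in for a CUDA tensor whose memory we want released."""

class ToyState:
    """Stand-in for engine.state keeping references after engine.run exits."""
    def __init__(self):
        self.batch = FakeTensor()
        self.output = FakeTensor()

state = ToyState()
probe = weakref.ref(state.batch)
assert probe() is not None  # state.batch keeps the tensor alive

# what terminate() would need to do so the memory can actually be freed
state.batch = None
state.output = None
assert probe() is None  # no references left; the memory can be reclaimed
```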
Hi @vfdev-5 ,
I am just curious: is it possible to release the GPU memory if the Python process is not stopped? CC @holgerroth
Thanks.
@Nic-Ma allocated memory can be released (context memory maybe not):
```python
import torch

print("-", torch.cuda.memory_allocated() / 1024 / 1024)
x = torch.rand(100, 100, 100, 100, device="cuda")
print("--", torch.cuda.memory_allocated() / 1024 / 1024)
x = None
print("---", torch.cuda.memory_allocated() / 1024 / 1024)
```
Output:
```
- 0.0
-- 382.0
--- 0.0
```
Hi @vfdev-5 ,
Thanks for your help on this ticket. May I know the first RC release date for this feature, so we can test ASAP? As we still need to develop based on it and are targeting Sep 12 for the MONAI release, the schedule is very tight for us now.
Thanks in advance.
Hi @Nic-Ma
We will try to land the code this week; it will be on master and in the nightly release. Next, we'll schedule our regular 0.4.10 release, where this feature will be present if tests on the nightly from your side confirm that we are good.
Thanks
Hi @holgerroth ,
@vfdev-5 opened a PR to support this feature request: https://github.com/pytorch/ignite/pull/2682. Could you please help review it and check whether it satisfies our requirements? Let's try to merge the PR and make a new ignite release ASAP.
Thanks in advance.