
Add interrupt/terminate option to Trainers & Evaluators

Open holgerroth opened this issue 2 years ago • 19 comments

Is your feature request related to a problem? Please describe. It should be possible to abort/finalize a running Trainer by calling an API (rather than pressing Ctrl+C). This would be helpful when the Trainer needs to be executed remotely, such as in federated learning (FL) scenarios.

Describe the solution you'd like Add abort() and finalize() functions to the Trainer class (or potentially its base class). Note that finalize() should terminate the training completely, while abort() should allow training to later continue from where it was aborted, by calling run() again.

For example, abort() and finalize() support for an ignite-based Trainer could be implemented as follows (currently used in MONAI-FL's MonaiAlgo class; the repo is private, contact me if you need access):

    def abort(self):
        # stop the engine but keep enough state to resume later
        self.trainer.terminate()
        # save the current dataloader iterator for the next round
        setattr(self.trainer.state, "dataloader_iter", self.trainer._dataloader_iter)

        if self.trainer.state.iteration % self.trainer.state.epoch_length == 0:
            # if the current iteration is the end of an epoch, manually trigger the epoch-completed event
            self.trainer._fire_event(Events.EPOCH_COMPLETED)

    def finalize(self):
        # terminate the training completely; no resume is intended
        self.trainer.terminate()

Describe alternatives you've considered n/a

Additional context n/a
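To make the intended abort/finalize contract concrete, here is a self-contained toy sketch. The ToyTrainer below is purely illustrative and stands in for an ignite-backed Trainer; none of its names come from MONAI or ignite:

```python
# Toy stand-in for an ignite-backed Trainer; names are illustrative only.
class ToyTrainer:
    def __init__(self, batches):
        self._it = iter(batches)   # kept across abort()/run(), like dataloader_iter
        self.processed = []
        self._stop = False
        self.finalized = False

    def abort(self):
        # pause: keep the iterator so a later run() continues where we left off
        self._stop = True

    def finalize(self):
        # terminate for good: a later run() is not allowed
        self._stop = True
        self.finalized = True

    def run(self, on_batch=None):
        assert not self.finalized, "cannot resume a finalized trainer"
        self._stop = False
        for batch in self._it:
            self.processed.append(batch)   # the training step would happen here
            if on_batch is not None:
                on_batch(self)             # a handler may call abort()/finalize()
            if self._stop:
                break

t = ToyTrainer([1, 2, 3, 4])
# abort after two batches, mid-"epoch"
t.run(on_batch=lambda tr: tr.abort() if len(tr.processed) == 2 else None)
print(t.processed)  # [1, 2]
t.run()             # resumes from the third batch
print(t.processed)  # [1, 2, 3, 4]
```

The key point is that the batch iterator survives abort(), so the second run() continues with unseen batches instead of restarting the epoch.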

holgerroth avatar Jun 21 '22 20:06 holgerroth

This functionality should also be supported by evaluation/inference workflows.

Consider calling it interrupt() and terminate() to better align with the Ignite naming convention.

holgerroth avatar Jun 22 '22 15:06 holgerroth

FYI, interrupt() is necessary to support iteration-based aggregation workflows in FL.

holgerroth avatar Jun 22 '22 15:06 holgerroth

Hi @vfdev-5 ,

I think this feature request is similar to the pause / resume logic in the ignite engine. What do you think?

Thanks.

Nic-Ma avatar Jun 22 '22 15:06 Nic-Ma

Thanks @holgerroth and @Nic-Ma for bringing this topic for the discussion.

This is one of the desired features we are waiting for.

Interrupt calls should be supported by the Ignite trainer for MONAI apps.

Considering complete training or evaluation cycles on remote machines in FL environments, the following interrupt calls could be handy:

  1. trainer_pause --> to pause the current training iteration
  2. trainer_resume --> to resume from last paused training iteration
  3. trainer_abort --> to terminate current iteration and discard any update in the model_state_dict
  4. evaluator_abort --> to terminate current evaluation and discard any computed metric

NVIDIA, MONAI, and Ignite developers can suggest better and more compliant naming conventions for these functionalities.

drmhrehman avatar Jun 22 '22 16:06 drmhrehman

Hi everyone , thanks for pinging.

Yes, an interrupt feature may be nice to have in ignite. Today, it is already possible to resume training from a stopped trainer (under certain assumptions); I'll give an example below.

As for pausing and resuming the current training iteration, this can be a bit tricky with regard to the input dataloader. Basically, we have to store the iterator and make sure that resuming continues iterating over new batches and not from the beginning...

Let me prioritize this feature on the ignite side.

Run/Resume Ignite Engine logic:

    # (re)start from 0 to 5
    engine.run(data, max_epochs=5)  # -> Engine run starting with max_epochs=5 => state.epoch=5

    # continue from 5 to 7
    engine.run(data, max_epochs=7)  # -> Engine run resuming from iteration 50, epoch 5 until 7 epochs => state.epoch=7

    # error
    engine.run(data, max_epochs=4)  # -> ValueError: Argument max_epochs should be larger than the start epoch

    # restart from 0 to 7 (as state.epoch == max_epochs(=7), this behaves like the usual
    # evaluator.run(data) call without any other instructions)
    engine.run(data, max_epochs=7)  # -> Engine run starting with max_epochs=7 => state.epoch=7

    # forced restart from 0 to 5
    engine.state.max_epochs = None
    engine.run(data, max_epochs=5)  # -> Engine run starting with max_epochs=5 => state.epoch=5

    # forced restart from 0 to 9, instead of continuing from state.epoch=7
    engine.state.max_epochs = None
    engine.run(data, max_epochs=9)  # -> Engine run starting with max_epochs=9 => state.epoch=9
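These resume rules can be modeled with a small self-contained toy. ToyEngine below is not ignite's real Engine; it only reproduces the epoch/max_epochs bookkeeping described above, with no data and no events:

```python
# Toy model of the run/resume rules above; not ignite's real Engine.
class ToyState:
    def __init__(self):
        self.epoch = 0
        self.max_epochs = None

class ToyEngine:
    def __init__(self):
        self.state = ToyState()

    def run(self, max_epochs):
        # restart when no previous run exists, or the previous run completed
        # and the same max_epochs is requested again
        restart = self.state.max_epochs is None or self.state.epoch == max_epochs
        if not restart and max_epochs < self.state.epoch:
            raise ValueError("Argument max_epochs should be larger than the start epoch")
        if restart:
            self.state.epoch = 0
        self.state.max_epochs = max_epochs
        while self.state.epoch < max_epochs:
            self.state.epoch += 1  # one training epoch would run here

engine = ToyEngine()
engine.run(max_epochs=5)        # fresh start: 0 -> 5
assert engine.state.epoch == 5
engine.run(max_epochs=7)        # resumes: 5 -> 7
assert engine.state.epoch == 7
try:
    engine.run(max_epochs=4)    # smaller than the current epoch: error
except ValueError:
    pass
engine.run(max_epochs=7)        # epoch == requested max_epochs: restart 0 -> 7
engine.state.max_epochs = None  # force a restart
engine.run(max_epochs=5)        # fresh start again: 0 -> 5
assert engine.state.epoch == 5
```

Each branch mirrors one case in the log above: resume, error, implicit restart, and forced restart via `state.max_epochs = None`.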

vfdev-5 avatar Jun 22 '22 16:06 vfdev-5

Hi @vfdev-5 ,

Cool, glad to see your help here. Currently, we use a workaround to support iteration pause / resume:

setattr(self.trainer.state, "dataloader_iter", self.trainer._dataloader_iter)

If you can support from ignite natively, that would be great!

Thanks in advance.

Nic-Ma avatar Jun 23 '22 02:06 Nic-Ma

Hi @vfdev-5 ,

Do you have any development plan for this feature now?

Thanks in advance.

Nic-Ma avatar Jun 28 '22 22:06 Nic-Ma

Hi @Nic-Ma , this feature is already planned as part of the Engine refactoring. We have to push forward a bit more on that.

vfdev-5 avatar Jun 28 '22 22:06 vfdev-5

Hi @vfdev-5 ,

Cool! We are planning the next release of MONAI; do you have a more concrete release plan for this feature in ignite?

Thanks.

Nic-Ma avatar Jun 28 '22 22:06 Nic-Ma

Unfortunately, the Engine refactoring is too complicated to release quickly. It has been ongoing for some time already, and I feel we have to change our strategy for adopting such large changes... Maybe we could first add the feature you would like and do the refactoring a bit later. I'll let you know once we have a concrete schedule.

vfdev-5 avatar Jun 28 '22 22:06 vfdev-5

Hi @Nic-Ma , sorry for the delay; when do you plan your next release? Perhaps we could implement the feature and release it before then...

vfdev-5 avatar Jun 30 '22 16:06 vfdev-5

Hi @vfdev-5, thanks for the consideration! In mid-September we plan to have a major milestone, and ideally we could include this feature from pytorch-ignite.

wyli avatar Jun 30 '22 16:06 wyli

Hi @vfdev-5 ,

Do you have any update from the ignite side? We are working on the FL module for the next release in early Sep, which depends on the interrupt / terminate logic.

Thanks in advance.

Nic-Ma avatar Aug 03 '22 15:08 Nic-Ma

@Nic-Ma let us try to include this feature in our next upcoming release.

vfdev-5 avatar Aug 03 '22 21:08 vfdev-5

Hi @vfdev-5 ,

Cool, may I know your next release time?

Thanks.

Nic-Ma avatar Aug 03 '22 23:08 Nic-Ma

Hi @vfdev-5 , thanks for your help here. We plan to code-freeze in early Sep for MONAI 1.0; may I know when your next release is?

Thanks in advance.

Nic-Ma avatar Aug 10 '22 15:08 Nic-Ma

Sorry for the delayed reply @Nic-Ma , our plan is to release around the end of August.

vfdev-5 avatar Aug 10 '22 15:08 vfdev-5

Is the interrupt / resume feature already in the dev branch of ignite? I think maybe @holgerroth can help test it before you release it.

Thanks.

Nic-Ma avatar Aug 10 '22 15:08 Nic-Ma

Is the interrupt / resume feature already in the dev branch of ignite?

not yet, but soon :)

I think maybe @holgerroth can help test it before you release it.

sounds good 👍

vfdev-5 avatar Aug 10 '22 15:08 vfdev-5

Hi @vfdev-5 ,

When you have the first RC release, please let @holgerroth and me know so we can test ASAP.

Thanks in advance.

Nic-Ma avatar Aug 15 '22 14:08 Nic-Ma

@holgerroth I have a few questions about Engine.interrupt (equivalent to abort() in MonaiAlgo).

  1. While resuming from an interrupted state (e.g. in the middle of an epoch), do we expect to see Events.STARTED and Events.EPOCH_STARTED again?

More details on this question:

  • When we first call Engine.run, then before interrupting the engine we see the following events:
STARTED, EPOCH_STARTED (1), ..., ITERATION_STARTED (1), ITERATION_COMPLETED (1), ..., EPOCH_COMPLETED (1), ...

where numbers (1) indicate epoch and iteration indices.

  • When we call Engine.interrupt() from a handler, it would trigger an INTERRUPT event and exit the Engine.run method while properly storing dataloader_iter for resuming. For an interruption in the middle of an epoch, the sequence would be
EPOCH_STARTED (12) ... ITERATION_STARTED (123), ITERATION_COMPLETED (123), INTERRUPT
  • When we resume the run by calling Engine.run again, should we see the events STARTED and EPOCH_STARTED once more?
STARTED, EPOCH_STARTED (12), ITERATION_STARTED (124), ITERATION_COMPLETED (124), ...

I would say that we may want to skip these events: STARTED, EPOCH_STARTED (12).

  2. Is there any particular reason to store dataloader_iter in state when the engine is interrupted?

  3. About the code: https://github.com/Project-MONAI/monai-fl/blob/ab28be402d48687ed9d42f6c8afa1c0cda7e70b2/monaifl/monai_algo.py#L175-L177

            if self.trainer.state.iteration % self.trainer.state.epoch_length == 0:
                # if current iteration is end of 1 epoch, manually trigger epoch completed event
                self.trainer._fire_event(Events.EPOCH_COMPLETED)

Can't we instead call engine.interrupt() directly on Events.EPOCH_COMPLETED, so that all necessary handlers have already executed?

What do you think ?
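The alternative can be illustrated with a tiny self-contained event loop. MiniEngine below is a toy, not ignite; it only shows that interrupting from an EPOCH_COMPLETED handler lets every other handler for that event run first:

```python
# Toy event loop (not ignite) illustrating why calling interrupt() from an
# EPOCH_COMPLETED handler is cleaner than firing the event manually: all
# handlers registered for EPOCH_COMPLETED have already run when we stop.
class MiniEngine:
    def __init__(self):
        self.handlers = {"EPOCH_COMPLETED": []}
        self.log = []
        self._interrupted = False

    def on(self, event, fn):
        self.handlers[event].append(fn)

    def interrupt(self):
        self._interrupted = True

    def run(self, n_epochs):
        for epoch in range(1, n_epochs + 1):
            self.log.append(f"epoch {epoch}")
            for h in self.handlers["EPOCH_COMPLETED"]:
                h(self)      # checkpointing, metrics, ... run first
            if self._interrupted:
                return       # stop only after all handlers have run

mini = MiniEngine()
mini.on("EPOCH_COMPLETED", lambda e: e.log.append("checkpoint saved"))
mini.on("EPOCH_COMPLETED", lambda e: e.interrupt())  # pause after each epoch
mini.run(n_epochs=3)
print(mini.log)  # ['epoch 1', 'checkpoint saved']
```

Because the stop is checked only after the handler loop, the checkpoint handler runs before the engine pauses; firing EPOCH_COMPLETED manually from outside the engine risks running handlers at the wrong time or twice.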

vfdev-5 avatar Aug 16 '22 08:08 vfdev-5

Hi @vfdev-5, thanks for following up on this.

  1. I agree, we probably don't need to fire these events as the epoch has already started earlier.
  2. The intention was to save the dataloader state such that training could resume from that state.
  3. Yes, we want to call Ignite's API directly instead of using that code in MonaiAlgo.

Btw, is it possible to ensure that a terminated engine releases the GPU memory completely?

holgerroth avatar Aug 16 '22 14:08 holgerroth

If we adopt my terms, engine.terminate and engine.interrupt (the new feature), then for a terminated engine this should almost be the case. I say almost because engine.state, after exiting engine.run, still contains the last batch, the last output, and the dataloader as attributes. The batch and output could be on a CUDA device and thus won't be released.

As for engine.interrupt and releasing GPU memory, this could be complicated if we still keep the dataloader iterator... We have to experiment with that.

vfdev-5 avatar Aug 16 '22 14:08 vfdev-5

Yes, for interrupt(), releasing the memory shouldn't be necessary, but terminate() should release it if possible, I feel.

holgerroth avatar Aug 16 '22 14:08 holgerroth

Hi @vfdev-5 ,

I am just curious: is it possible to release the GPU memory while the Python process is still running? CC @holgerroth

Thanks.

Nic-Ma avatar Aug 17 '22 16:08 Nic-Ma

@Nic-Ma allocated memory can be released (context memory maybe not):

    import torch

    print("-", torch.cuda.memory_allocated() / 1024 / 1024)

    x = torch.rand(100, 100, 100, 100, device="cuda")
    print("--", torch.cuda.memory_allocated() / 1024 / 1024)

    # drop the only reference so the caching allocator can reclaim the block
    x = None
    print("---", torch.cuda.memory_allocated() / 1024 / 1024)

which prints:

    - 0.0
    -- 382.0
    --- 0.0

vfdev-5 avatar Aug 17 '22 17:08 vfdev-5

Hi @vfdev-5 ,

Thanks for your help with this ticket. May I know the first RC release time of this feature so we can test ASAP? As we still need to develop on top of it and are targeting Sep 12 for the MONAI release, the schedule is very tight for us now.

Thanks in advance.

Nic-Ma avatar Aug 22 '22 15:08 Nic-Ma

Hi @Nic-Ma

We will try to land the code this week; it will then be on master and in the nightly release. Next, we'll schedule our regular 0.4.10 release, where this feature will be included, provided your tests on the nightly confirm that we are good.

Thanks

vfdev-5 avatar Aug 23 '22 06:08 vfdev-5

Hi @holgerroth ,

@vfdev-5 opened a PR to support this feature request: https://github.com/pytorch/ignite/pull/2682. Could you please help review it and check whether it satisfies our requirements? Let's try to merge the PR and make a new ignite release ASAP.

Thanks in advance.

Nic-Ma avatar Aug 31 '22 06:08 Nic-Ma