ignite icon indicating copy to clipboard operation
ignite copied to clipboard

Improvements for resuming from terminated engine

Open vfdev-5 opened this issue 2 years ago • 1 comments

Current Engine's behaviour when we resume the run from terminated state is the following:

from ignite.engine import Engine, Events
from ignite.utils import setup_logger, logging


def func(engine, batch):
    print(engine.state.epoch, engine.state.iteration, " | ", batch)


max_epochs = 4
data = range(10)
engine = Engine(func)
engine.logger = setup_logger("engine", level=logging.DEBUG)


@engine.on(Events.ITERATION_COMPLETED(once=14))
def terminate():
    print("-> terminate")
    engine.terminate()

engine.run(data, max_epochs=max_epochs)
engine.run(data, max_epochs=max_epochs)
print(engine.state.epoch, engine.state.iteration)
Output:
2022-08-23 15:42:07,681 engine INFO: Engine run starting with max_epochs=4.
2022-08-23 15:42:07,682 engine DEBUG: 0 | 0, Firing handlers for event Events.STARTED
2022-08-23 15:42:07,683 engine DEBUG: 1 | 0, Firing handlers for event Events.EPOCH_STARTED
2022-08-23 15:42:07,683 engine DEBUG: 1 | 0, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,685 engine DEBUG: 1 | 0, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,687 engine DEBUG: 1 | 1, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,688 engine DEBUG: 1 | 1, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,688 engine DEBUG: 1 | 1, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,689 engine DEBUG: 1 | 1, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,690 engine DEBUG: 1 | 2, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,691 engine DEBUG: 1 | 2, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,692 engine DEBUG: 1 | 2, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,692 engine DEBUG: 1 | 2, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,693 engine DEBUG: 1 | 3, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,693 engine DEBUG: 1 | 3, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,695 engine DEBUG: 1 | 3, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,696 engine DEBUG: 1 | 3, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,697 engine DEBUG: 1 | 4, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,698 engine DEBUG: 1 | 4, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,698 engine DEBUG: 1 | 4, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,700 engine DEBUG: 1 | 4, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,701 engine DEBUG: 1 | 5, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,702 engine DEBUG: 1 | 5, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,703 engine DEBUG: 1 | 5, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,704 engine DEBUG: 1 | 5, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,706 engine DEBUG: 1 | 6, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,707 engine DEBUG: 1 | 6, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,708 engine DEBUG: 1 | 6, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,708 engine DEBUG: 1 | 6, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,709 engine DEBUG: 1 | 7, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,711 engine DEBUG: 1 | 7, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,712 engine DEBUG: 1 | 7, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,714 engine DEBUG: 1 | 7, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,714 engine DEBUG: 1 | 8, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,716 engine DEBUG: 1 | 8, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,717 engine DEBUG: 1 | 8, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,717 engine DEBUG: 1 | 8, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,719 engine DEBUG: 1 | 9, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,719 engine DEBUG: 1 | 9, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,720 engine DEBUG: 1 | 9, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,720 engine DEBUG: 1 | 9, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,721 engine DEBUG: 1 | 10, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,722 engine DEBUG: 1 | 10, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,722 engine DEBUG: 1 | 10, Firing handlers for event Events.EPOCH_COMPLETED
2022-08-23 15:42:07,723 engine INFO: Epoch[1] Complete. Time taken: 00:00:00.039
2022-08-23 15:42:07,723 engine DEBUG: 2 | 10, Firing handlers for event Events.EPOCH_STARTED
2022-08-23 15:42:07,724 engine DEBUG: 2 | 10, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,724 engine DEBUG: 2 | 10, Firing handlers for event Events.DATALOADER_STOP_ITERATION
2022-08-23 15:42:07,725 engine DEBUG: 2 | 10, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,725 engine DEBUG: 2 | 11, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,726 engine DEBUG: 2 | 11, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,727 engine DEBUG: 2 | 11, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,727 engine DEBUG: 2 | 11, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,728 engine DEBUG: 2 | 12, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,728 engine DEBUG: 2 | 12, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,729 engine DEBUG: 2 | 12, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,729 engine DEBUG: 2 | 12, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,730 engine DEBUG: 2 | 13, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,733 engine DEBUG: 2 | 13, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,734 engine DEBUG: 2 | 13, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,734 engine DEBUG: 2 | 13, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,735 engine DEBUG: 2 | 14, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,736 engine DEBUG: 2 | 14, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,737 engine INFO: Terminate signaled. Engine will stop after current iteration is finished.
2022-08-23 15:42:07,737 engine DEBUG: 2 | 14, Firing handlers for event Events.TERMINATE_SINGLE_EPOCH
2022-08-23 15:42:07,738 engine DEBUG: 2 | 14, Firing handlers for event Events.TERMINATE
2022-08-23 15:42:07,739 engine INFO: Epoch[2] Complete. Time taken: 00:00:00.015
2022-08-23 15:42:07,739 engine DEBUG: 2 | 14, Firing handlers for event Events.COMPLETED
2022-08-23 15:42:07,740 engine INFO: Engine run complete. Time taken: 00:00:00.058
2022-08-23 15:42:07,741 engine INFO: Engine run resuming from iteration 14, epoch 2 until 4 epochs
2022-08-23 15:42:07,741 engine DEBUG: 2 | 14, Firing handlers for event Events.STARTED
2022-08-23 15:42:07,742 engine DEBUG: 3 | 14, Firing handlers for event Events.EPOCH_STARTED
2022-08-23 15:42:07,743 engine DEBUG: 3 | 14, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,743 engine DEBUG: 3 | 14, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,744 engine DEBUG: 3 | 15, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,745 engine DEBUG: 3 | 15, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,746 engine DEBUG: 3 | 15, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,746 engine DEBUG: 3 | 15, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,747 engine DEBUG: 3 | 16, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,748 engine DEBUG: 3 | 16, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,748 engine DEBUG: 3 | 16, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,749 engine DEBUG: 3 | 16, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,749 engine DEBUG: 3 | 17, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,750 engine DEBUG: 3 | 17, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,750 engine DEBUG: 3 | 17, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,751 engine DEBUG: 3 | 17, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,751 engine DEBUG: 3 | 18, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,754 engine DEBUG: 3 | 18, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,755 engine DEBUG: 3 | 18, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,755 engine DEBUG: 3 | 18, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,756 engine DEBUG: 3 | 19, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,757 engine DEBUG: 3 | 19, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,758 engine DEBUG: 3 | 19, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,758 engine DEBUG: 3 | 19, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,759 engine DEBUG: 3 | 20, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,759 engine DEBUG: 3 | 20, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,761 engine DEBUG: 3 | 20, Firing handlers for event Events.EPOCH_COMPLETED
2022-08-23 15:42:07,761 engine INFO: Epoch[3] Complete. Time taken: 00:00:00.018
2022-08-23 15:42:07,762 engine DEBUG: 4 | 20, Firing handlers for event Events.EPOCH_STARTED
2022-08-23 15:42:07,762 engine DEBUG: 4 | 20, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,763 engine DEBUG: 4 | 20, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,763 engine DEBUG: 4 | 21, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,764 engine DEBUG: 4 | 21, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,764 engine DEBUG: 4 | 21, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,765 engine DEBUG: 4 | 21, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,766 engine DEBUG: 4 | 22, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,766 engine DEBUG: 4 | 22, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,767 engine DEBUG: 4 | 22, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,768 engine DEBUG: 4 | 22, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,769 engine DEBUG: 4 | 23, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,769 engine DEBUG: 4 | 23, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,770 engine DEBUG: 4 | 23, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,771 engine DEBUG: 4 | 23, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,771 engine DEBUG: 4 | 24, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,772 engine DEBUG: 4 | 24, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,773 engine DEBUG: 4 | 24, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,774 engine DEBUG: 4 | 24, Firing handlers for event Events.DATALOADER_STOP_ITERATION
2022-08-23 15:42:07,774 engine DEBUG: 4 | 24, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,775 engine DEBUG: 4 | 25, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,776 engine DEBUG: 4 | 25, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,776 engine DEBUG: 4 | 25, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,777 engine DEBUG: 4 | 25, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,778 engine DEBUG: 4 | 26, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,779 engine DEBUG: 4 | 26, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,780 engine DEBUG: 4 | 26, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,781 engine DEBUG: 4 | 26, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,781 engine DEBUG: 4 | 27, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,782 engine DEBUG: 4 | 27, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,783 engine DEBUG: 4 | 27, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,783 engine DEBUG: 4 | 27, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,784 engine DEBUG: 4 | 28, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,784 engine DEBUG: 4 | 28, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,785 engine DEBUG: 4 | 28, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,785 engine DEBUG: 4 | 28, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,786 engine DEBUG: 4 | 29, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,787 engine DEBUG: 4 | 29, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,787 engine DEBUG: 4 | 29, Firing handlers for event Events.GET_BATCH_STARTED
2022-08-23 15:42:07,788 engine DEBUG: 4 | 29, Firing handlers for event Events.GET_BATCH_COMPLETED
2022-08-23 15:42:07,788 engine DEBUG: 4 | 30, Firing handlers for event Events.ITERATION_STARTED
2022-08-23 15:42:07,789 engine DEBUG: 4 | 30, Firing handlers for event Events.ITERATION_COMPLETED
2022-08-23 15:42:07,790 engine DEBUG: 4 | 30, Firing handlers for event Events.EPOCH_COMPLETED
2022-08-23 15:42:07,791 engine INFO: Epoch[4] Complete. Time taken: 00:00:00.029
2022-08-23 15:42:07,791 engine DEBUG: 4 | 30, Firing handlers for event Events.COMPLETED
2022-08-23 15:42:07,792 engine INFO: Engine run complete. Time taken: 00:00:00.051

1 1  |  0
1 2  |  1
1 3  |  2
1 4  |  3
1 5  |  4
1 6  |  5
1 7  |  6
1 8  |  7
1 9  |  8
1 10  |  9
2 11  |  0
2 12  |  1
2 13  |  2
2 14  |  3
-> terminate
3 15  |  0
3 16  |  1
3 17  |  2
3 18  |  3
3 19  |  4
3 20  |  5
4 21  |  6
4 22  |  7
4 23  |  8
4 24  |  9
4 25  |  0
4 26  |  1
4 27  |  2
4 28  |  3
4 29  |  4
4 30  |  5

(4, 30)

We can see that the engine stopped at epoch 2 and when engine is resumed, epochs are counted from 3. The epoch 3 does not count 10 iterations and ends earlier (after 6 iterations).

Maybe, an expected behaviour while resuming could be to count epochs from 2 and count 10 iterations before switching to the epoch 4 ?

@sadra-barikbin what do you think ?

EDIT: a suggestion from https://github.com/pytorch/ignite/pull/2645#discussion_r952675098

It's due to iter_counter local variable (in run_once_on_dataset) which has been initialized as state.iteration % epoch_length (3) and causes engine to stop 7 batches later when iter_counter == epoch_length.

Hence, iter_counter should start with zero.

vfdev-5 avatar Aug 23 '22 13:08 vfdev-5

I agree with Sadra's comment here: https://github.com/pytorch/ignite/pull/2645#discussion_r952231156 that final iteration should be 34 == epoch_length * (max_epochs - 1) + (iteration_to_stop % epoch_length)

vfdev-5 avatar Aug 29 '22 10:08 vfdev-5