
Sync trainer state with evaluators

Open vfdev-5 opened this issue 2 years ago • 10 comments

🚀 Feature

There are use cases where we would like to get the trainer's epoch/iteration and/or other items from trainer.state. Let's propose an API that makes it easy to get the trainer's state from an evaluator.

Context : https://discuss.pytorch.org/t/get-current-epoch-inside-process-function-of-evaluator/162926

vfdev-5 avatar Oct 06 '22 12:10 vfdev-5

Many handlers/metrics accept a global_step_transform argument to get the step they need.
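To make the pattern concrete, here is a minimal sketch that uses tiny stand-in classes instead of ignite's real Engine. MiniState, MiniEngine, step_from and log_metrics are hypothetical names made up for this illustration; in ignite, I believe the helper global_step_from_engine(trainer) plays the role of step_from here.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class MiniState:
    # Stand-in for an engine state; only the fields used in this sketch.
    epoch: int = 0
    iteration: int = 0
    metrics: dict = field(default_factory=dict)


@dataclass
class MiniEngine:
    # Stand-in for an engine; NOT ignite's Engine class.
    state: MiniState = field(default_factory=MiniState)


def step_from(engine: MiniEngine) -> Callable[[MiniEngine, str], int]:
    """Build a transform that always reports the epoch of `engine`."""
    def transform(_caller: MiniEngine, _event_name: str) -> int:
        return engine.state.epoch
    return transform


def log_metrics(
    evaluator: MiniEngine,
    event_name: str,
    global_step_transform: Optional[Callable[[MiniEngine, str], int]] = None,
) -> dict:
    """Hypothetical handler: label the evaluator's metrics with a step."""
    if global_step_transform is None:
        step = evaluator.state.epoch  # default: the evaluator's own epoch
    else:
        step = global_step_transform(evaluator, event_name)
    return {"step": step, **evaluator.state.metrics}


trainer = MiniEngine(MiniState(epoch=7))
evaluator = MiniEngine(MiniState(epoch=1, metrics={"accuracy": 0.9}))

record = log_metrics(
    evaluator, "epoch_completed", global_step_transform=step_from(trainer)
)
print(record)
```

The handler never touches the trainer directly; it only calls the transform it was given, which is why no trainer abstraction is needed inside the evaluator.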

louis-she avatar Oct 06 '22 14:10 louis-she

Can I work on this? I am pretty new to this.

jalajk24 avatar Jan 29 '23 21:01 jalajk24

@jalajk24 right now it is still under discussion whether we need to work on something here. Do you have any ideas or suggestions on the topic?

vfdev-5 avatar Jan 30 '23 08:01 vfdev-5

I am proposing a new method for the Engine class that fetches the epoch from a trainer instance, stores it in the calling engine's state, and returns it. It could work like this:

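def fetch_trainer_epoch(self, trainer: Engine) -> int:
    # proposed Engine method: self is the calling engine, e.g. an evaluator
    epoch = trainer.state.epoch
    # keep a copy of the trainer's epoch in this engine's own state
    self.state.trainer_epoch = epoch
    return epoch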

@vfdev-5 does this make sense?

It can be called as a method on the engine, like optimizer.step().
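As a rough sketch of how that call might look, the snippet below puts the proposed method on a stand-in class. SketchEngine is a hypothetical name and is not ignite's Engine; fetch_trainer_epoch is only the proposal from this comment, not an existing ignite API.

```python
from types import SimpleNamespace


class SketchEngine:
    """Stand-in engine that only carries a state; NOT ignite's Engine."""

    def __init__(self) -> None:
        self.state = SimpleNamespace(epoch=0, iteration=0)

    def fetch_trainer_epoch(self, trainer: "SketchEngine") -> int:
        # Proposed method: copy the trainer's epoch into this engine's state.
        self.state.trainer_epoch = trainer.state.epoch
        return self.state.trainer_epoch


trainer = SketchEngine()
evaluator = SketchEngine()

trainer.state.epoch = 3  # pretend three training epochs have completed
epoch = evaluator.fetch_trainer_epoch(trainer)
print(epoch, evaluator.state.trainer_epoch)
```

Note that the copy is a snapshot: evaluator.state.trainer_epoch only changes when the method is called again.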

guptaaryan16 avatar Feb 18 '23 05:02 guptaaryan16

The core question of this issue is whether ignite should have an abstraction for a trainer. From what I know of ignite, that is not a good idea, at least not in its core.

louis-she avatar Feb 18 '23 06:02 louis-she

Hey @louis-she, I guess the API can be helpful for comparing the performance of two or more different training methods, and it can also help when training ensemble models. I have been working on GANs and adversarial training, and I have noticed that you sometimes need to combine two training methods to get better results, so this may be a helpful addition to the Engine class.

guptaaryan16 avatar Feb 18 '23 10:02 guptaaryan16

@guptaaryan16 can you please give a concrete example of what you are talking about?

vfdev-5 avatar Feb 18 '23 10:02 vfdev-5

Sure @vfdev-5, I think it will be mostly useful for hyperparameter tuning and for testing how results vary, to make training easier, for example by reducing the number of epochs and testing different training methods.

For instance, here is something I noticed when training a model on CIFAR-10 with Gaussian augmentation training (https://arxiv.org/abs/1902.02918) to measure the Average Certified Radius (ACR) of the model using randomized smoothing. If I included PGD adversarial training (https://arxiv.org/pdf/1706.06083.pdf) in addition to the Gaussian augmentation training, I could get a very high ACR. But to find the specific hyperparameters, you need to get the current training epoch and see where the evaluators are getting the best results. So this API may be helpful, although you can also get the specific epoch without it.

guptaaryan16 avatar Feb 18 '23 10:02 guptaaryan16

@guptaaryan16 thanks for the details, but I was wondering more about the code. Can you provide some code to highlight your idea? As for HP tuning and multiple experiments, you can check:

  • HP tuning tutorial: https://github.com/pytorch/ignite/blob/master/examples/notebooks/Cifar10_Ax_hyperparam_tuning.ipynb
  • Experiment tracking e.g. with ClearML: https://pytorch-ignite.ai/how-to-guides/10-loggers/

"get the specific hyper parameters you need to get the current training epoch and see where the evaluators are getting best results."

I think there is nothing impossible here. I imagine that you have a handler to run validation:

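import torch

best_acr = 0.0

def run_validation():
    # best_acr lives at module level, so declare it global before reassigning it
    global best_acr
    evaluator.run(val_data)
    metrics = evaluator.state.metrics
    if metrics["ACR"] > best_acr:
        best_acr = metrics["ACR"]
        # the trainer is visible from the enclosing scope, so read its epoch directly
        current_epoch = trainer.state.epoch
        # save locally a bundle:
        fp = f"/path/to/output/{current_epoch}_best_acr.pt"
        torch.save({
            "best_acr": best_acr,
            "epoch": current_epoch,
            "model": model.state_dict(),
            # ...
        }, fp)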
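For readers who want to see the whole loop, here is a minimal runnable sketch of the same idea with a stand-in training loop. TinyTrainer, on_epoch_completed and the fake_acr scores are all made up for illustration and are not ignite APIs or real results; in ignite the handler would be attached with something like trainer.on(Events.EPOCH_COMPLETED).

```python
from types import SimpleNamespace
from typing import Callable, List


class TinyTrainer:
    """Stand-in training loop that fires handlers after each epoch."""

    def __init__(self) -> None:
        self.state = SimpleNamespace(epoch=0)
        self._handlers: List[Callable[[], None]] = []

    def on_epoch_completed(self, handler: Callable[[], None]) -> Callable[[], None]:
        # Register a handler; returning it lets this work as a decorator.
        self._handlers.append(handler)
        return handler

    def run(self, max_epochs: int) -> None:
        for epoch in range(1, max_epochs + 1):
            self.state.epoch = epoch
            for handler in self._handlers:
                handler()


# Made-up validation scores per epoch, purely for illustration.
fake_acr = {1: 0.40, 2: 0.55, 3: 0.50}

trainer = TinyTrainer()
best = {"acr": 0.0, "epoch": None}  # mutable holder avoids `global`


@trainer.on_epoch_completed
def run_validation() -> None:
    acr = fake_acr[trainer.state.epoch]  # stands in for evaluator.run(...)
    if acr > best["acr"]:
        best["acr"] = acr
        best["epoch"] = trainer.state.epoch  # trainer is visible via closure


trainer.run(max_epochs=3)
print(best)
```

Because the handler closes over the trainer, the evaluator side needs no extra API to learn the current epoch.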

vfdev-5 avatar Feb 18 '23 13:02 vfdev-5

Yes @vfdev-5, I do not have the specific code for that, but I can imagine it was written along the same lines (that project did not use ignite). Also, I was wondering whether we could access the epoch directly as trainer.epoch instead of trainer.state.epoch. That makes a bit more sense to me, because I don't think we can have different states within the same trainer anyway.
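One way such a shortcut might look is a read-only property that delegates to the state. This is only a sketch on a stand-in class: ShortcutEngine is a hypothetical name, and trainer.epoch is the idea floated here, not an attribute that ignite's Engine is known to provide.

```python
from types import SimpleNamespace


class ShortcutEngine:
    """Stand-in engine; NOT ignite's Engine."""

    def __init__(self) -> None:
        self.state = SimpleNamespace(epoch=0, iteration=0)

    @property
    def epoch(self) -> int:
        # Read-only alias: always reflects the current state.epoch.
        return self.state.epoch


trainer = ShortcutEngine()
trainer.state.epoch = 5
print(trainer.epoch)
```

Since the property reads through self.state on every access, it stays correct even if the state object is replaced.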

guptaaryan16 avatar Feb 18 '23 17:02 guptaaryan16