super-gradients

Feature/sg 757 resume for spots

Open shaydeci opened this issue 1 year ago • 4 comments

Added support for resuming from a remote checkpoint stored by the SG logger during training (that is, when sg_logger_params.save_checkpoints_remote=True).

For the base SG logger this is still problematic, as we don't have run IDs for S3. For platform loggers, we currently can't download files from the platform except the ones they explicitly allow. I talked to @roikoren755, and once it is possible I will add the mechanism for the platform as well.

Regarding PR content:

  • I moved SG Logger initialization so that it is performed prior to checkpoint loading, so we can download the checkpoint which we wish to resume from.
  • I introduced a "resume_from_remote_sg_logger" training param that, when set, will download "ckpt_name" into our checkpoints directory and then resume training from it.
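
As a rough illustration of the two points above, the training params for a remote resume might look like the sketch below. Only "resume_from_remote_sg_logger", "ckpt_name" and sg_logger_params.save_checkpoints_remote come from this PR description; the logger name "wandb_sg_logger", the default "ckpt_latest.pth", and the check_remote_resume helper are assumptions made for the example and are not part of this PR.

```python
from typing import Any, Dict


def build_resume_params(ckpt_name: str = "ckpt_latest.pth") -> Dict[str, Any]:
    """Build training params that request a resume from a remotely stored checkpoint."""
    return {
        # Assumption: a logger that stores checkpoints remotely (WandB here).
        "sg_logger": "wandb_sg_logger",
        "sg_logger_params": {
            # The original run must have uploaded its checkpoints.
            "save_checkpoints_remote": True,
        },
        # Param introduced in this PR: download "ckpt_name", then resume from it.
        "resume_from_remote_sg_logger": True,
        "ckpt_name": ckpt_name,
    }


def check_remote_resume(params: Dict[str, Any]) -> None:
    """Hypothetical sanity check (not an SG function): a remote resume only
    makes sense if checkpoints were saved remotely in the first place."""
    if not params.get("resume_from_remote_sg_logger", False):
        return
    logger_params = params.get("sg_logger_params") or {}
    if not logger_params.get("save_checkpoints_remote", False):
        raise ValueError(
            "resume_from_remote_sg_logger requires "
            "sg_logger_params.save_checkpoints_remote=True"
        )


training_params = build_resume_params()
check_remote_resume(training_params)
```

The resulting dict would then be passed as the training params of the usual training call, alongside the rest of the recipe's settings.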

shaydeci avatar Apr 30 '23 12:04 shaydeci

Looks good. Only one comment: I don't understand why we need resume_from_remote_sg_logger when we already have resumed.

resumed is just a flag the wandb logger needs in order to continue logging properly (it is not passed explicitly but rather derived from resume in the training params). Even if it were meant for that, I think it's better to pass this parameter through the training params for clarity and simplicity (it's clearer to me, though I might be biased).

shaydeci avatar Apr 30 '23 12:04 shaydeci

Looks good. Do you want to add a section on this feature to the docs? It is not quite clear at first glance how to use this feature. Some example snippets (for both use cases) would definitely help and smooth the learning curve.

BloodAxe avatar May 01 '23 10:05 BloodAxe

Looks good. Do you want to add a section on this feature to the docs? It is not quite clear at first glance how to use this feature. Some example snippets (for both use cases) would definitely help and smooth the learning curve.

Sure, I added a section in our checkpoints.md in the latest commit.

shaydeci avatar May 01 '23 12:05 shaydeci