super-gradients
Feature/sg 757 resume for spots
Added support for resuming from a remote checkpoint stored by the SG logger during training (that is, when sg_logger_params.save_checkpoints_remote=True).
For the base SG logger this is still problematic, as we don't have run IDs for S3. For platform loggers, we currently can't download files from the platform except the ones they explicitly allow. I talked to @roikoren755, and once it becomes possible I will add the mechanism for the platform as well.
Regarding PR content:
- I moved SG logger initialization so that it runs before checkpoint loading, which lets us download the checkpoint we wish to resume from.
- I introduced a "resume_from_remote_sg_logger" training param that, when set, downloads "ckpt_name" into our checkpoints directory and then resumes training from it.
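
To make the two points above concrete, here is a minimal, self-contained sketch of the intended order of operations: the logger exists first, then the checkpoint is fetched, then training resumes from the local copy. It does not import super-gradients. `FakeRemoteLogger`, its `download_checkpoint` method, and `resolve_resume_checkpoint` are hypothetical names made up for illustration, and "ckpt_latest.pth" is only an example value for "ckpt_name". Only the param names "resume", "ckpt_name", "resume_from_remote_sg_logger" and "save_checkpoints_remote" come from this PR.

```python
import shutil
import tempfile
from pathlib import Path

# Param names are taken from this PR; the values are illustrative assumptions.
training_params = {
    "resume": True,
    "ckpt_name": "ckpt_latest.pth",  # example file name, not prescribed by the PR
    "resume_from_remote_sg_logger": True,
    "sg_logger_params": {"save_checkpoints_remote": True},
}


class FakeRemoteLogger:
    """Hypothetical stand-in for an SG logger that stored checkpoints remotely."""

    def __init__(self, remote_dir: Path):
        self.remote_dir = remote_dir

    def download_checkpoint(self, ckpt_name: str, local_dir: Path) -> Path:
        # Hypothetical method: copy the "remote" file into the local checkpoints dir.
        local_dir.mkdir(parents=True, exist_ok=True)
        target = local_dir / ckpt_name
        shutil.copy(self.remote_dir / ckpt_name, target)
        return target


def resolve_resume_checkpoint(params, logger, checkpoints_dir):
    """Return the local checkpoint path to resume from, or None for a fresh run."""
    if not params.get("resume", False):
        return None
    ckpt_name = params["ckpt_name"]
    if params.get("resume_from_remote_sg_logger", False):
        # The logger must already be initialized here, hence the reordering in this PR.
        return logger.download_checkpoint(ckpt_name, checkpoints_dir)
    return checkpoints_dir / ckpt_name


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        remote = Path(tmp) / "remote"
        remote.mkdir()
        (remote / "ckpt_latest.pth").write_bytes(b"weights")
        local = Path(tmp) / "checkpoints"
        path = resolve_resume_checkpoint(training_params, FakeRemoteLogger(remote), local)
        print(path.name)
```

With "resume_from_remote_sg_logger" unset, the same helper falls back to the checkpoint already present in the local checkpoints directory, which mirrors the existing resume behaviour as described in this thread.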
Looks good. Only one comment: I don't understand why we need resume_from_remote_sg_logger when we already have the resumed flag.
resumed is just a flag we need for the wandb logger to continue logging properly (it is not explicitly passed but rather derived from resume in the training params). Even if it were meant for that, I think it's better to pass this parameter through the training params for clarity and simplicity (to me it's clearer, but I might be biased).
Looks good. Do you want to add a section on this feature to the docs? It is not quite clear at first glance how to use it. Some example snippets (for both use cases) would definitely help and smooth the learning curve.
Sure, I added a section in our checkpoints.md in the latest commit.