Resuming while using the same experiment folder

Open mehdidc opened this issue 1 year ago • 8 comments

Hello,

Together with @JeniaJitsev, we are trying to set up an auto-resume script that continues an experiment automatically when a run freezes or reaches the end of its reserved time in slurm. I noticed that when the experiment folder <logs>/<name> (set with --logs and --name) is re-used across runs, an error is raised ("Error. Experiment already exists at "). I have a fork where that error is simply ignored, and everything worked fine when auto-resuming: out.log was extended (append mode) instead of being cleared on each run, and the same held for results.jsonl for evaluation.
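To illustrate the behaviour described above, here is a minimal sketch of tolerating an existing experiment folder and appending to out.log. It is not the actual open_clip code: the function name `open_experiment_log` and the `allow_existing` switch are hypothetical, and only the folder layout <logs>/<name>/out.log is taken from the description above.

```python
import os


def open_experiment_log(logs_dir, name, allow_existing=True):
    """Create or re-use <logs>/<name> and open out.log without truncating it.

    Hypothetical helper for illustration only; not part of open_clip.
    """
    exp_dir = os.path.join(logs_dir, name)
    if os.path.isdir(exp_dir) and not allow_existing:
        # Strict behaviour: refuse to re-use an existing experiment folder.
        raise FileExistsError(f"Experiment already exists at {exp_dir}")
    os.makedirs(exp_dir, exist_ok=True)
    # Append mode ("a") keeps the lines written by earlier runs.
    return open(os.path.join(exp_dir, "out.log"), "a")
```

With `allow_existing=True`, a second run with the same folder adds to the log instead of failing or clearing it.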

Is there something I am missing, or could we simply allow re-using the same experiment folder? If that would work, I could open a PR. It would make things easier: we would have, e.g., a fixed slurm script and would not need to figure out the resume checkpoint path for each run (with --save-most-recent, there is a fixed path for resuming from the latest checkpoint, i.e. <logs>/<name>/checkpoints/epoch_latest.pt), and we would be able to have one experiment folder per model (if wanted).
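As a rough sketch of how a fixed launcher could work under this proposal: the helper below looks for <logs>/<name>/checkpoints/epoch_latest.pt and only adds a resume argument when that file exists, so the same command serves both the first run and every restart. The function names are hypothetical, and the `--resume <path>` flag is an assumption about the training script's interface; --logs, --name and --save-most-recent are the flags mentioned above.

```python
import os


def latest_checkpoint(logs_dir, name):
    """Return the fixed latest-checkpoint path if it exists, else None."""
    path = os.path.join(logs_dir, name, "checkpoints", "epoch_latest.pt")
    return path if os.path.isfile(path) else None


def build_train_args(logs_dir, name, extra_args=()):
    """Build the argument list for one (re)start of the experiment.

    Hypothetical helper; assumes the training script accepts --resume <path>.
    """
    args = ["--logs", logs_dir, "--name", name, "--save-most-recent", *extra_args]
    ckpt = latest_checkpoint(logs_dir, name)
    if ckpt is not None:
        # A previous run left a checkpoint behind: continue from it.
        args += ["--resume", ckpt]
    return args
```

The slurm script would then call the training entry point with `build_train_args(...)` on every submission, without any per-run edits.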

mehdidc, Sep 26 '22 13:09