open_clip
Resuming while using the same experiment folder
Hello,
Together with @JeniaJitsev, we are trying to set up an auto-resume script that continues experiments automatically when a job freezes or runs out of its reserved time in Slurm.
I noticed that when the experiment folder <logs>/<name> (set with --logs and --name) is re-used across runs, training raises an error ("Error. Experiment already exists at "). I have a fork where that error is simply ignored, and everything worked fine when auto-resuming. For example, out.log was extended (append mode) instead of being cleared on each run, and the same held for results.jsonl for evaluation.
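To illustrate what "append mode" means here, a minimal sketch (not the actual open_clip code; `append_result` is a hypothetical helper) that adds one line to a results.jsonl-style file without discarding lines written by earlier runs:

```python
import json
from pathlib import Path


def append_result(path: Path, record: dict) -> None:
    """Append one JSON line; lines written by previous runs are kept."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" extends the file, whereas mode "w" would truncate it on every run.
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Calling this from two separate runs that point at the same folder leaves both records in the file, which matches the behaviour observed in the fork.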
Is there something I am missing, or could we just allow re-using the same experiment folder? If that would work, I could open a PR.
That would make things easier: we could have, e.g., a fixed Slurm script and would not need to figure out the resume checkpoint path for each run (with --save-most-recent, there is a fixed path for resuming from the latest checkpoint, i.e. <logs>/<name>/checkpoints/epoch_latest.pt), and we would be able to have one experiment folder per model (if wanted).
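As a rough sketch of the auto-resume logic we have in mind (the helper names are hypothetical, and I am assuming the training script accepts a `--resume <checkpoint path>` argument; `--logs`, `--name` and `--save-most-recent` are the flags mentioned above), the fixed Slurm script could build its arguments like this:

```python
from pathlib import Path
from typing import List


def latest_checkpoint(logs: str, name: str) -> Path:
    """Fixed location of the most recent checkpoint for an experiment."""
    return Path(logs) / name / "checkpoints" / "epoch_latest.pt"


def build_train_args(logs: str, name: str) -> List[str]:
    """Build CLI arguments for one (re)submission of the same experiment."""
    args = ["--logs", logs, "--name", name, "--save-most-recent"]
    ckpt = latest_checkpoint(logs, name)
    if ckpt.is_file():
        # Assumption: --resume takes the path of the checkpoint to continue from.
        args += ["--resume", str(ckpt)]
    return args
```

On the first submission no checkpoint exists, so training starts from scratch. On every later submission the same script picks up epoch_latest.pt, provided the "experiment already exists" check no longer aborts the run.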