
fix _retrieve_checkpoint_dirpaths

Open JKSenthil opened this issue 1 year ago • 2 comments

Summary:

Context

For directories whose path contains _ outside the final epoch_0_step_0 component (e.g. tmp/fjad_213/epoch_0_step_0), _retrieve_checkpoint_dirpaths can raise an error. It splits the full path on _, assuming that underscores appear only in the final component of the path, where they separate the epoch and step counts.

>> ckpt_dirpaths.sort(key=lambda x: (int(x.split("_")[1]), int(x.split("_")[3])))
ValueError: invalid literal for int() with base 10: 'tmp/tmpcinmegj2/epoch'

This diff

When sorting the paths, this diff calls os.path.basename(path) first, so that only the epoch_0_step_0 part of the path is parsed.
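
A minimal sketch of the difference, not the actual patch: the helper names `sort_key_naive` and `sort_key_basename` and the sample paths are hypothetical and chosen only for illustration. It assumes every checkpoint directory name has the form epoch_<N>_step_<M>.

```python
import os


def sort_key_naive(path):
    # Splits the whole path on "_", so an underscore in a parent
    # directory shifts the indices and int() receives a non-numeric token.
    parts = path.split("_")
    return (int(parts[1]), int(parts[3]))


def sort_key_basename(path):
    # Only the last path component ("epoch_<N>_step_<M>") is parsed.
    parts = os.path.basename(path).split("_")
    return (int(parts[1]), int(parts[3]))


# Hypothetical paths; the parent directory "fjad_213" contains an underscore.
ckpt_dirpaths = [
    "tmp/fjad_213/epoch_1_step_20",
    "tmp/fjad_213/epoch_0_step_5",
    "tmp/fjad_213/epoch_0_step_0",
]

try:
    sorted(ckpt_dirpaths, key=sort_key_naive)
    naive_failed = False
except ValueError:
    # int("213/epoch") is not a valid integer literal.
    naive_failed = True

sorted_paths = sorted(ckpt_dirpaths, key=sort_key_basename)
```

Note that os.path.basename returns an empty string for a path with a trailing slash, so a caller using this approach would likely need to strip trailing separators first.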

Reviewed By: galrotem

Differential Revision: D51916358

JKSenthil • Dec 07 '23 18:12