tnt
fix _retrieve_checkpoint_dirpaths
Summary:
Context
For directories containing "_" in parts of the name other than epoch_0_step_0 (e.g. tmp/fjad_213/epoch_0_step_0), _retrieve_checkpoint_dirpaths can raise an error. It splits the full path on "_", assuming that underscores appear only in the final path component, where they separate the epoch and step counts.
>> ckpt_dirpaths.sort(key=lambda x: (int(x.split("_")[1]), int(x.split("_")[3])))
ValueError: invalid literal for int() with base 10: 'tmp/tmpcinmegj2/epoch'
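The failure can be reproduced without tnt. The snippet below is a minimal sketch that applies the same split to the example path from this summary; the variable names are illustrative and not taken from the library.

```python
# Example path from the summary: an underscore appears in a parent directory.
path = "tmp/fjad_213/epoch_0_step_0"

# Splitting the full path on "_" mixes parent-directory text into the pieces.
parts = path.split("_")

try:
    # The sort key expects parts[1] to be the epoch number.
    int(parts[1])
    error_message = None
except ValueError as exc:
    # int() rejects the piece because it still contains directory text.
    error_message = str(exc)
```

Here the second piece is "213/epoch" rather than the epoch number, so int() raises ValueError, matching the shape of the traceback above.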
This diff
When sorting the paths, call os.path.basename(path) first, so that only the epoch_0_step_0 part of the path is considered.
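A rough sketch of the sorting approach described here, assuming every directory name has the form epoch_N_step_M and no path ends with a trailing separator. The helper name sort_checkpoint_dirpaths is hypothetical and is not the actual tnt function.

```python
import os


def sort_checkpoint_dirpaths(ckpt_dirpaths):
    """Return the paths sorted by (epoch, step), read from the last path component."""

    def epoch_and_step(path):
        # Only the final component, e.g. "epoch_0_step_0", is split on "_",
        # so underscores in parent directories no longer matter.
        parts = os.path.basename(path).split("_")
        return int(parts[1]), int(parts[3])

    return sorted(ckpt_dirpaths, key=epoch_and_step)


example_paths = [
    "tmp/fjad_213/epoch_1_step_10",
    "tmp/fjad_213/epoch_0_step_5",
    "tmp/fjad_213/epoch_1_step_2",
]
sorted_paths = sort_checkpoint_dirpaths(example_paths)
```

Because the key is a tuple of integers, step 2 sorts before step 10 within the same epoch, which a plain string sort would not guarantee.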
Reviewed By: galrotem
Differential Revision: D51916358