[Checkpointing] fails out if checkpoint folder does not exist when using keep_latest_k
Bug description
Repro:
1. Specify a checkpoint location on the same server's file system (to work around the remote file system issue in issue 909).
2. Set keep_latest_k=4 to avoid filling up the drive.
3. Set the output folder to the desired name: 'checkpoints/checkpoint_asynctp'.
4. Run.
Result: the run fails at the first checkpointing attempt because the folder specified for saving checkpoints does not exist.
Expectation:
a - The user should not have to create a folder for checkpointing. The checkpointer should handle this automatically.
b - The checkpointer should arguably create the checkpoint folder at the start of training and warn or fail right then if there is a permission issue or other conflict with folder creation. Instead, the user finds out that checkpointing fails 250 iterations into an overnight run, crashing the run as below:
Root Cause (first observed failure):
[0]:
  time      : 2025-03-01_23:17:05
  host      : slurm-compute-node-79
  rank      : 426 (local_rank: 2)
  exitcode  : 1 (pid: 831046)
  error_file: /tmp/torchelastic_fnj4yjvt/101_razqu04x/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/home/ubuntu/miniconda3/envs/titan/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
      return f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/less/torchtitan/./torchtitan/train.py", line 407, in main
      checkpoint.save(
    File "/home/ubuntu/less/torchtitan/torchtitan/components/checkpoint.py", line 389, in save
      self._purge_stale_checkpoints()
    File "/home/ubuntu/less/torchtitan/torchtitan/components/checkpoint.py", line 472, in _purge_stale_checkpoints
      for filename in os.listdir(self.folder):
                      ^^^^^^^^^^^^^^^^^^^^^^^
  FileNotFoundError: [Errno 2] No such file or directory: './outputs/checkpoints/checkpoint_asynctp'
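The traceback shows os.listdir(self.folder) raising because the folder was never created before the purge step. Below is a minimal sketch of the behavior requested above. It is not TorchTitan's actual code. The function names ensure_checkpoint_folder and purge_stale_checkpoints are hypothetical, and the "step-<N>" subfolder naming is an assumption made for illustration.

```python
import os
import shutil
import tempfile


def ensure_checkpoint_folder(folder: str) -> None:
    """Create the checkpoint folder at startup and check that it is writable.

    Hypothetical helper: raises immediately (PermissionError, FileExistsError,
    etc.) instead of failing hundreds of iterations into the run.
    """
    os.makedirs(folder, exist_ok=True)
    # Probe write access by creating a throwaway file that is deleted on close.
    with tempfile.TemporaryFile(dir=folder):
        pass


def purge_stale_checkpoints(folder: str, keep_latest_k: int) -> list[str]:
    """Delete all but the newest keep_latest_k checkpoints; return removed paths.

    Assumes (for illustration only) that checkpoints live in subfolders
    named "step-<N>". A missing folder is treated as "nothing to purge".
    """
    if keep_latest_k <= 0 or not os.path.isdir(folder):
        return []
    found = []
    for name in os.listdir(folder):
        prefix, _, suffix = name.partition("-")
        if prefix == "step" and suffix.isdigit():
            found.append((int(suffix), os.path.join(folder, name)))
    found.sort()  # oldest step first
    stale = [path for _, path in found[:-keep_latest_k]]
    for path in stale:
        shutil.rmtree(path, ignore_errors=True)
    return stale
```

Calling ensure_checkpoint_folder once before the training loop would surface permission or path conflicts up front, and the isdir guard keeps the purge from crashing the run if the folder is still missing at the first save.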
Versions
PyTorch nightly: 2.7.0.dev20250302+cu126. TorchTitan: latest as of Mar 01, 2025.
Let me know if the latest TorchTitan fixes the issue.