[Checkpointing] Using keep_latest_k setting results in failure when using external mounted drive
Bug description
To avoid overflowing our host drive, we set keep_latest_k = 2 in the toml file for a 1400-GPU run. At the first checkpoint, the run failed with:
[rank1186]: Traceback (most recent call last):
[rank1186]: File "/home/ubuntu/less/torchtitan/./torchtitan/train.py", line 437, in <module>
[rank1186]: main(config)
[rank1186]: File "/home/ubuntu/miniconda3/envs/titan/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 354, in wrapper
[rank1186]: return f(*args, **kwargs)
[rank1186]: ^^^^^^^^^^^^^^^^^^
[rank1186]: File "/home/ubuntu/less/torchtitan/./torchtitan/train.py", line 407, in main
[rank1186]: checkpoint.save(
[rank1186]: File "/home/ubuntu/less/torchtitan/torchtitan/components/checkpoint.py", line 389, in save
[rank1186]: self._purge_stale_checkpoints()
[rank1186]: File "/home/ubuntu/less/torchtitan/torchtitan/components/checkpoint.py", line 472, in _purge_stale_checkpoints
[rank1186]: for filename in os.listdir(self.folder):
[rank1186]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank1186]: FileNotFoundError: [Errno 2] No such file or directory: './outputs/test'
The location specified for saving checkpoints was on a separately mounted NFS drive. Full toml section below:
[checkpoint]
enable_checkpoint = true
folder = "test" # "/mnt/data/checkpoints/checkpoint_asynctp"
interval_type = "steps"
interval = 250
model_weights_only = false
keep_latest_k = 2
export_dtype = "float32"
async_mode = "async_with_pinned_mem" # ["disabled", "async", "async_with_pinned_mem"]
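For reference, the failure is reproducible outside the trainer: os.listdir raises as soon as the target directory is missing, which can happen if the output folder on the mounted drive was never created (or the mount is not visible) on a given host. A minimal illustration, using the path from the traceback:

import os

checkpoint_folder = "./outputs/test"  # path from the traceback above

# If the folder does not exist on this host (e.g. the NFS mount is absent
# or the directory was never created), listing it raises the same error
# seen on rank 1186:
try:
    os.listdir(checkpoint_folder)
except FileNotFoundError as exc:
    print(exc)  # [Errno 2] No such file or directory: './outputs/test'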
Versions
PyTorch nightly: '2.7.0.dev20250228+cu128'
I think this won't happen with the latest torchtitan, since we added an os.path.isdir(self.folder) check. We probably still need to use fsspec to delete files so that purging also works on non-local filesystems.
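A minimal sketch of what that could look like, assuming checkpoint directories are named step-<N>; purge_stale_checkpoints and its parameters here are illustrative, not the actual torchtitan implementation:

import re
import fsspec

def purge_stale_checkpoints(folder: str, keep_latest_k: int) -> None:
    # Illustrative only: guard the purge with an existence check and go
    # through fsspec so deletion works beyond the local filesystem.
    if keep_latest_k <= 0:
        return
    fs, root = fsspec.core.url_to_fs(folder)
    if not fs.isdir(root):  # avoids the FileNotFoundError above
        return
    checkpoints = []
    for path in fs.ls(root, detail=False):
        match = re.search(r"step-(\d+)$", path)
        if match:
            checkpoints.append((int(match.group(1)), path))
    checkpoints.sort()  # ascending by step, oldest first
    for _, path in checkpoints[:-keep_latest_k]:
        fs.rm(path, recursive=True)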