
[Checkpointing] Using keep_latest_k setting results in failure when using external mounted drive

Open: lessw2020 opened this issue 10 months ago • 1 comment

Bug description

To avoid filling up our host drive, I set keep_latest_k = 2 in the TOML file for a 1400-GPU run. At the first checkpoint, the run failed with:

[rank1186]: Traceback (most recent call last):
[rank1186]:   File "/home/ubuntu/less/torchtitan/./torchtitan/train.py", line 437, in <module>
[rank1186]:     main(config)
[rank1186]:   File "/home/ubuntu/miniconda3/envs/titan/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 354, in wrapper
[rank1186]:     return f(*args, **kwargs)
[rank1186]:            ^^^^^^^^^^^^^^^^^^
[rank1186]:   File "/home/ubuntu/less/torchtitan/./torchtitan/train.py", line 407, in main
[rank1186]:     checkpoint.save(
[rank1186]:   File "/home/ubuntu/less/torchtitan/torchtitan/components/checkpoint.py", line 389, in save
[rank1186]:     self._purge_stale_checkpoints()
[rank1186]:   File "/home/ubuntu/less/torchtitan/torchtitan/components/checkpoint.py", line 472, in _purge_stale_checkpoints
[rank1186]:     for filename in os.listdir(self.folder):
[rank1186]:                     ^^^^^^^^^^^^^^^^^^^^^^^
[rank1186]: FileNotFoundError: [Errno 2] No such file or directory: './outputs/test'

The location specified for saving checkpoints was a different NFS drive. The full [checkpoint] section of the TOML file is below:

[checkpoint]
enable_checkpoint = true
folder = "test" # "/mnt/data/checkpoints/checkpoint_asynctp"
interval_type = "steps"
interval = 250
model_weights_only = false
keep_latest_k=2
export_dtype = "float32"
async_mode = "async_with_pinned_mem" # ["disabled", "async", "async_with_pinned_mem"]

Versions

PyTorch nightly: '2.7.0.dev20250228+cu128'

lessw2020 • Mar 01 '25 13:03

I think this won't happen with the latest TorchTitan, since we added an os.path.isdir(self.folder) check before purging. We probably need to use fsspec to delete files.

fegin • Mar 04 '25 07:03
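
For readers following the thread, here is a minimal sketch of what a guarded purge could look like. It is not TorchTitan's actual implementation: the function name purge_stale_checkpoints and the "step-N" directory naming are assumptions made for illustration, and only the os.path.isdir check comes from the comment above. It uses the local filesystem only, so it does not cover the fsspec suggestion.

```python
import os
import re
import shutil


def purge_stale_checkpoints(folder: str, keep_latest_k: int) -> list[str]:
    """Delete all but the newest keep_latest_k checkpoint directories.

    Hypothetical helper: assumes checkpoints are stored as sub-directories
    named "step-<N>" inside `folder`. Returns the paths that were deleted.
    """
    # Guard: a disabled setting or a folder that does not exist (yet) on this
    # host is a no-op, instead of letting os.listdir raise FileNotFoundError.
    if keep_latest_k <= 0 or not os.path.isdir(folder):
        return []

    found = []
    for name in os.listdir(folder):
        match = re.fullmatch(r"step-(\d+)", name)
        path = os.path.join(folder, name)
        if match and os.path.isdir(path):
            found.append((int(match.group(1)), path))

    # Sort by step number so the newest checkpoints are at the end.
    found.sort()
    stale = [path for _, path in found[:-keep_latest_k]]

    for path in stale:
        # ignore_errors tolerates another process removing the same directory.
        shutil.rmtree(path, ignore_errors=True)
    return stale
```

With keep_latest_k = 2 and directories step-250, step-500 and step-750, this sketch would remove only step-250. On a rank where the folder is missing, it returns an empty list rather than failing as in the traceback above.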