
[Checkpointing] fails out if checkpoint folder does not exist when using keep_latest_k

Open lessw2020 opened this issue 9 months ago • 1 comment

Bug description

Repro:
1. Specify the same server's file system for checkpoints (to work around the remote file system issue in issue 909).
2. Set keep_latest_k=4 to avoid overflowing the drive.
3. Set the output folder to the desired name: 'checkpoints/checkpoint_asynctp'.
4. Run.

Result: Fails at the first checkpointing attempt because the folder specified for saving checkpoints does not exist.

Expectation:
a - The user should not have to create a folder for checkpointing manually. The checkpointer is expected to handle this automatically.
b - The checkpointer should arguably create the checkpoint folder at the start of training and warn or fail right then, up front, if there is a permission issue or other conflict with folder creation. Instead, the user finds out that checkpointing fails 250 iterations into an overnight run, crashing the run as below:

Root Cause (first observed failure):
[0]:
  time      : 2025-03-01_23:17:05
  host      : slurm-compute-node-79
  rank      : 426 (local_rank: 2)
  exitcode  : 1 (pid: 831046)
  error_file: /tmp/torchelastic_fnj4yjvt/101_razqu04x/attempt_0/2/error.json
  traceback : Traceback (most recent call last):
    File "/home/ubuntu/miniconda3/envs/titan/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
      return f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/less/torchtitan/./torchtitan/train.py", line 407, in main
      checkpoint.save(
    File "/home/ubuntu/less/torchtitan/torchtitan/components/checkpoint.py", line 389, in save
      self._purge_stale_checkpoints()
    File "/home/ubuntu/less/torchtitan/torchtitan/components/checkpoint.py", line 472, in _purge_stale_checkpoints
      for filename in os.listdir(self.folder):
                      ^^^^^^^^^^^^^^^^^^^^^^^
  FileNotFoundError: [Errno 2] No such file or directory: './outputs/checkpoints/checkpoint_asynctp'
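
For illustration, here is a minimal sketch of the two behaviors requested above: creating and validating the checkpoint folder up front, and making the stale-checkpoint scan tolerate a missing folder. This is not TorchTitan's actual code. The helper names `ensure_checkpoint_folder` and `stale_checkpoints` are hypothetical, and the `step-<N>` subfolder naming and the "keep_latest_k <= 0 means keep everything" rule are assumptions made for the example.

```python
import os
import re
import tempfile

# Assumption: each checkpoint lives in a subfolder named "step-<N>".
_STEP_DIR = re.compile(r"^step-(\d+)$")


def ensure_checkpoint_folder(folder: str) -> str:
    """Hypothetical startup check: create `folder` and verify it is writable."""
    # exist_ok=True keeps this safe if several ranks create the folder at once.
    # A FileExistsError is still raised if the path exists but is not a directory.
    os.makedirs(folder, exist_ok=True)
    # Probe write access now, so permission problems surface before training starts.
    with tempfile.NamedTemporaryFile(dir=folder):
        pass
    return os.path.abspath(folder)


def stale_checkpoints(folder: str, keep_latest_k: int) -> list[str]:
    """Hypothetical purge helper: paths of all but the newest `keep_latest_k` checkpoints."""
    # Assumption: keep_latest_k <= 0 means "keep everything".
    # A missing folder simply means there is nothing to purge yet.
    if keep_latest_k <= 0 or not os.path.isdir(folder):
        return []
    steps = []
    for name in os.listdir(folder):
        match = _STEP_DIR.match(name)
        if match:  # ignore anything that is not a step folder
            steps.append((int(match.group(1)), os.path.join(folder, name)))
    steps.sort()  # oldest step first
    return [path for _, path in steps[:-keep_latest_k]]
```

Calling something like `ensure_checkpoint_folder` once during checkpointer setup would turn a failure 250 iterations in into an immediate error at launch, while the `os.path.isdir` guard avoids the `FileNotFoundError` from `os.listdir` in the traceback above.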

Versions

PyTorch nightly: 2.7.0.dev20250302+cu126
Latest Titan as of Mar 01, 2025.

lessw2020, Mar 02 '25 20:03

Let me know if the latest TorchTitan fixes the issue.

fegin, Mar 04 '25 07:03