
Error in FSDP with composer

Open · bjoernpl opened this issue 2 years ago · 2 comments

When finetuning an MPT-7B model on 8 GPUs, I get the following error just as training is about to begin (after model and dataset loading, etc.):

Traceback (most recent call last):
  File "scripts/train/train.py", line 254, in <module>
    main(cfg)
  File "scripts/train/train.py", line 197, in main
    trainer = Trainer(
  File "/mnt/nvme/home/llm_foundry/lib/python3.8/site-packages/composer/trainer/trainer.py", line 1330, in __init__
    self._rng_state = checkpoint.load_checkpoint(
  File "/mnt/nvme/home/llm_foundry/lib/python3.8/site-packages/composer/utils/checkpoint.py", line 216, in load_checkpoint
    rng_state_dicts = _restore_checkpoint(
  File "/mnt/nvme/home/llm_foundry/lib/python3.8/site-packages/composer/utils/checkpoint.py", line 446, in _restore_checkpoint
    state_dict = safe_torch_load(composer_states_filepath)
  File "/mnt/nvme/home/llm_foundry/lib/python3.8/site-packages/composer/utils/checkpoint.py", line 421, in safe_torch_load
    state_dict = torch.load(composer_states_filepath, map_location=map_location)
  File "/mnt/nvme/home/llm_foundry/lib/python3.8/site-packages/torch/serialization.py", line 771, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/mnt/nvme/home/llm_foundry/lib/python3.8/site-packages/torch/serialization.py", line 270, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/mnt/nvme/home/llm_foundry/lib/python3.8/site-packages/torch/serialization.py", line 251, in __init__
    super(_open_file, self).__init__(open(name, mode))
IsADirectoryError: [Errno 21] Is a directory: '/tmp/tmp7_pd5dbu/rank0_checkpoint'

The referenced rank0_checkpoint and its parent directory (tmp7_pd5dbu, or whatever it is named for that particular run) do not exist when I check.

Any ideas on this?

bjoernpl · May 10 '23 20:05

Can you add the config you are running?

vchiley · May 11 '23 05:05

Essentially running the finetuning config from scripts/train/yamls/mpt/finetune/7b_dolly_sft.yaml with the Hugging Face MPT-7B model on a server with 8x A100 80GB.

bjoernpl · May 11 '23 11:05

Hi @bjoernpl, the load_path: ... in that YAML is intended to point at a Composer checkpoint that holds your pretrained model. I think you may be pointing it at an HF folder checkpoint, which is resulting in an IsADirectoryError.
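
To illustrate (the paths below are hypothetical, so substitute your own), the difference is roughly:

# what the pretraining-style YAML expects: a single Composer checkpoint file
load_path: s3://my-bucket/my-run/checkpoints/ep0-ba1000-rank0.pt

# what we suspect is happening: load_path pointing at an HF snapshot directory,
# which torch.load cannot open, hence the IsADirectoryError
load_path: /path/to/local/mpt-7b/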

To finetune our mosaicml/mpt-7b model that is available on the HF Hub, you'll want to use this slightly different YAML: https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/yamls/finetune/mpt-7b_dolly_sft.yaml

This one does not use load_path at all; instead it initializes an hf_causal_lm with the weights of mosaicml/mpt-7b from the HF Hub. Please let me know if this works! You can also find more finetuning instructions, which we recently upgraded, here: https://github.com/mosaicml/llm-foundry/tree/main/scripts/train#llm-finetuning
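
Roughly, the model section of that YAML looks something like this (check the linked file for the exact fields and values):

model:
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b

Since there is no load_path, nothing tries to torch.load your HF folder, and the weights come in through the usual Hugging Face from_pretrained path instead.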

abhi-mosaic · May 18 '23 20:05