llm-foundry
Error in FSDP with Composer
When finetuning an MPT-7B model on 8 GPUs, I get the following error just as training is about to begin (after model and dataset loading, etc.):
Traceback (most recent call last):
  File "scripts/train/train.py", line 254, in <module>
    main(cfg)
  File "scripts/train/train.py", line 197, in main
    trainer = Trainer(
  File "/mnt/nvme/home/llm_foundry/lib/python3.8/site-packages/composer/trainer/trainer.py", line 1330, in __init__
    self._rng_state = checkpoint.load_checkpoint(
  File "/mnt/nvme/home/llm_foundry/lib/python3.8/site-packages/composer/utils/checkpoint.py", line 216, in load_checkpoint
    rng_state_dicts = _restore_checkpoint(
  File "/mnt/nvme/home/llm_foundry/lib/python3.8/site-packages/composer/utils/checkpoint.py", line 446, in _restore_checkpoint
    state_dict = safe_torch_load(composer_states_filepath)
  File "/mnt/nvme/home/llm_foundry/lib/python3.8/site-packages/composer/utils/checkpoint.py", line 421, in safe_torch_load
    state_dict = torch.load(composer_states_filepath, map_location=map_location)
  File "/mnt/nvme/home/llm_foundry/lib/python3.8/site-packages/torch/serialization.py", line 771, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/mnt/nvme/home/llm_foundry/lib/python3.8/site-packages/torch/serialization.py", line 270, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/mnt/nvme/home/llm_foundry/lib/python3.8/site-packages/torch/serialization.py", line 251, in __init__
    super(_open_file, self).__init__(open(name, mode))
IsADirectoryError: [Errno 21] Is a directory: '/tmp/tmp7_pd5dbu/rank0_checkpoint'
The referenced rank0_checkpoint and its parent directory tmp7_pd5dbu (or whatever it is named for that run) do not exist when I check.
Any idea on this?
Can you add the config you are running?
Essentially I'm running the finetuning config from scripts/train/yamls/mpt/finetune/7b_dolly_sft.yaml with the Hugging Face MPT-7B model on a server with 8x A100 80GB.
Hi @bjoernpl, the load_path: ... in that YAML is intended to point at a Composer checkpoint that holds your pretrained model. I think you may be pointing it at an HF folder checkpoint, which results in an IsADirectoryError.
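For illustration, here is a minimal sketch of the difference; the paths below are hypothetical and only the shape of the value matters:

```yaml
# Composer-style usage: load_path points at a single checkpoint *file*
# produced by a Composer training run (hypothetical path shown).
load_path: s3://my-bucket/mpt-7b-pretrain/ep0-ba48000-rank0.pt

# What likely happened here: load_path pointed at a Hugging Face snapshot
# *directory*, so torch.load was handed a directory instead of a file and
# raised IsADirectoryError.
# load_path: /models/mpt-7b/    # <- a directory, not a checkpoint file
```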
To finetune our mosaicml/mpt-7b model that is available on the HF Hub, you'll want to use this slightly different YAML: https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/yamls/finetune/mpt-7b_dolly_sft.yaml
This one does not use any load_path; instead, it initializes a hf_causal_lm model with the weights from mosaicml/mpt-7b on the HF Hub. Please let me know if this works! You can find more finetuning instructions, which we recently upgraded, here: https://github.com/mosaicml/llm-foundry/tree/main/scripts/train#llm-finetuning
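For reference, the model section of that finetuning YAML looks roughly like the sketch below (abridged; check the linked YAML for the exact, current fields):

```yaml
# Rough sketch of the relevant part of mpt-7b_dolly_sft.yaml: the model is
# built directly from the HF Hub weights, so no Composer load_path is needed.
model:
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b
# Note: no load_path entry; Composer only uses load_path to resume from or
# warm-start with a Composer checkpoint file.
```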