open_flamingo
Multi-GPU training crashes with "Cannot copy out of meta tensor; no data"
I cloned the repo and ran the provided training command on 1 node with 2 GPUs. It failed with the stack trace below. I've made no changes to the repo. The machine has 8x A100 GPUs. The same command works fine on a single GPU.
Expected Behavior
Training should run on 2 GPUs the same way it does on a single GPU.
Current Behavior
Training crashes while loading the language model. Full logs here: logs. Excerpt:
Traceback (most recent call last):
  File "/home/fsuser/open_flamingo/open_flamingo/train/train.py", line 484, in <module>
    main()
  File "/home/fsuser/open_flamingo/open_flamingo/train/train.py", line 260, in main
    model, image_processor, tokenizer = create_model_and_transforms(
  File "/home/fsuser/open_flamingo/open_flamingo/src/factory.py", line 57, in create_model_and_transforms
    lang_encoder = AutoModelForCausalLM.from_pretrained(
  File "/home/fsuser/miniconda3/envs/dothings/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 511, in from_pretrained
    return model_class.from_pretrained(
  File "/home/fsuser/miniconda3/envs/dothings/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3084, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/fsuser/miniconda3/envs/dothings/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3525, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MosaicGPT:
While copying the parameter named "transformer.wte.weight", whose dimensions in the model are torch.Size([50432, 2048]) and whose dimensions in the checkpoint are torch.Size([50432, 2048]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
While copying the parameter named "transformer.blocks.0.ln_1.weight", whose dimensions in the model are torch.Size([2048]) and whose dimensions in the checkpoint are torch.Size([2048]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
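For context on the message itself: a tensor on PyTorch's `meta` device carries only shape and dtype, with no storage behind it, so any attempt to read its data fails. The sketch below is only an illustration of that behavior in isolation. It does not reproduce this crash or identify its cause, and the helper name `copy_from_meta` is made up for the example.

```python
def copy_from_meta():
    """Try to materialize a meta tensor on CPU and return the error text.

    Returns None if PyTorch is not installed in the current environment.
    The function name is hypothetical; it is not part of open_flamingo.
    """
    try:
        import torch
    except ImportError:
        return None

    # A meta tensor records shape and dtype but allocates no data.
    t = torch.empty(4, device="meta")
    try:
        # Moving it to a real device would require reading data that does not exist.
        t.to("cpu")
    except RuntimeError as err:  # NotImplementedError is a RuntimeError subclass
        return str(err)
    return "no error"


if __name__ == "__main__":
    print(copy_from_meta())
```

The text printed here should contain the same "Cannot copy out of meta tensor" wording that appears in the `state_dict` errors above, which suggests that some tensor involved in loading was still on the `meta` device at copy time.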
Steps to Reproduce
torchrun --nnodes=1 --nproc_per_node=2 train.py \
--lm_path anas-awadalla/mpt-1b-redpajama-200b \
--tokenizer_path anas-awadalla/mpt-1b-redpajama-200b \
--cross_attn_every_n_layers 2 \
--dataset_resampled \
--batch_size_mmc4 1 \
--batch_size_laion 2 \
--train_num_samples_mmc4 100 \
--train_num_samples_laion 200 \
--loss_multiplier_laion 0.2 \
--workers=4 \
--run_name OpenFlamingo-3B-vitl-mpt1b \
--num_epochs 10 \
--warmup_steps 5 \
--mmc4_textsim_threshold 0.24 \
--laion_shards "..." \
--mmc4_shards "..."
Environment
- Python 3.9
- Installed requirements from `requirements.txt`
- `conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia`
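Since the traceback goes through `transformers`, exact package versions may matter here. A minimal sketch for collecting them, assuming the relevant packages are `torch`, `torchvision`, and `transformers` (that list is my guess, and `collect_versions` is a hypothetical helper, not something from the repo):

```python
import platform
from importlib import metadata


def collect_versions(packages=("torch", "torchvision", "transformers")):
    """Return a dict of interpreter, platform, and installed package versions."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Run inside the same conda environment used for training so the reported versions match the ones in the traceback paths.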