
Multi-GPU training crashes with "Cannot copy out of meta tensor; no data"

Open rohan-mehta opened this issue 1 year ago • 0 comments

I cloned the repo and ran the provided training command from here on 1 node with 2 GPUs. It failed with the stack trace below. I've made no changes to the repo. The machine has 8 A100 GPUs. The same command works fine on a single GPU.

Expected Behavior

The training run should load the language model and start training on 2 GPUs, as it does on a single GPU.

Current Behavior

The run crashes when loading the language model. Full logs here: logs. Excerpt:

Traceback (most recent call last):
  File "/home/fsuser/open_flamingo/open_flamingo/train/train.py", line 484, in <module>
    main()
  File "/home/fsuser/open_flamingo/open_flamingo/train/train.py", line 260, in main
    model, image_processor, tokenizer = create_model_and_transforms(
  File "/home/fsuser/open_flamingo/open_flamingo/src/factory.py", line 57, in create_model_and_transforms
    lang_encoder = AutoModelForCausalLM.from_pretrained(
  File "/home/fsuser/miniconda3/envs/dothings/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 511, in from_pretrained
    return model_class.from_pretrained(
  File "/home/fsuser/miniconda3/envs/dothings/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3084, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/fsuser/miniconda3/envs/dothings/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3525, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MosaicGPT:
        While copying the parameter named "transformer.wte.weight", whose dimensions in the model are torch.Size([50432, 2048]) and whose dimensions in the checkpoint are torch.Size([50432, 2048]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
        While copying the parameter named "transformer.blocks.0.ln_1.weight", whose dimensions in the model are torch.Size([2048]) and whose dimensions in the checkpoint are torch.Size([2048]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
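Note that in both messages the model and checkpoint dimensions are identical, so this does not look like a shape mismatch. A minimal sketch for checking that across the full log, assuming every per-parameter message follows the format in the excerpt above (the helper names `parse_dims` and `parse_copy_errors` are hypothetical, not part of open_flamingo or transformers):

```python
import re

# Matches the per-parameter messages in the RuntimeError excerpt above.
# Assumption: all messages in the full log use this same wording.
PATTERN = re.compile(
    r'parameter named "(?P<name>[^"]+)", whose dimensions in the model are '
    r'torch\.Size\(\[(?P<model>[\d, ]*)\]\) and whose dimensions in the '
    r'checkpoint are torch\.Size\(\[(?P<ckpt>[\d, ]*)\]\)'
)


def parse_dims(text):
    """Turn '50432, 2048' into (50432, 2048)."""
    return tuple(int(part) for part in text.split(",") if part.strip())


def parse_copy_errors(log_text):
    """Return (name, model_shape, checkpoint_shape) for each failed parameter."""
    return [
        (m["name"], parse_dims(m["model"]), parse_dims(m["ckpt"]))
        for m in PATTERN.finditer(log_text)
    ]


# The two messages quoted in the excerpt; replace with the full log text.
EXCERPT = '''
While copying the parameter named "transformer.wte.weight", whose dimensions in the model are torch.Size([50432, 2048]) and whose dimensions in the checkpoint are torch.Size([50432, 2048]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
While copying the parameter named "transformer.blocks.0.ln_1.weight", whose dimensions in the model are torch.Size([2048]) and whose dimensions in the checkpoint are torch.Size([2048]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
'''

errors = parse_copy_errors(EXCERPT)
mismatched = [name for name, model, ckpt in errors if model != ckpt]
print(f"{len(errors)} failed parameters, {len(mismatched)} with mismatched shapes")
```

On the two messages in the excerpt this reports no mismatched shapes, which points at the copy itself rather than at the checkpoint contents.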

Steps to Reproduce

torchrun --nnodes=1 --nproc_per_node=2 train.py \
  --lm_path anas-awadalla/mpt-1b-redpajama-200b \
  --tokenizer_path anas-awadalla/mpt-1b-redpajama-200b \
  --cross_attn_every_n_layers 2 \
  --dataset_resampled \
  --batch_size_mmc4 1 \
  --batch_size_laion 2 \
  --train_num_samples_mmc4 100 \
  --train_num_samples_laion 200 \
  --loss_multiplier_laion 0.2 \
  --workers=4 \
  --run_name OpenFlamingo-3B-vitl-mpt1b \
  --num_epochs 10 \
  --warmup_steps 5 \
  --mmc4_textsim_threshold 0.24 \
  --laion_shards "..." \
  --mmc4_shards "..."

Environment

- Python 3.9
- Installed requirements from `requirements.txt`
- `conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia`

rohan-mehta, Aug 18 '23 15:08