
Stage 2: RuntimeError: The expanded size of the tensor

Open · cksthf3211 opened this issue 1 year ago · 1 comment

I'm encountering a RuntimeError during training caused by a tensor size mismatch. Below is the traceback:

Environment: PyTorch 2.1.0+cu118, CUDA 11.8, Python 3.11.10, Ubuntu 18.04

```
Traceback (most recent call last):
  File "/media/path/SmartEdit-main/train/DS_MLLMSD11_train.py", line 712, in <module>
    train()
  File "/media/path/SmartEdit-main/train/DS_MLLMSD11_train.py", line 501, in train
    model_.load_pretrain_MLLM_alignment(SD_QFormer_conversation_33tokens=SD_QFormer_conversation_33tokens, LLaVA_00002=LLaVA_00002)
  File "/media/path/SmartEdit-main/model/DS_MLLMSD11_model.py", line 221, in load_pretrain_MLLM_alignment
    self.lm_head.weight.data[-self.config.num_new_tokens:] = LLaMA_lm_haed
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The expanded size of the tensor (35) must match the existing size (33) at non-singleton dimension 0. Target sizes: [35, 4096]. Tensor sizes: [33, 4096]
[2024-11-01 11:52:23,929] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 149061
```

The error occurs when assigning the `LLaMA_lm_haed` tensor to `self.lm_head.weight.data[-self.config.num_new_tokens:]`. Here `num_new_tokens` is 35, but `LLaMA_lm_haed` has only 33 rows, so the in-place assignment fails with a dimension mismatch.

The two tensors must have the same shape for this assignment to succeed.
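The failing assignment can be reproduced in isolation. This is a minimal sketch, not the SmartEdit code: the 100-row base vocabulary is a stand-in for the real LLaMA vocab, and the partial-copy workaround at the end is an assumption — whether the 2 leftover rows may stay randomly initialized depends on which extra tokens stage 2 adds beyond the 33 stage-1 tokens.

```python
import torch

# Shapes from the traceback: the model reserved 35 new rows in lm_head,
# but the stage-1 checkpoint stores only 33 new-token embeddings.
num_new_tokens = 35                    # stand-in for self.config.num_new_tokens
lm_head_weight = torch.zeros(100 + num_new_tokens, 4096)  # 100 base rows (illustrative)
ckpt_new_rows = torch.ones(33, 4096)   # stand-in for LLaMA_lm_haed

try:
    # Same shape mismatch as in the issue: target [35, 4096] vs source [33, 4096].
    lm_head_weight[-num_new_tokens:] = ckpt_new_rows
except RuntimeError as e:
    print(e)

# Hedged workaround (assumption): copy only the rows the checkpoint provides,
# leaving the remaining slots at their initial values.
n = ckpt_new_rows.shape[0]
lm_head_weight[-num_new_tokens:-num_new_tokens + n] = ckpt_new_rows
```

The cleaner fix is likely to make `num_new_tokens` agree with the checkpoint (33) or to use a stage-1 checkpoint trained with 35 new tokens, rather than patching the copy.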

Run the training script with the following configuration:

```shell
bash scripts/MLLMSD_7b.sh

# wandb disabled
export WANDB_DISABLED=true

# checkpoint-150000_embeddings_qformer.bin -> checkpoint-50000.bin

deepspeed --include localhost:0 --master_addr 127.0.0.1 --master_port 28457 train/DS_MLLMSD11_train.py \
    --max_steps 5000 \
    --model_name_or_path ./checkpoints/vicuna-7b-v1-1 \
    --LLaVA_00001 "./checkpoints/LLaVA-7B-v1/pytorch_model-00001-of-00002.bin" \
    --LLaVA_00002 "./checkpoints/LLaVA-7B-v1/pytorch_model-00002-of-00002.bin" \
    --LLaVA_model_path "./checkpoints/LLaVA-7B-v1" \
    --sd_qformer_version "v1.1-7b" \
    --unet_ckpt "./checkpoints/InstructDiffusion_diffusers/unet/diffusion_pytorch_model.bin" \
    --bf16 True \
    --tf32 True \
    --output_dir ./checkpoints/stage2_MLLMSD_7b \
    --num_train_epochs 20 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy 'no' \
    --save_strategy 'steps' \
    --save_steps 5000 \
    --save_total_limit 3 \
    --learning_rate 1e-5 \
    --lr_scheduler_type 'cosine' \
    --weight_decay 0. \
    --warmup_ratio 0.001 \
    --logging_steps 1 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 16 \
    --ddp_find_unused_parameters True \
    --SD_QFormer_conversation_33tokens "./checkpoints/stage1_CC12M_alignment_7b/embeddings_qformer/checkpoint-50000.bin" \
    --InstructPix2PixDataset_path "./dataset/InstructPix2PixCLIPFiltered_HF" \
    --MagicBrushDataset_path "./dataset/MagicBrush_HF" \
    --LLaVADataset_data_path "./dataset/LLaVA/llava_instruct_150k.json" \
    --LLaVADataset_image_folder "./dataset/coco/train2017" \
    --refcoco_path "./dataset/refcoco" \
    --grefcoco_path "./dataset/grefcoco" \
    --coco_image_path "./dataset/coco" \
    --COCOStuff_mask_path "./dataset/cocostuff" \
    --ReasoningEditingDataset_path "./dataset/SyntheticData/SyntheticData_info_new.json" \
    --ReasoningSegmentationDataset_json_path "./dataset/reason_seg/train" \
    --ReasoningSegmentationDataset_image_path "./dataset/reason_seg/train" \
    --ReasoningSegmentationDataset_binary_mask_path "./dataset/reason_seg/train_binary_mask" \
    --deepspeed scripts/zero2_mixed.json
```
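Before launching, it may help to confirm how many new-token rows the checkpoint passed as `--SD_QFormer_conversation_33tokens` actually contains. The sketch below builds a tiny stand-in checkpoint so it is self-contained; the key name `lm_head_new_tokens` is an assumption for illustration, so inspect the real file's keys rather than relying on it.

```python
import torch

# Stand-in for the stage-1 checkpoint (real path:
# ./checkpoints/stage1_CC12M_alignment_7b/embeddings_qformer/checkpoint-50000.bin).
torch.save({"lm_head_new_tokens": torch.zeros(33, 4096)}, "ckpt_demo.bin")

# Load on CPU and list every tensor's shape, so a 33-vs-35 mismatch in the
# new-token dimension can be spotted before training starts.
ckpt = torch.load("ckpt_demo.bin", map_location="cpu")
shapes = {k: tuple(v.shape) for k, v in ckpt.items() if torch.is_tensor(v)}
print(shapes)
```

If the printed new-token dimension is 33 while the stage-2 config expects 35, the checkpoint and config disagree, which matches the traceback above.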

How do you solve this problem?

cksthf3211 avatar Nov 01 '24 03:11 cksthf3211

Have you solved the problem?

baihuple avatar Jun 06 '25 16:06 baihuple