[Bug] Tensor size mismatch warning when finetuning InternVL2-Llama3-76B
### Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
### Describe the bug
When I finetune the 76B model with max_seq_length=2048, I get this warning:

```
warning: The size of tensor a (2253) must match the size of tensor b (3584) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2253, 8192]), vit_embeds.shape=torch.Size([3584, 8192])
```
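If it helps triage: the message looks like it comes from the step where IMG_CONTEXT placeholder embeddings are overwritten with the ViT embeddings, and the counts suggest truncation to max_seq_length is cutting off image placeholder tokens (only 2253 placeholders survive, but the batch carries 3584 patch embeddings). A minimal sketch of the mismatch; the names and the placeholder token id here are illustrative, not the actual InternVL identifiers:

```python
import torch

hidden = 8192
seq_len = 2 * 2048                      # per_device_train_batch_size=2, flattened and truncated
IMG_CONTEXT_ID = 1                      # hypothetical placeholder token id

# The batch carries 3584 ViT patch embeddings in total ...
vit_embeds = torch.randn(3584, hidden)

# ... but after truncation only 2253 IMG_CONTEXT placeholders remain in input_ids.
input_ids = torch.zeros(seq_len, dtype=torch.long)
input_ids[:2253] = IMG_CONTEXT_ID

input_embeds = torch.randn(seq_len, hidden)
selected = input_ids == IMG_CONTEXT_ID

try:
    # Overwriting 2253 selected rows with 3584 rows fails to broadcast:
    input_embeds[selected] = input_embeds[selected] * 0.0 + vit_embeds
except RuntimeError as e:
    print(f"warning: {e}")
    # warning: The size of tensor a (2253) must match the size of tensor b (3584)
    # at non-singleton dimension 0
```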
### Reproduction
```shell
set -x
export NCCL_DEBUG=ERROR

model_name="internvl2-76B"
OUTPUT_DIR=work_dirs/internvl2/$model_name

if [ ! -d "$OUTPUT_DIR" ]; then
  mkdir -p "$OUTPUT_DIR"
fi

deepspeed --hostfile ./host.txt internvl/train/internvl_chat_finetune.py \
  --model_name_or_path "/InternVL2-Llama3-76B" \
  --conv_style "internlm2-chat" \
  --output_dir ${OUTPUT_DIR} \
  --meta_path "test.json" \
  --overwrite_output_dir True \
  --force_image_size 448 \
  --max_dynamic_patch 12 \
  --down_sample_ratio 0.5 \
  --drop_path_rate 0.1 \
  --freeze_llm False \
  --freeze_mlp False \
  --freeze_backbone False \
  --vision_select_layer -1 \
  --dataloader_num_workers 4 \
  --bf16 True \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 2 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 1000 \
  --save_total_limit 1 \
  --learning_rate 1e-6 \
  --weight_decay 0.01 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 2048 \
  --do_train True \
  --grad_checkpoint True \
  --group_by_length True \
  --dynamic_image_size True \
  --use_thumbnail True \
  --ps_version 'v2' \
  --deepspeed "zero_stage3_config_70b.json" \
  --report_to "none" \
  2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt"
```
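As a side note, the numbers are consistent with max_seq_length=2048 being too small for the image tokens alone. A back-of-the-envelope check, assuming a ViT patch size of 14 (as in InternViT-6B):

```python
patches_per_tile = (448 // 14) ** 2                # 1024 patches per 448x448 tile
tokens_per_tile = int(patches_per_tile * 0.5**2)   # pixel shuffle at down_sample_ratio=0.5 -> 256
print(3584 // tokens_per_tile)                     # 14 tiles worth of patch embeddings in the batch
# With max_dynamic_patch=12 plus the thumbnail, a single image can already need
# 13 * 256 = 3328 image tokens, which cannot fit in max_seq_length=2048.
```

If that reading is right, raising max_seq_length or lowering max_dynamic_patch should make the warning go away, but I would like to confirm.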
### Environment
.
### Error traceback
No response