[Bug] Tensor size mismatch warning when finetuning InternVL2-Llama3-76B
### Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
### Describe the bug
When I finetune the 76B model with max_seq_length=2048, I get this warning:

```
warning: The size of tensor a (2253) must match the size of tensor b (3584) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2253, 8192]), vit_embeds.shape=torch.Size([3584, 8192])
```
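If it helps triage: the message looks like it comes from the step where IMG_CONTEXT placeholder embeddings are overwritten with the ViT embeddings, and the counts suggest truncation to max_seq_length is cutting off image placeholder tokens (only 2253 placeholders survive, but the batch carries 3584 patch embeddings). A minimal sketch of the mismatch; the names and the placeholder token id here are illustrative, not the actual InternVL identifiers:

```python
import torch

hidden = 8192
seq_len = 2 * 2048                      # per_device_train_batch_size=2, flattened and truncated
IMG_CONTEXT_ID = 1                      # hypothetical placeholder token id

# The batch carries 3584 ViT patch embeddings in total ...
vit_embeds = torch.randn(3584, hidden)

# ... but after truncation only 2253 IMG_CONTEXT placeholders remain in input_ids.
input_ids = torch.zeros(seq_len, dtype=torch.long)
input_ids[:2253] = IMG_CONTEXT_ID

input_embeds = torch.randn(seq_len, hidden)
selected = input_ids == IMG_CONTEXT_ID

try:
    # Overwriting 2253 selected rows with 3584 rows fails to broadcast:
    input_embeds[selected] = input_embeds[selected] * 0.0 + vit_embeds
except RuntimeError as e:
    print(f"warning: {e}")
    # warning: The size of tensor a (2253) must match the size of tensor b (3584)
    # at non-singleton dimension 0
```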
### Reproduction
```shell
set -x
export NCCL_DEBUG=ERROR

model_name="internvl2-76B"
OUTPUT_DIR=work_dirs/internvl2/$model_name

if [ ! -d "$OUTPUT_DIR" ]; then
  mkdir -p "$OUTPUT_DIR"
fi

deepspeed --hostfile ./host.txt internvl/train/internvl_chat_finetune.py \
  --model_name_or_path "/InternVL2-Llama3-76B" \
  --conv_style "internlm2-chat" \
  --output_dir ${OUTPUT_DIR} \
  --meta_path "test.json" \
  --overwrite_output_dir True \
  --force_image_size 448 \
  --max_dynamic_patch 12 \
  --down_sample_ratio 0.5 \
  --drop_path_rate 0.1 \
  --freeze_llm False \
  --freeze_mlp False \
  --freeze_backbone False \
  --vision_select_layer -1 \
  --dataloader_num_workers 4 \
  --bf16 True \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 2 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 1000 \
  --save_total_limit 1 \
  --learning_rate 1e-6 \
  --weight_decay 0.01 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 2048 \
  --do_train True \
  --grad_checkpoint True \
  --group_by_length True \
  --dynamic_image_size True \
  --use_thumbnail True \
  --ps_version 'v2' \
  --deepspeed "zero_stage3_config_70b.json" \
  --report_to "none" \
  2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt"
```
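As a side note, the numbers are consistent with max_seq_length=2048 being too small for the image tokens alone. A back-of-the-envelope check, assuming a ViT patch size of 14 (as in InternViT-6B):

```python
patches_per_tile = (448 // 14) ** 2                # 1024 patches per 448x448 tile
tokens_per_tile = int(patches_per_tile * 0.5**2)   # pixel shuffle at down_sample_ratio=0.5 -> 256
print(3584 // tokens_per_tile)                     # 14 tiles worth of patch embeddings in the batch
# With max_dynamic_patch=12 plus the thumbnail, a single image can already need
# 13 * 256 = 3328 image tokens, which cannot fit in max_seq_length=2048.
```

If that reading is right, raising max_seq_length or lowering max_dynamic_patch should make the warning go away, but I would like to confirm.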
### Environment
.
### Error traceback
No response