ValueError when running Qwen2.5-VL 3B past 200 steps
I am getting seemingly random shape errors after about 200 training steps.
The error is raised from https://github.com/volcengine/verl/blob/main/verl/workers/actor/dp_actor.py#L127:
ValueError: Image features and image tokens do not match: tokens: 2601, features 2600.
My training script is below. By 200+ steps the geo3k data has already been traversed multiple times, so I wonder why this happens. The step at which it occurs also varies from run to run. Thank you.
set -x
export MMRL_ACC=1
ENGINE=${1:-vllm}
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=/xxx/verl_data/train.parquet \
data.val_files=/xxx/verl_data/test.parquet \
data.train_batch_size=256 \
data.max_prompt_length=1024 \
data.max_response_length=2048 \
data.filter_overlong_prompts=True \
data.truncation='error' \
data.image_key=images \
actor_rollout_ref.model.path=/xxx/Qwen2.5-VL-3B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=32 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.01 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=$ENGINE \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.enable_chunked_prefill=False \
actor_rollout_ref.rollout.enforce_eager=False \
actor_rollout_ref.rollout.free_cache_engine=False \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console'] \
trainer.project_name='verl_grpo_example_geo3k' \
trainer.experiment_name='qwen2_5_vl_7b_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=5 \
actor_rollout_ref.actor.use_torch_compile=False \
trainer.val_before_train=False \
trainer.total_epochs=150 $@
Sorry, I don't have an answer to your question, but I'm trying to run the same example script on Qwen 2.5 VL 3B, and I keep getting the following error:
TypeError: Qwen2_5_VLForConditionalGeneration.forward() got an unexpected keyword argument 'temperature'
I'm using the stable docker image whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.3, and I also get this error for the 7B model.
Same problem here, but I found its source. The latest commit (https://github.com/volcengine/verl/commit/4779f2616428a525746bdfb65be447bcdca3012e) may have introduced temperature into the actor_module forward pass. You can try commenting out the relevant lines (L131 and L199 in verl/workers/actor/dp_actor.py).
I had the same problem: Image features and image tokens do not match: tokens: 5261, features 5260.
Have you solved it now?
I haven't solved it yet, but I suspect the problem is caused by the data or the tokenizer.
I found that the model generated an <|image_pad|> token in its response. You can decode input_ids and check whether that is the cause in your case too.
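A minimal sketch of that check, assuming the stray token is Qwen2.5-VL's image placeholder `<|image_pad|>`. The id 151655 is an assumption about the tokenizer vocabulary, and `find_polluted_responses` is a hypothetical helper, not part of verl. It works on plain lists of token ids so it can be run on whatever the rollout returns.

```python
from typing import List, Sequence

# Assumption: 151655 is the id of <|image_pad|> in the Qwen2.5-VL tokenizer.
# Verify with tokenizer.convert_tokens_to_ids("<|image_pad|>") before relying on it.
IMAGE_PAD_ID = 151655


def count_image_tokens(token_ids: Sequence[int], image_token_id: int = IMAGE_PAD_ID) -> int:
    """Count how many times the image placeholder id occurs in a sequence."""
    return sum(1 for t in token_ids if t == image_token_id)


def find_polluted_responses(
    responses: Sequence[Sequence[int]], image_token_id: int = IMAGE_PAD_ID
) -> List[int]:
    """Return indices of generated responses that contain an image placeholder.

    A response should normally contain none; each stray placeholder adds one
    image token that has no matching image feature.
    """
    return [
        i
        for i, response in enumerate(responses)
        if count_image_tokens(response, image_token_id) > 0
    ]
```

If this returns a non-empty list for a batch that later fails, the extra token in the error message (for example 2601 tokens against 2600 features) would be consistent with the model having generated the placeholder itself.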
Hi @zyx1213271098, could you also share the code snippet you used to solve the issue?
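Is there a solution?
When this problem occurs, the model has most likely already collapsed during training. I added a try block and skip the problematic step.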
Is this error only present in qwen2.5-vl-3B? I encountered the same issue, and even after updating to the latest code I still get the same error. And indeed, every time I hit this problem the training has already collapsed, with model performance declining continuously during training. Could you please tell me whether the performance decline is caused by a problem in the verl code or by my own configuration?
Same issues in Qwen2.5VL 32B.
Same problem. I am training Qwen2.5VL 7B on multimodal data. Has anyone found any new clues?
I ran into the same problem during inference with qwen2.5 vl 7b.
Here’s something similar:
For me, the issue came from truncation in rl_dataset. I use 'left' truncation, and since the image placeholder token sits at the start of each sample in my dataset, it can be truncated off. The image_placeholder is then removed, leaving fewer image tokens than features. In your case it is the opposite: the number of tokens exceeds the number of features, which is strange.
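The truncation effect described above can be illustrated on plain token lists. This is only a sketch: `truncate` and `placeholders_preserved` are hypothetical helpers, and it assumes 'left' truncation means dropping tokens from the left and keeping the last `max_len` tokens.

```python
from typing import List, Sequence


def truncate(ids: Sequence[int], max_len: int, side: str) -> List[int]:
    """Truncate a token sequence to max_len, dropping tokens on the given side."""
    ids = list(ids)
    if len(ids) <= max_len:
        return ids
    if side == "left":
        return ids[-max_len:]  # drop the leading tokens, keep the tail
    if side == "right":
        return ids[:max_len]  # drop the trailing tokens, keep the head
    raise ValueError(f"unknown truncation side: {side!r}")


def placeholders_preserved(
    original: Sequence[int], truncated: Sequence[int], image_token_id: int
) -> bool:
    """True if truncation kept every image placeholder token."""
    return list(original).count(image_token_id) == list(truncated).count(image_token_id)


# Toy example: 9 stands in for the image placeholder id.
IMG = 9
prompt = [IMG, IMG, 1, 2, 3, 4]
left = truncate(prompt, 4, "left")    # placeholders at the start are cut
right = truncate(prompt, 4, "right")  # placeholders at the start survive
```

Running a check like `placeholders_preserved` on each sample after truncation would flag the "fewer tokens than features" case before it reaches the model.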
May I ask how you added the try block? Is it in ray_trainer.py?
@Chenzhou2344, @zyx1213271098 Were you able to solve the problem? I need your help.
@onehaitao Were you able to solve the problem? I need your help.
@HanshuYAN Were you able to solve the problem? I need your help.
I have the same problem.
File "/home/wangnn/anaconda3/envs/verl/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1250, in forward
    raise ValueError(
ValueError: Image features and image tokens do not match: tokens: 10271, features 10605
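One way to locate the offending sample before the forward pass is to compare both counts yourself. The sketch below assumes that each image contributes `t * h * w // spatial_merge_size**2` features, with `spatial_merge_size=2` as in the published Qwen2.5-VL configs, and that `image_grid_thw` is available as a list of `(t, h, w)` triples. The function names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def expected_image_features(
    image_grid_thw: Iterable[Tuple[int, int, int]], spatial_merge_size: int = 2
) -> int:
    """Number of image features implied by the patch grids.

    Assumption: the vision tower merges spatial_merge_size**2 patches into
    one feature, so each image yields t * h * w // spatial_merge_size**2.
    """
    merge_length = spatial_merge_size**2
    return sum((t * h * w) // merge_length for t, h, w in image_grid_thw)


def count_mismatch(
    input_ids: Sequence[int],
    image_grid_thw: Iterable[Tuple[int, int, int]],
    image_token_id: int,
    spatial_merge_size: int = 2,
) -> Tuple[int, int]:
    """Return (n_image_tokens, n_expected_features) for one sample."""
    n_tokens = sum(1 for t in input_ids if t == image_token_id)
    n_features = expected_image_features(image_grid_thw, spatial_merge_size)
    return n_tokens, n_features
```

Logging the pair for every sample in a failing batch should show whether the mismatch comes from too many tokens (for example a generated placeholder) or too few (for example truncation).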
@HanshuYAN, I have the same problem. Please help.
I wrapped everything from the start of the dataloader step to its end in a try block. If an exception is raised, I do not increment the global step.
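A minimal sketch of that pattern, not the actual ray_trainer.py code: `run_training`, `train_step`, and `max_skips` are hypothetical names introduced for illustration. The cap on skipped steps is an addition here, since an earlier comment notes that this error often means the training has already collapsed.

```python
from typing import Any, Callable, Iterable, Tuple


def run_training(
    batches: Iterable[Any],
    train_step: Callable[[Any], None],
    max_skips: int = 10,
) -> Tuple[int, int]:
    """Run train_step on each batch, skipping batches that raise ValueError.

    Returns (global_step, skipped). global_step only counts successful steps.
    """
    global_step = 0
    skipped = 0
    for batch in batches:
        try:
            # Everything belonging to one step (rollout, log-probs, update)
            # would go inside this block.
            train_step(batch)
        except ValueError as exc:
            skipped += 1
            print(f"skipping batch after error: {exc}")
            if skipped > max_skips:
                # Too many failures: stop instead of hiding a collapsed run.
                raise
            continue
        global_step += 1  # incremented only when the step succeeded
    return global_step, skipped
```

Catching only `ValueError` keeps other failures (out-of-memory, communication errors) visible instead of silently skipping them.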
I encountered this problem, and found my side of issue was kind of stupid: I forgot to put