Cannot run geo3k multiturn example
System Info
I use the official image app-verl0.6-transformers4.56.1-sglang0.5.2-mcore0.13.0-te2.2
----------Python Info----------
Version  : 3.12.3
Compiler : GCC 13.3.0
Build    : ('main', 'Feb 4 2025 14:48:35')
Arch     : ('64bit', 'ELF')
------------Pip Info-----------
Version   : 25.2
Directory : /usr/local/lib/python3.12/dist-packages/pip
vllm   : not found.
sglang : 0.5.2
ray    : 2.49.2
torch  : 2.8.0
----------verl Info-----------
Version     : 0.5.0.dev
Directory   : /app/verl/verl
Commit Hash : 362ebfbcaf6d37c50003fef60f2176f9f76aaeb2
----------Platform Info----------
Platform : Linux-5.4.241-1-tlinux4-0017.7-x86_64-with-glibc2.39
system   : Linux
node     : TENCENT64.site
release  : 5.4.241-1-tlinux4-0017.7
version  : #1 SMP Thu Jan 18 11:33:00 CST 2024
----------Environment----------
CUDA Runtime  : 12.8
CUDA Compiler : Cuda compilation tools, release 12.8, V12.8.93
----------System Info----------
CPU Memory : 2265.25 GB
GPU Count  : 8
GPU 1 Type   : NVIDIA H20
GPU 1 Memory : 95.58 GB
GPU 2 Type   : NVIDIA H20
GPU 2 Memory : 95.58 GB
GPU 3 Type   : NVIDIA H20
GPU 3 Memory : 95.58 GB
GPU 4 Type   : NVIDIA H20
GPU 4 Memory : 95.58 GB
GPU 5 Type   : NVIDIA H20
GPU 5 Memory : 95.58 GB
GPU 6 Type   : NVIDIA H20
GPU 6 Memory : 95.58 GB
GPU 7 Type   : NVIDIA H20
GPU 7 Memory : 95.58 GB
GPU 8 Type   : NVIDIA H20
GPU 8 Memory : 95.58 GB
Running the multiturn example `bash examples/sglang_multiturn/geo3k/run_qwen2.5-3b_geo3k_multiturn.sh` fails with the following error.
ray.exceptions.RayTaskError(ValueError): ray::WorkerDict.actor_rollout_compute_log_prob() (pid=340926, ip=29.177.195.134, actor_id=c4162d864d53bb90020f271101000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7ef66b0bc140>)
  File "/app/verl/verl/single_controller/ray/base.py", line 700, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/app/verl/verl/single_controller/base/decorator.py", line 433, in inner
    return func(*args, **kwargs)
  File "/app/verl/verl/utils/profiler/profile.py", line 256, in wrapper
    return func(self_instance, *args, **kwargs_inner)
  File "/app/verl/verl/workers/fsdp_workers.py", line 958, in compute_log_prob
    output, entropys = self.actor.compute_log_prob(data=data, calculate_entropy=True)
  File "/app/verl/verl/utils/profiler/performance.py", line 105, in f
    return self.log(decorated_function, *args, **kwargs)
  File "/app/verl/verl/utils/profiler/performance.py", line 118, in log
    output = func(*args, **kwargs)
  File "/app/verl/verl/workers/actor/dp_actor.py", line 339, in compute_log_prob
    entropy, log_probs = self._forward_micro_batch(
  File "/app/verl/verl/workers/actor/dp_actor.py", line 170, in _forward_micro_batch
    output = self.actor_module(
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 854, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/verl/verl/models/transformers/qwen2_vl.py", line 474, in forward_with_normal_backend
    outputs = qwen2_vl_forward(self, input_ids, **kwargs)
  File "/app/verl/verl/models/transformers/qwen2_vl.py", line 447, in qwen2_vl_forward
    position_ids=process_position_ids(position_ids),
  File "/app/verl/verl/models/transformers/qwen2_vl.py", line 397, in process_position_ids
    raise ValueError("position_ids should be a 3D tensor of shape (4, batch_size, seq_length).")
ValueError: position_ids should be a 3D tensor of shape (4, batch_size, seq_length).
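For anyone triaging: the check itself is easy to reproduce in isolation. Below is a minimal sketch, with the body of `process_position_ids` reconstructed from the error message rather than copied from the verl source; it shows why a 3-row (mrope-only) tensor trips the check.

```python
import torch

def process_position_ids(position_ids: torch.Tensor) -> torch.Tensor:
    # Reconstructed from the error message (not the actual verl source):
    # qwen2_vl presumably expects one row of text position ids stacked on
    # top of the 3 mrope (temporal/height/width) rows.
    if position_ids.dim() != 3 or position_ids.size(0) != 4:
        raise ValueError("position_ids should be a 3D tensor of shape (4, batch_size, seq_length).")
    return position_ids

batch_size, seq_len = 2, 16
# A 3-row mrope-only tensor, like what the worker appears to receive.
mrope_only = torch.zeros(3, batch_size, seq_len, dtype=torch.long)
try:
    process_position_ids(mrope_only)
except ValueError as e:
    print(e)  # position_ids should be a 3D tensor of shape (4, batch_size, seq_length).
```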
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- pull the official docker image
- run into container
- pull verl (commit 362ebfbcaf6d37c50003fef60f2176f9f76aaeb2)
- pip install .
- python examples/data_preprocess/geo3k_multiturn_w_tool.py
- bash examples/sglang_multiturn/geo3k/run_qwen2.5-3b_geo3k_multiturn.sh
Expected behavior
The example should run to completion.
Hi, have you solved the issue? This PR works for me: https://github.com/volcengine/verl/pull/3653
I haven't tested it yet. Are you using the same image, app-verl0.6-transformers4.56.1-sglang0.5.2-mcore0.13.0-te2.2?
I don't use the docker image. I installed verl in a custom environment:
flashinfer-python==0.2.9rc2
torch==2.7.1
sgl-kernel==0.2.8
sglang==0.4.10.post2
torch_memory_saver==0.0.8
torchao==0.9.0
torchaudio==2.7.1
torchdata==0.11.0
torchvision==0.22.1
xformers==0.0.31
xgrammar==0.1.21
vllm==0.10.1.1
transformers==4.55.4
It might be that my CUDA version is relatively low (12.2), so sglang can't be used, but it does run with vllm.
Did you let the training finish? In my experiment, the validation phase runs properly, but the training phase still seems to error out. I am using the latest commit of verl.
(TaskRunner pid=90905) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.actor_rollout_compute_log_prob() (pid=128760, ip=29.177.195.134, actor_id=6e7a2884496d1f808c89aa4401000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7fec1d4dc350>)
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) File "/app/verl/verl/single_controller/ray/base.py", line 700, in func
(TaskRunner pid=90905) return getattr(self.worker_dict[key], name)(*args, **kwargs)
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) File "/app/verl/verl/single_controller/base/decorator.py", line 433, in inner
(TaskRunner pid=90905) return func(*args, **kwargs)
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) File "/app/verl/verl/utils/profiler/profile.py", line 256, in wrapper
(TaskRunner pid=90905) return func(self_instance, *args, **kwargs_inner)
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) File "/app/verl/verl/workers/fsdp_workers.py", line 962, in compute_log_prob
(TaskRunner pid=90905) output, entropys = self.actor.compute_log_prob(data=data, calculate_entropy=True)
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) File "/app/verl/verl/utils/profiler/performance.py", line 105, in f
(TaskRunner pid=90905) return self.log(decorated_function, *args, **kwargs)
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) File "/app/verl/verl/utils/profiler/performance.py", line 118, in log
(TaskRunner pid=90905) output = func(*args, **kwargs)
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) File "/app/verl/verl/workers/actor/dp_actor.py", line 339, in compute_log_prob
(TaskRunner pid=90905) entropy, log_probs = self._forward_micro_batch(
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) File "/app/verl/verl/workers/actor/dp_actor.py", line 170, in _forward_micro_batch
(TaskRunner pid=90905) output = self.actor_module(
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(TaskRunner pid=90905) return self._call_impl(*args, **kwargs)
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(TaskRunner pid=90905) return forward_call(*args, **kwargs)
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) File "/usr/local/lib/python3.12/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 854, in forward
(TaskRunner pid=90905) output = self._fsdp_wrapped_module(*args, **kwargs)
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(TaskRunner pid=90905) return self._call_impl(*args, **kwargs)
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(TaskRunner pid=90905) return forward_call(*args, **kwargs)
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) File "/app/verl/verl/models/transformers/qwen2_vl.py", line 473, in forward_with_normal_backend
(TaskRunner pid=90905) outputs = qwen2_vl_forward(self, input_ids, **kwargs)
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) File "/app/verl/verl/models/transformers/qwen2_vl.py", line 446, in qwen2_vl_forward
(TaskRunner pid=90905) position_ids=process_position_ids(position_ids),
(TaskRunner pid=90905) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905) File "/app/verl/verl/models/transformers/qwen2_vl.py", line 399, in process_position_ids
(TaskRunner pid=90905) raise ValueError("position_ids should be a 3D tensor of shape (4, batch_size, seq_length).")
(TaskRunner pid=90905) ValueError: position_ids should be a 3D tensor of shape (4, batch_size, seq_length).
The verl commit I used: cf619d68d4b15c736ff62c26cd16739c81556e94
Yes, I let the training finish. It seems the error is still related to https://github.com/volcengine/verl/pull/3653.
It's hard to tell from the trace whether the error is related to the agent loop.
But https://github.com/volcengine/verl/pull/3653/files#diff-7d8baa13741a3ba9bfed072c1eb75619c83af4442d59598dca587e4fb49f9a3a seems to produce position_ids of shape (batch_size, 4, seq_length), whereas verl/models/transformers/qwen2_vl.py expects (4, batch_size, seq_length).
I am not sure whether this mismatch is the actual cause; see the sketch below.
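If so, a quick shape check would confirm it. Here is a hypothetical normalization helper (the name `normalize_position_ids` and its heuristic are mine, not from verl or the PR) that flips a (batch_size, 4, seq_length) tensor into the layout qwen2_vl.py expects:

```python
import torch

def normalize_position_ids(position_ids: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: coerce position_ids into (4, batch_size, seq_length).

    Heuristic only: assumes the mrope axis has exactly 4 rows, so a
    (batch_size, 4, seq_length) tensor just needs its first two dims swapped.
    """
    if position_ids.dim() == 3 and position_ids.size(1) == 4:
        return position_ids.transpose(0, 1).contiguous()
    return position_ids

# (batch_size=2, 4, seq_length=8) -> (4, 2, 8)
print(normalize_position_ids(torch.zeros(2, 4, 8, dtype=torch.long)).shape)
```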
same
I encountered the same issue.
When I tried to use the code from a few months ago, no error occurred.
After checking the current version, I found that the new implementation concatenates text_position_ids into position_ids (https://github.com/volcengine/verl/blob/main/verl/utils/dataset/rl_dataset.py#L317). Accordingly, verl/models/transformers/qwen2_vl.py expects the shape (4, batch_size, seq_length), but the actual input has shape (3, batch_size, seq_length). Could the error be due to the text_position_ids concatenation being skipped somewhere along the path? A sketch of the intended layout follows.
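To illustrate the layout change described above, here is a minimal sketch. It is my illustration of the linked rl_dataset.py behavior, not the verbatim verl code, and the text-row-first ordering is an assumption:

```python
import torch

seq_len = 8
# 3 mrope rows (temporal / height / width), as produced for Qwen2-VL inputs.
vision_position_ids = torch.zeros(3, seq_len, dtype=torch.long)
# 1 row of ordinary 0..seq_len-1 text position ids.
text_position_ids = torch.arange(seq_len, dtype=torch.long).unsqueeze(0)

# What the current rl_dataset reportedly does: concatenate to 4 rows.
# (Row order is my assumption; see rl_dataset.py for the actual layout.)
position_ids = torch.cat([text_position_ids, vision_position_ids], dim=0)
print(position_ids.shape)  # torch.Size([4, 8]); batched -> (4, batch_size, seq_len)
```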
Any progress? I ran into the same issue.
Regarding https://github.com/volcengine/verl/blob/main/verl/utils/dataset/rl_dataset.py#L317: if verl receives data from rl_dataset and the rollout is done via AsyncRolloutRequest, would it be enough to modify just the rl_dataset part, rather than making changes to vllm or sglang?
I looked through the code, and modifying this part doesn't seem particularly difficult.
It seems like `AsyncRolloutRequest` will update the position_ids and discard `prompts['position_ids']`.
> It seems like `AsyncRolloutRequest` will update the position_ids and discard `prompts['position_ids']`.
Yes, `AsyncRolloutRequest._get_position_ids` should be updated. Alternatively, we can use the agent loop instead, by setting `rollout.mode=async`. A rough sketch of the position-id update is below.
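For anyone experimenting, this is hypothetical: the real `AsyncRolloutRequest._get_position_ids` lives in verl's sglang rollout code with a different signature. The sketch only shows how 4-row position ids could be extended for newly generated text-only tokens, where all four rows advance together:

```python
import torch

def extend_position_ids(position_ids: torch.Tensor, num_new_tokens: int) -> torch.Tensor:
    """Hypothetical sketch of the multiturn update: append positions for newly
    generated (text-only) tokens to an existing (4, seq_length) tensor.

    For pure-text continuation, all 4 rows (text + t/h/w mrope) advance
    together, so each appended entry is the row's last value plus an offset.
    """
    last = position_ids[:, -1:]  # (4, 1): last position per row
    delta = torch.arange(1, num_new_tokens + 1, dtype=position_ids.dtype)
    return torch.cat([position_ids, last + delta.unsqueeze(0)], dim=-1)

ids = torch.zeros(4, 10, dtype=torch.long)
print(extend_position_ids(ids, 5).shape)  # torch.Size([4, 15])
```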