
Cannot run geo3k multiturn example

Open huaiyizhao opened this issue 2 months ago • 13 comments

System Info

I use the official image app-verl0.6-transformers4.56.1-sglang0.5.2-mcore0.13.0-te2.2


----------Python Info----------
Version : 3.12.3
Compiler : GCC 13.3.0
Build : ('main', 'Feb 4 2025 14:48:35')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 25.2
Directory : /usr/local/lib/python3.12/dist-packages/pip
vllm : not found.
sglang : 0.5.2
ray : 2.49.2
torch : 2.8.0
----------verl Info-----------
Version : 0.5.0.dev
Directory : /app/verl/verl
Commit Hash : 362ebfbcaf6d37c50003fef60f2176f9f76aaeb2
----------Platform Info----------
Platform : Linux-5.4.241-1-tlinux4-0017.7-x86_64-with-glibc2.39
system : Linux
node : TENCENT64.site
release : 5.4.241-1-tlinux4-0017.7
version : #1 SMP Thu Jan 18 11:33:00 CST 2024
----------Environment----------
CUDA Runtime : 12.8
CUDA Compiler : Cuda compilation tools, release 12.8, V12.8.93
----------System Info----------
CPU Memory : 2265.25 GB
GPU Count : 8
GPU 1 Type : NVIDIA H20
GPU 1 Memory : 95.58 GB
GPU 2 Type : NVIDIA H20
GPU 2 Memory : 95.58 GB
GPU 3 Type : NVIDIA H20
GPU 3 Memory : 95.58 GB
GPU 4 Type : NVIDIA H20
GPU 4 Memory : 95.58 GB
GPU 5 Type : NVIDIA H20
GPU 5 Memory : 95.58 GB
GPU 6 Type : NVIDIA H20
GPU 6 Memory : 95.58 GB
GPU 7 Type : NVIDIA H20
GPU 7 Memory : 95.58 GB
GPU 8 Type : NVIDIA H20
GPU 8 Memory : 95.58 GB


Running the multiturn example bash examples/sglang_multiturn/geo3k/run_qwen2.5-3b_geo3k_multiturn.sh fails with the following error.

ray.exceptions.RayTaskError(ValueError): ray::WorkerDict.actor_rollout_compute_log_prob() (pid=340926, ip=29.177.195.134, actor_id=c4162d864d53bb90020f271101000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7ef66b0bc140>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/verl/verl/single_controller/ray/base.py", line 700, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/verl/verl/single_controller/base/decorator.py", line 433, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/verl/verl/utils/profiler/profile.py", line 256, in wrapper
    return func(self_instance, *args, **kwargs_inner)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/verl/verl/workers/fsdp_workers.py", line 958, in compute_log_prob
    output, entropys = self.actor.compute_log_prob(data=data, calculate_entropy=True)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/verl/verl/utils/profiler/performance.py", line 105, in f
    return self.log(decorated_function, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/verl/verl/utils/profiler/performance.py", line 118, in log
    output = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/app/verl/verl/workers/actor/dp_actor.py", line 339, in compute_log_prob
    entropy, log_probs = self._forward_micro_batch(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/verl/verl/workers/actor/dp_actor.py", line 170, in _forward_micro_batch
    output = self.actor_module(
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 854, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/verl/verl/models/transformers/qwen2_vl.py", line 474, in forward_with_normal_backend
    outputs = qwen2_vl_forward(self, input_ids, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/verl/verl/models/transformers/qwen2_vl.py", line 447, in qwen2_vl_forward
    position_ids=process_position_ids(position_ids),
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/verl/verl/models/transformers/qwen2_vl.py", line 397, in process_position_ids
    raise ValueError("position_ids should be a 3D tensor of shape (4, batch_size, seq_length).")
ValueError: position_ids should be a 3D tensor of shape (4, batch_size, seq_length).
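For reference, the failing guard can be reconstructed from the last traceback frame. This is a simplified sketch for illustration only, not the full verl implementation:

import torch

def process_position_ids(position_ids: torch.Tensor) -> torch.Tensor:
    # Simplified guard matching the error message above: the Qwen2-VL path
    # wants 4 position rows in the leading dimension of a 3D tensor.
    if position_ids.dim() != 3 or position_ids.size(0) != 4:
        raise ValueError("position_ids should be a 3D tensor of shape (4, batch_size, seq_length).")
    return position_ids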

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. pull the official docker image
  2. start and enter the container
  3. pull verl (commit 362ebfbcaf6d37c50003fef60f2176f9f76aaeb2)
  4. pip install .
  5. python examples/data_preprocess/geo3k_multiturn_w_tool.py
  6. bash examples/sglang_multiturn/geo3k/run_qwen2.5-3b_geo3k_multiturn.sh

Expected behavior

The example should run correctly.

huaiyizhao avatar Sep 29 '25 09:09 huaiyizhao

Hi, have you solved the issue? This PR works for me: https://github.com/volcengine/verl/pull/3653

0001Henry avatar Oct 01 '25 04:10 0001Henry

I haven't tested it yet. Are you using the same image, app-verl0.6-transformers4.56.1-sglang0.5.2-mcore0.13.0-te2.2?

huaiyizhao avatar Oct 01 '25 04:10 huaiyizhao

I don't use the docker image. I installed from a custom environment:

flashinfer-python==0.2.9rc2
torch==2.7.1
sgl-kernel==0.2.8
sglang==0.4.10.post2
torch_memory_saver==0.0.8
torchao==0.9.0
torchaudio==2.7.1
torchdata==0.11.0
torchvision==0.22.1
xformers==0.0.31
xgrammar==0.1.21
vllm==0.10.1.1
transformers==4.55.4

My CUDA version is relatively low (12.2), which might be why sglang can't be used, but the run works with vllm.

0001Henry avatar Oct 01 '25 05:10 0001Henry

Did you let the training finish? In my experiment, the validation phase runs properly, but the training phase still errors out. I'm using the latest commit of verl.

(TaskRunner pid=90905) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.actor_rollout_compute_log_prob() (pid=128760, ip=29.177.195.134, actor_id=6e7a2884496d1f808c89aa4401000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7fec1d4dc350>)
(TaskRunner pid=90905)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)   File "/app/verl/verl/single_controller/ray/base.py", line 700, in func
(TaskRunner pid=90905)     return getattr(self.worker_dict[key], name)(*args, **kwargs)
(TaskRunner pid=90905)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)   File "/app/verl/verl/single_controller/base/decorator.py", line 433, in inner
(TaskRunner pid=90905)     return func(*args, **kwargs)
(TaskRunner pid=90905)            ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)   File "/app/verl/verl/utils/profiler/profile.py", line 256, in wrapper
(TaskRunner pid=90905)     return func(self_instance, *args, **kwargs_inner)
(TaskRunner pid=90905)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)   File "/app/verl/verl/workers/fsdp_workers.py", line 962, in compute_log_prob
(TaskRunner pid=90905)     output, entropys = self.actor.compute_log_prob(data=data, calculate_entropy=True)
(TaskRunner pid=90905)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)   File "/app/verl/verl/utils/profiler/performance.py", line 105, in f
(TaskRunner pid=90905)     return self.log(decorated_function, *args, **kwargs)
(TaskRunner pid=90905)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)   File "/app/verl/verl/utils/profiler/performance.py", line 118, in log
(TaskRunner pid=90905)     output = func(*args, **kwargs)
(TaskRunner pid=90905)              ^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)   File "/app/verl/verl/workers/actor/dp_actor.py", line 339, in compute_log_prob
(TaskRunner pid=90905)     entropy, log_probs = self._forward_micro_batch(
(TaskRunner pid=90905)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)   File "/app/verl/verl/workers/actor/dp_actor.py", line 170, in _forward_micro_batch
(TaskRunner pid=90905)     output = self.actor_module(
(TaskRunner pid=90905)              ^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(TaskRunner pid=90905)     return self._call_impl(*args, **kwargs)
(TaskRunner pid=90905)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(TaskRunner pid=90905)     return forward_call(*args, **kwargs)
(TaskRunner pid=90905)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 854, in forward
(TaskRunner pid=90905)     output = self._fsdp_wrapped_module(*args, **kwargs)
(TaskRunner pid=90905)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(TaskRunner pid=90905)     return self._call_impl(*args, **kwargs)
(TaskRunner pid=90905)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(TaskRunner pid=90905)     return forward_call(*args, **kwargs)
(TaskRunner pid=90905)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)   File "/app/verl/verl/models/transformers/qwen2_vl.py", line 473, in forward_with_normal_backend
(TaskRunner pid=90905)     outputs = qwen2_vl_forward(self, input_ids, **kwargs)
(TaskRunner pid=90905)               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)   File "/app/verl/verl/models/transformers/qwen2_vl.py", line 446, in qwen2_vl_forward
(TaskRunner pid=90905)     position_ids=process_position_ids(position_ids),
(TaskRunner pid=90905)                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=90905)   File "/app/verl/verl/models/transformers/qwen2_vl.py", line 399, in process_position_ids
(TaskRunner pid=90905)     raise ValueError("position_ids should be a 3D tensor of shape (4, batch_size, seq_length).")
(TaskRunner pid=90905) ValueError: position_ids should be a 3D tensor of shape (4, batch_size, seq_length).

huaiyizhao avatar Oct 10 '25 03:10 huaiyizhao

For reference, the verl commit I'm on: cf619d68d4b15c736ff62c26cd16739c81556e94

huaiyizhao avatar Oct 10 '25 03:10 huaiyizhao

Yes, I let the training finish. It seems that the error is still related to https://github.com/volcengine/verl/pull/3653?

0001Henry avatar Oct 10 '25 04:10 0001Henry

It's hard to tell from the trace whether the error is related to the agent loop.
But https://github.com/volcengine/verl/pull/3653/files#diff-7d8baa13741a3ba9bfed072c1eb75619c83af4442d59598dca587e4fb49f9a3a seems to produce data of shape (batch_size, 4, seq_length), whereas verl/models/transformers/qwen2_vl.py expects shape (4, batch_size, seq_length). I am not sure whether this is the mismatch.
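A minimal illustration of that suspected mismatch, using toy tensors only (not the verl code): a batch-first layout of shape (batch_size, 4, seq_length) would need a transpose before it satisfies the check in qwen2_vl.py.

import torch

batch_size, seq_len = 2, 8
# Hypothetical batch-first layout coming out of the data path: (batch_size, 4, seq_length)
pos_batch_first = torch.zeros(batch_size, 4, seq_len, dtype=torch.long)

# The guard in the traceback expects the 4 rows in the leading dimension:
# (4, batch_size, seq_length), so a transpose would be needed somewhere.
pos_rows_first = pos_batch_first.transpose(0, 1)
assert pos_rows_first.shape == (4, batch_size, seq_len)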

huaiyizhao avatar Oct 10 '25 05:10 huaiyizhao

same

HJYao00 avatar Oct 13 '25 09:10 HJYao00

I encountered the same issue.

When I tried to use the code from a few months ago, no error occurred.

After checking the current version, I found that the new implementation concatenates text_position_ids into position_ids (https://github.com/volcengine/verl/blob/main/verl/utils/dataset/rl_dataset.py#L317). As a result, verl/models/transformers/qwen2_vl.py now expects the shape (4, batch_size, seq_length), but the actual input has shape (3, batch_size, seq_length). Could this be due to a missing concatenation of text_position_ids?
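To make the shapes concrete, here is a toy sketch of the concatenation described above (illustrative only; the row order and the actual rl_dataset code may differ):

import torch

batch_size, seq_len = 2, 8
# 3-row mrope position ids as prepared for Qwen2-VL: (3, batch_size, seq_length)
mrope_position_ids = torch.zeros(3, batch_size, seq_len, dtype=torch.long)
# Plain 0..seq_len-1 text positions broadcast over the batch: (1, batch_size, seq_length)
text_position_ids = torch.arange(seq_len).expand(1, batch_size, seq_len)
# Concatenating the two gives the (4, batch_size, seq_length) layout the model side expects;
# without this step only the 3-row tensor reaches qwen2_vl.py and the ValueError fires.
position_ids = torch.cat([text_position_ids, mrope_position_ids], dim=0)
assert position_ids.shape == (4, batch_size, seq_len)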

HJYao00 avatar Oct 14 '25 12:10 HJYao00

Any progress? I'm hitting the same issue.

HackGiter avatar Nov 01 '25 14:11 HackGiter

https://github.com/volcengine/verl/blob/main/verl/utils/dataset/rl_dataset.py#L317

If verl receives its data from rl_dataset and the rollout is done with AsyncRolloutRequest, would it be enough to modify only the rl_dataset part to fix this issue, rather than changing anything related to vllm or sglang? I looked through the code, and modifying that part doesn't seem particularly difficult.

HackGiter avatar Nov 03 '25 03:11 HackGiter

It seems that AsyncRolloutRequest updates the position_ids and discards prompts['position_ids'].
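If so, the effect would look roughly like this toy sketch (hypothetical, not the AsyncRolloutRequest code): rebuilding position ids from the attention mask alone yields a plain 2D tensor, so any extra text/mrope rows prepared by rl_dataset are lost.

import torch

# Toy attention mask with left padding: (batch_size, seq_length)
attention_mask = torch.tensor([[0, 0, 1, 1, 1]])
# Positions recomputed from the mask form a (batch_size, seq_length) tensor,
# not the (4, batch_size, seq_length) layout the Qwen2-VL path expects.
position_ids = torch.clamp(torch.cumsum(attention_mask, dim=-1) - 1, min=0)
print(position_ids)  # tensor([[0, 0, 0, 1, 2]])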

HackGiter avatar Nov 03 '25 06:11 HackGiter

It seems that AsyncRolloutRequest updates the position_ids and discards prompts['position_ids'].

Yes, AsyncRolloutRequest._get_position_ids should be updated. Alternatively, we can use agent_loop by setting rollout.mode=async.

Claude-Liu avatar Nov 10 '25 08:11 Claude-Liu