
Tokenization Mismatch: tool_calls Cleared During Rollout Causes Inconsistent Training/Inference Tokenization

Open sagtanih opened this issue 4 months ago • 1 comment

System Info

----------Python Info----------
Version      : 3.10.19
Compiler     : GCC 11.2.0
Build        : ('main', 'Oct 21 2025 16:43:05')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 25.2
Directory    : /home/sagtanih/miniconda3/envs/cody-chat-eval/lib/python3.10/site-packages/pip
vllm         : not found.
sglang       : 0.5.4.post1
ray          : not found.
torch        : 2.8.0
----------verl Info-----------
No verl installed: No module named 'ray'
----------Platform Info----------
Platform     : Linux-5.10.0-36-cloud-amd64-x86_64-with-glibc2.31
system       : Linux
node         : sagtanih-gpu
release      : 5.10.0-36-cloud-amd64
version      : #1 SMP Debian 5.10.244-1 (2025-09-29)
----------Environment----------
CUDA Runtime : 12.8
CUDA compiler : Not found: [Errno 2] No such file or directory: 'nvcc'
----------System Info----------
CPU Memory      : 334.40 GB
GPU Count       : 2
GPU 1   Type    : NVIDIA A100-SXM4-80GB
GPU 1   Memory  : 80.00 GB
GPU 2   Type    : NVIDIA A100-SXM4-80GB
GPU 2   Memory  : 80.00 GB

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  • Run the official example script below, modified for Qwen3:
set -x
export HYDRA_FULL_ERROR=1
ulimit -n 65535

PROJECT_DIR="$(pwd)"
CONFIG_PATH="$PROJECT_DIR/examples/sglang_multiturn/config"

python3 -m verl.trainer.main_ppo \
    --config-path="$CONFIG_PATH" \
    --config-name='gsm8k_multiturn_grpo' \
    algorithm.adv_estimator=grpo \
    data.train_batch_size=128 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    data.return_raw_chat=True \
    actor_rollout_ref.model.path=Qwen/Qwen3-4B-Thinking-2507 \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    actor_rollout_ref.rollout.n=16 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","wandb"]' \
    trainer.project_name='gsm8k_async_rl' \
    trainer.experiment_name='qwen2.5-3b_function_rm-gsm8k-async-sgl-multi-w-tool-verify-n16-2cards' \
    trainer.n_gpus_per_node=2 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=20 \
    trainer.total_epochs=15 \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=8192 \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=8192 \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=8192 \
    critic.ppo_max_token_len_per_gpu=8192 \
    critic.forward_max_token_len_per_gpu=8192 \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    actor_rollout_ref.rollout.multi_turn.tool_config_path="$PROJECT_DIR/examples/sglang_multiturn/config/tool_config/gsm8k_tool_config.yaml" \
    actor_rollout_ref.rollout.multi_turn.interaction_config_path="$PROJECT_DIR/examples/sglang_multiturn/config/interaction_config/gsm8k_interaction_config.yaml" \
    actor_rollout_ref.rollout.multi_turn.max_user_turns=1 \
    $@

During multi-turn rollout we get the following warning, as documented in the multi-turn docs:

Inconsistent training and inference tokenization detected. This may lead to unexpected behavior during training. Please review your chat template to determine if this is intentional. For more information, refer to the multiturn README.md.

While debugging, I found that most of the time the diff between full_prompt (reconstructed from the batch) and current_prompt (built incrementally during rollout) looks like this:

-The final answer is 14.<|im_end|>
+The final answer is 14.
+<tool_call>
+{"name": "calc_gsm8k_reward", "arguments": {"answer": "14"}}
+</tool_call><|im_end|>
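The diff above is consistent with the tool_calls field being dropped from the assistant message before the batch reconstruction re-applies the chat template. A minimal, self-contained sketch of that failure mode (using a toy stand-in for the Qwen-style template, not the real Jinja template, and a hypothetical render_assistant helper):

```python
import json

def render_assistant(msg):
    """Toy stand-in for a Qwen-style chat template: renders the assistant
    turn and appends a <tool_call> block for each entry in tool_calls."""
    text = "<|im_start|>assistant\n" + msg["content"]
    for tc in msg.get("tool_calls", []):
        text += "\n<tool_call>\n" + json.dumps(tc) + "\n</tool_call>"
    return text + "<|im_end|>\n"

msg = {
    "content": "The final answer is 14.",
    "tool_calls": [{"name": "calc_gsm8k_reward",
                    "arguments": {"answer": "14"}}],
}

# Incremental rollout path: the assistant message still carries tool_calls.
current_prompt = render_assistant(msg)

# Batch reconstruction path: tool_calls have been cleared from the message,
# so the re-rendered turn is missing the <tool_call> block.
cleared = {k: v for k, v in msg.items() if k != "tool_calls"}
full_prompt = render_assistant(cleared)

print(current_prompt == full_prompt)  # False -> the sanity check fires
```

If this matches what verl does internally, the fix would be to preserve (or consistently clear) tool_calls on both paths so the two renderings tokenize identically.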

Expected behavior

The tokenization sanity check should pass without warnings. The incremental token build (current_prompt) should match the batch reconstruction (full_prompt) when both are decoded.

sagtanih avatar Oct 29 '25 20:10 sagtanih

I have the same issue.

FabianSchuetze avatar Nov 18 '25 10:11 FabianSchuetze