Inconsistent Tokenization Between Training and Inference in Multi-Turn Rollout Training
I am currently using verl for multi-turn interaction RL training and have identified two potential issues.
- There might be a problem with the usage of _req.add_assistant_message in the script verl/workers/sglang_rollout/sglang_rollout.py.
- Lines 855-857 in sglang_rollout.py may also be problematic.
These two issues may be the reason why the tokenization for training and inference is almost always misaligned during multi-turn training. I would like to ask the verl team to investigate this and confirm whether the problems I describe below are valid.
First issue
In the current sglang_rollout code, most calls to add_assistant_message are structured as follows:
_req.add_assistant_message(
    self.processing_class,
    content=content,
    content_ids=content_ids,
)
When content_ids is not empty, input_ids is updated by concatenating the tensors directly. This is problematic: content_ids comes from the SGLang engine, which stops generation at the <|endoftext|> token, whereas proper multi-turn chat tokenization typically appends a newline character after <|endoftext|> (Qwen3's chat template does this, for example). That newline is therefore almost always dropped, so every assistant turn produced during inference is missing a trailing newline token compared to what the chat template would produce.
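To make the mismatch concrete, here is a minimal, hypothetical sketch (the model name and messages are placeholders, not taken from verl) comparing the ids obtained by concatenating engine output onto the prompt ids with the ids obtained by re-tokenizing the full dialogue through the chat template:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [{"role": "user", "content": "Hi"}]
prompt_ids = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=True)

# Pretend these are the ids the engine returned, ending at its stop token.
reply = "Hello!"
gen_ids = tok.encode(reply, add_special_tokens=False) + [tok.eos_token_id]
concatenated = prompt_ids + gen_ids

# Reference: tokenize the whole dialogue through the chat template instead.
reference = tok.apply_chat_template(
    messages + [{"role": "assistant", "content": reply}], tokenize=True
)

# With Qwen-style templates, `reference` ends with an extra newline token after the
# end-of-turn token, which `concatenated` is missing.
print(tok.decode(concatenated[-2:]))
print(tok.decode(reference[-2:]))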
My suggestion is to always use the following code to add the assistant's message to avoid this issue:
_req.add_assistant_message(
    self.processing_class,
    content=content,
)
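For reference, here is a rough sketch of what the content-only path amounts to, under the assumption that it re-tokenizes the new turn via the chat template (an illustration only, not verl's actual implementation):

def assistant_turn_ids(tokenizer, history, content):
    # Token ids contributed by appending one assistant message, obtained by rendering
    # the chat template with and without the new turn and taking the suffix.
    before = tokenizer.apply_chat_template(history, tokenize=True)
    after = tokenizer.apply_chat_template(
        history + [{"role": "assistant", "content": content}], tokenize=True
    )
    # Assumes earlier turns render identically in both calls (true for Qwen-style
    # templates); the suffix includes the trailing newline after the end-of-turn token.
    return after[len(before):]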
Second issue
In lines 855-857, when self.config.skip_tokenizer_init is True (which it is by default), verl automatically sets _req.messages[-1].tool_calls to None.
parsed_tool_calls = _req.messages[-1].tool_calls
if self.config.skip_tokenizer_init:
    _req.messages[-1].tool_calls = None
Setting tool_calls to None here will inevitably lead to a mismatch between the inference and training token lists during the final comparison, because the tool call tokens from the inference phase have been manually erased.
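To illustrate the effect, here is a hypothetical example (model name, tool name, and arguments are placeholders) showing that re-tokenizing the same assistant message with and without its tool_calls yields different token sequences:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

history = [{"role": "user", "content": "What is the weather in Paris?"}]
with_call = {
    "role": "assistant",
    "content": "",
    "tool_calls": [
        {"type": "function", "function": {"name": "get_weather", "arguments": {"city": "Paris"}}}
    ],
}
without_call = {"role": "assistant", "content": "", "tool_calls": None}

ids_with = tok.apply_chat_template(history + [with_call], tokenize=True)
ids_without = tok.apply_chat_template(history + [without_call], tokenize=True)

# The <tool_call>...</tool_call> block is rendered only in the first case, so the
# training-side ids rebuilt after tool_calls is set to None cannot match the ids the
# engine actually generated during inference.
print(len(ids_with), len(ids_without))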
Hi all, for the second issue, do you think setting skip_tokenizer_init to False would mitigate it? Or would that have other negative effects?
same issue.
I made both changes suggested above, and the tokenization inconsistency has been resolved. However, the reward curve during training has not shown any noticeable change.
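For anyone hitting the same problem, a small consistency check along these lines (the names here are placeholders, not verl APIs) can catch the drift early:

def check_rollout_tokenization(tokenizer, messages, rollout_ids):
    # Re-tokenize the accumulated messages with the chat template and compare against
    # the ids the rollout actually accumulated during inference.
    reference = tokenizer.apply_chat_template(messages, tokenize=True)
    if list(reference) != list(rollout_ids):
        first_diff = next(
            (i for i, (a, b) in enumerate(zip(reference, rollout_ids)) if a != b),
            min(len(reference), len(rollout_ids)),
        )
        raise ValueError(f"train/inference token mismatch starting at index {first_diff}")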