Completed rollouts > total_tasks_queued
GPT-5 analysis: Between 04:21 and 04:26 the log repeatedly shows vLLM rejecting requests with "maximum context length is 20480 tokens, but you requested 22k–61k tokens", and every such call returns 400 (see the multiple ValueError traces in AAAA_8mcp_QwenAgent_8B_1203test1.log).

These oversized requests come from the current Hydra config: data.max_prompt_length=12288 plus data.max_response_length=8192, combined with the multi-turn hermes template, system prompt, etc., pushes the actual token count far past 20480. Meanwhile data.truncation=error is set, so nothing is truncated or dropped and the oversized samples are sent straight to vLLM.

After each 400, AgentLightning still records the rollout as "completed", but its reward and triplet are both None (hence the many warnings in the log). The same task is then re-claimed and run again, and both the old failed result and the new successful one can linger in the server's result cache.

AgentModeDaemon._async_run_until_finished never cleans up stale results on the server side; it exits as soon as len(_completed_rollouts) >= _total_tasks_queued. So when an old failed result is fetched back during the next round of sampling, the local counter reaches states like "Completed 33/32 tasks", which ultimately trips the assertion in get_train_data_batch (agentlightning/verl/daemon.py).

Conclusion: the root cause of the assertion is that oversized prompts/responses make vLLM return 400 repeatedly; the old failed rollouts are never cleaned up and get mixed into the next round's tasks, so _completed_rollouts ends up larger than _total_tasks_queued.

Suggestions:
1. Trim at the token level before a request ever reaches vLLM: set data.truncation=truncate (or truncate explicitly in the agent layer) so that prompt_len + response_len <= 20480. If needed, lower max_prompt_length/max_response_length or rework the hermes template to control token usage; see the sketch after this list.
2. Have clear_data_and_server call a server-side cleanup endpoint (or add one) to discard the previous round's leftover _completed_rollouts; alternatively, filter the surplus rollouts before get_train_data_batch, keeping only the ids present in this round's _task_id_to_original_sample, so the counts cannot diverge.
3. As a stopgap, downgrade the assertion to a warning and drop the surplus rollouts; the real fix is still bounding the context length so these failed tasks never arise.
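To make suggestion 1 concrete, here is a minimal sketch of a client-side token budget guard, assuming an HF tokenizer for the actor model and the 20480-token limit from this run. The helper name, constants, and truncation strategy are illustrative assumptions, not AgentLightning or vLLM APIs:

```python
# Sketch: drop the oldest non-system turns until the rendered prompt plus the
# response budget fits the model's context window. This approximates what
# data.truncation=truncate would do, but at the agent layer.
from transformers import AutoTokenizer

MAX_CONTEXT = 20480      # vLLM's max context length in this run
RESPONSE_BUDGET = 8192   # reserve room for data.max_response_length
MODEL_PATH = "path/to/actor-model"  # placeholder for the trained checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

def fit_messages(messages: list[dict]) -> list[dict]:
    """Trim a chat history so prompt_len + RESPONSE_BUDGET <= MAX_CONTEXT."""
    def prompt_len(msgs: list[dict]) -> int:
        # apply_chat_template renders the chat template and tokenizes, so this
        # counts (approximately) the same tokens vLLM will see.
        return len(tokenizer.apply_chat_template(msgs, add_generation_prompt=True))

    msgs = list(messages)
    while prompt_len(msgs) + RESPONSE_BUDGET > MAX_CONTEXT:
        for i, m in enumerate(msgs):
            if m["role"] != "system":
                del msgs[i]  # drop the oldest non-system turn first
                break
        else:
            break  # only the system prompt remains; nothing left to drop
    return msgs
```

Dropping whole turns (rather than cutting tokens mid-message) keeps the chat template well-formed, which matters for multi-turn hermes-style prompts.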
log:
```
(PatchedvLLMServer pid=2403114) ERROR 12-03 04:26:37 [serving_chat.py:222]   File "lib/python3.10/site-packages/vllm/entrypoints/openai/serving_engine.py", line 499, in _normalize_prompt_text_to_input
(PatchedvLLMServer pid=2403114) ERROR 12-03 04:26:37 [serving_chat.py:222]     return self._validate_input(request, input_ids, input_text)
(PatchedvLLMServer pid=2403114) ERROR 12-03 04:26:37 [serving_chat.py:222]   File "lib/python3.10/site-packages/vllm/entrypoints/openai/serving_engine.py", line 563, in _validate_input
(PatchedvLLMServer pid=2403114) ERROR 12-03 04:26:37 [serving_chat.py:222]     raise ValueError(
(PatchedvLLMServer pid=2403114) ERROR 12-03 04:26:37 [serving_chat.py:222] ValueError: This model's maximum context length is 20480 tokens. However, you requested 49088 tokens in the messages, Please reduce the length of the messages.
(TaskRunner pid=2397748) Warning: Reward is None for rollout rollout-ff673b89-b65c-47e6-a77a-125ca8770652, will be auto-set to 0.0.
(TaskRunner pid=2397748) Warning: Triplet is None for rollout rollout-ff673b89-b65c-47e6-a77a-125ca8770652.
(TaskRunner pid=2397748) Completed 33/32 tasks...
(TaskRunner pid=2397748) INFO: 127.0.0.1:47154 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47162 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47172 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47188 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47200 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47210 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47212 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) INFO: 127.0.0.1:47226 - "GET /task HTTP/1.1" 200 OK
(TaskRunner pid=2397748) All tasks finished.

Traceback (most recent call last):
  File "lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "lib/python3.10/site-packages/agentlightning/verl/main.py", line 4, in <module>
```
I think this is a known issue. Remembering which rollout IDs have been sent and rejecting unregistered rollouts might be a workaround. In the long run, we need to somehow cancel those stale rollouts.
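A minimal sketch of that workaround, assuming a registry layered over the daemon's bookkeeping; the class and method names below are hypothetical, not the actual agentlightning/verl/daemon.py API:

```python
# Sketch: remember which rollout IDs were handed out this round and reject
# completions for anything else, so stale results from a previous round can
# no longer inflate the completed count past total_tasks_queued.
class RolloutRegistry:
    def __init__(self) -> None:
        self._sent_ids: set[str] = set()
        self._completed: dict[str, object] = {}

    def register(self, rollout_id: str) -> None:
        """Record an ID when its task is queued for the current round."""
        self._sent_ids.add(rollout_id)

    def accept(self, rollout_id: str, result: object) -> bool:
        """Store a completion only if it belongs to the current round."""
        if rollout_id not in self._sent_ids:
            return False  # stale rollout from an earlier round: drop it
        self._completed[rollout_id] = result
        return True

    def finished(self, total_tasks_queued: int) -> bool:
        # Rejected stale rollouts never enter _completed, so this count can
        # no longer exceed total_tasks_queued ("Completed 33/32 tasks").
        return len(self._completed) >= total_tasks_queued

    def reset(self) -> None:
        """Clear both sets between rounds, e.g. from clear_data_and_server."""
        self._sent_ids.clear()
        self._completed.clear()
```

Rejection at accept time covers the short-term workaround; actually cancelling the stale rollouts would additionally stop wasted generation on the server side.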