
[Over-long prompt/response length] cannot reshape tensor of 0 elements into shape [1, 0, -1, 128] because the unspecified dimension size -1 can be any value and is ambiguous

Open IsaacGHX opened this issue 4 months ago • 10 comments

When there is a vLLM serving error, such as exceeding the maximum context length of a tiny LLM, the output is empty; when that trace flows back through the agent client, the error below happens.

🖇 AgentOps: [OPENAI WRAPPER] Error in chat_completion_stream_wrapper: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 4608 tokens. However, you requested 6859 tokens (6347 in the messages, 512 in the completion). Please reduce the length of the messages or completion. None", 'type': 'BadRequestError', 'param': None, 'code': 400}

E.g., when using the custom agent in the customized calc_x example, this issue happens at the end of the first batch of the multiprocess task.

 
  File "/workspace/workspace/agent-lightning/.venv/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 154, in forward
    query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
RuntimeError: cannot reshape tensor of 0 elements into shape [1, 0, -1, 128] because the unspecified dimension size -1 can be any value and is ambiguous

...
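
For reference, the zero-element reshape can be reproduced in isolation. A minimal sketch assuming PyTorch (the hidden size here is only illustrative): when the collected rollout is empty, the hidden states have sequence length 0, and view(..., -1, head_dim) cannot infer the -1 dimension from a tensor with 0 elements.

import torch

hidden_states = torch.zeros(1, 0, 3584)           # batch=1, seq_len=0: an empty trajectory
hidden_shape = (1, 0, -1, 128)                     # same target shape as in the error above
query_states = hidden_states.view(hidden_shape)    # RuntimeError: cannot reshape tensor of 0 elements ...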

IsaacGHX avatar Aug 13 '25 13:08 IsaacGHX

What's your expected behavior here? The server should not return 400 if the prompt is too long?

ultmaster avatar Aug 13 '25 16:08 ultmaster

It would be greatly appreciated if there were an automated process to handle empty or erroneous outputs so they do not cause tensor shape issues.

IsaacGHX avatar Aug 13 '25 17:08 IsaacGHX

That said, after clamping the prompt length and strictly controlling the LLM's output length, this issue can be avoided.
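
For what it's worth, a minimal sketch of the clamping I mean, assuming a Hugging Face tokenizer and an OpenAI-compatible client pointed at the serving endpoint (the limits, model name, and URL are placeholders for illustration, not values from agent-lightning):

from openai import OpenAI
from transformers import AutoTokenizer

MAX_CONTEXT = 4608       # model's maximum context length
MAX_COMPLETION = 512     # budget reserved for the response
MODEL = "Qwen/Qwen2.5-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
client = OpenAI(base_url="http://localhost:9999/v1", api_key="EMPTY")

def clamp_prompt(prompt: str) -> str:
    """Keep only as many prompt tokens as the context window allows."""
    budget = MAX_CONTEXT - MAX_COMPLETION
    ids = tokenizer.encode(prompt)
    if len(ids) > budget:
        ids = ids[-budget:]              # keep the most recent tokens
    return tokenizer.decode(ids)

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": clamp_prompt("very long prompt ...")}],
    max_tokens=MAX_COMPLETION,
)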

IsaacGHX avatar Aug 13 '25 17:08 IsaacGHX

More information.

If the agent is a chain of tools with multi-turn responses, this issue happens more frequently. I think the key issue is in the async execution and waiting time around asyncio.Lock() or other locks, such as a file lock in the async rollout (whether in training or validation).
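
To illustrate the pattern I am referring to, a hedged sketch of serializing access to a shared resource (e.g. a trace file) inside an async rollout with asyncio.Lock; the function names are hypothetical, not agent-lightning APIs:

import asyncio

trace_lock = asyncio.Lock()

async def write_trace(path: str, line: str) -> None:
    # Only one rollout writes the shared file at a time.
    async with trace_lock:
        with open(path, "a") as f:
            f.write(line + "\n")

async def rollout_one(task_id: int) -> None:
    # ... call the LLM / tools here ...
    await write_trace("rollouts.log", f"task {task_id} finished")

async def main() -> None:
    await asyncio.gather(*(rollout_one(i) for i in range(8)))

asyncio.run(main())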

I tested different parameters when serving the agent:

  • Device config: 8 × A100 80GB, 240 CPU cores.
  • Training script: batch size 8, micro-batch size per GPU 1, rollout number 6–8, data truncation: truncate.

If the agent is served with 20 workers, the rollout suffers further omissions, with maybe only 2–3 of the 6 rollouts left for each task. If the worker count shrinks to 10, the empty-response issue decreases.

Furthermore, if multiple agents interact during the agent-lightning rollout, the overall rollout time is much longer than in the calc_x, spider, or rag examples. Which max timeout should I extend to further avoid the problem?

Potential Solution

Where can I add a retry attempt so that, if there is an empty response, the prompt and task are resent?

As a result, it can ensure there won't be any omissions in each batch when updating the model's gradients.
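
Something like the following is what I have in mind; just a sketch, where call_llm is a hypothetical callable wrapping the chat completion request, not an agent-lightning API:

import time
from typing import Callable, Optional

import openai

def call_with_retry(call_llm: Callable[[str], str], prompt: str, max_attempts: int = 3) -> Optional[str]:
    for attempt in range(max_attempts):
        try:
            text = call_llm(prompt)
        except openai.BadRequestError:
            # An over-long prompt is rejected deterministically, so retrying
            # the identical request cannot succeed; give up immediately.
            return None
        if text:                          # non-empty response: done
            return text
        time.sleep(2 ** attempt)          # back off before resending the prompt
    return None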

IsaacGHX avatar Aug 26 '25 07:08 IsaacGHX

Hi @IsaacGHX, can you provide your script for launching the agentlightning training server, e.g. train.sh for the calc_x example?

Issue analysis

According to the error you reported:

maximum context length is 4608 tokens. However, you requested 6859 tokens (6347 in the messages, 512 in the completion)

This means that you limit the max context length (data.max_prompt_length + data.max_response_length, which you define when launching the server) to 4608 tokens. Yet during training, a rollout sample has 6347 tokens in the prompt and requests a response of up to 512 tokens. Since the prompt alone is already longer than the max context length, the training server, which wraps a vLLM server, will directly reject this request. With the error returned, the collected trajectory contains an empty prompt, response, and reward.

Retrying the request is meaningless, as the request will always be rejected.
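
To make the arithmetic concrete (numbers taken from the error above; the check is deterministic, which is why retrying the identical request cannot succeed):

max_context = 4608          # data.max_prompt_length + data.max_response_length
prompt_tokens = 6347        # tokens already in the messages
completion_tokens = 512     # requested completion budget

requested = prompt_tokens + completion_tokens   # 6859 tokens
assert requested > max_context                  # the request exceeds the context window
assert prompt_tokens > max_context              # even with 0 completion tokens it cannot fit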

A quick solution

According to the above analysis, you can directly resolve this issue by increasing the two config options: data.max_prompt_length and data.max_response_length.

Pros

This is simple and intuitive, as even small modern models like Qwen/Qwen2.5-1.5B-Instruct have a context length of up to 128k tokens.

Cons

The extended context length demands a higher memory footprint -> more memory -> more GPUs.

An automatic solution

I made PR https://github.com/microsoft/agent-lightning/pull/59, where I provide a simple solution that directly filters out bad requests.
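
Conceptually, the filtering looks something like the sketch below; this is only an illustration of the idea, not the actual code in the PR, and the Trajectory fields are hypothetical:

from dataclasses import dataclass

@dataclass
class Trajectory:
    prompt_ids: list[int]
    response_ids: list[int]
    reward: float

def filter_bad_trajectories(batch: list[Trajectory]) -> list[Trajectory]:
    # Drop samples whose prompt or response came back empty, so no zero-length
    # sequence reaches log-prob computation or advantage estimation.
    kept = [t for t in batch if t.prompt_ids and t.response_ids]
    dropped = len(batch) - len(kept)
    if dropped:
        print(f"Filtered out {dropped} empty trajectories from a batch of {len(batch)}")
    return kept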

Could you please try this branch and/or provide some valuable feedback regarding this PR solution?

zxgx avatar Aug 26 '25 09:08 zxgx

Yes, I can give the detailed config for my training with the agent. It's a really minimal setting for RL training that can't even be used in a real implementation. The GPUs are A100 80GB.

# config.yaml
env:
  OPENAI_API_KEY: "OPENAI_API_KEY"
  CUDA_VISIBLE_DEVICES: '0'
  HYDRA_FULL_ERROR: 1
  N_GPUS: 1
  BASE_MODEL: 'Qwen/Qwen2.5-7B-Instruct'
  ROLLOUT_TP_SIZE: 1
  EXPERIMENT_NAME: 'general_7B'
  PROJECT_NAME: 'AgentLightning_general'
  BASE_DATA_DIR: 'data/'
  VERBOSITY: 'DEBUG'
  N_WORKERS: 2
  ENABLE_TOOLS: ["..."]
  TOOL_ENGINE: ["..."]
  TOOL_STEPS: 3
  TEST_TEMPERATURE: 0.0
  TRAIN_TEMPERATURE: 0.7
  OCTO_OUTPUT_TYPE: "final,direct"
  AGENT_MAX_TIMEOUT: 500

python_args:
  agentlightning.port: 9999
  algorithm.adv_estimator: 'grpo'
  data.train_files: '${BASE_DATA_DIR}/combined_train.parquet'
  data.val_files: '${BASE_DATA_DIR}/aime24.parquet'
  actor_rollout_ref.rollout.tensor_model_parallel_size: '${ROLLOUT_TP_SIZE}'
  trainer.n_gpus_per_node: '${N_GPUS}'
  data.train_batch_size: 2
  actor_rollout_ref.rollout.n: 6
  actor_rollout_ref.actor.ppo_mini_batch_size: 2
  actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu: 1
  actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu: 1
  actor_rollout_ref.rollout.multi_turn.format: 'hermes'
  actor_rollout_ref.model.path: '${BASE_MODEL}'
  data.max_prompt_length: 8192
  data.max_response_length: 2048
  data.truncation: 'truncate'
  trainer.val_before_train: False
  actor_rollout_ref.actor.optim.lr: 1e-6
  actor_rollout_ref.model.use_remove_padding: True
  actor_rollout_ref.actor.use_kl_loss: False
  actor_rollout_ref.actor.kl_loss_coef: 0.000
  actor_rollout_ref.actor.entropy_coeff: 0.01
  actor_rollout_ref.actor.clip_ratio_low: 0.2
  actor_rollout_ref.actor.clip_ratio_high: 0.3
  actor_rollout_ref.model.enable_gradient_checkpointing: True
  actor_rollout_ref.actor.fsdp_config.param_offload: False
  actor_rollout_ref.actor.fsdp_config.optimizer_offload: False
  actor_rollout_ref.rollout.name: 'vllm'
  actor_rollout_ref.rollout.gpu_memory_utilization: 0.6
  actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu: 1
  actor_rollout_ref.ref.fsdp_config.param_offload: False
  algorithm.use_kl_in_reward: False
  trainer.critic_warmup: 0
  trainer.logger: ['console','wandb']
  trainer.project_name: '${PROJECT_NAME}'
  trainer.experiment_name: '${EXPERIMENT_NAME}'
  trainer.nnodes: 1
  trainer.save_freq: 2
  trainer.test_freq: 8
  trainer.total_epochs: 5

and the issue is no longer the length-exceeding error, but instead:

(TaskRunner pid=2585611) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.actor_rollout_compute_log_prob() (pid=2586206, ip=10.19.143.96, actor_id=d2efe5280204b9cfcb0c5cef0c000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7bada3f1d950>)
(TaskRunner pid=2585611)   File "/home/ubuntu/.local/share/uv/python/cpython-3.11.13-linux-x86_64-gnu/lib/python3.11/concurrent/futures/_base.py", line 449, in result
(TaskRunner pid=2585611)     return self.__get_result()
(TaskRunner pid=2585611)            ^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=2585611)   File "/home/ubuntu/.local/share/uv/python/cpython-3.11.13-linux-x86_64-gnu/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
(TaskRunner pid=2585611)     raise self._exception

...
    query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)

and the training-time GPU memory allocation is 60–71 GB out of 80 GB.

IsaacGHX avatar Aug 27 '25 14:08 IsaacGHX

Many thanks for your PR solution, which can somewhat mitigate the issue of the training process terminating, but it cannot fundamentally solve the problem that no answer can be obtained from the agent.

IsaacGHX avatar Aug 27 '25 14:08 IsaacGHX

Thanks for the feedback. I notice that you set data.truncation: truncate in the config. Does this run use PR #59 or the current main branch? Besides, the stack trace is missing all the useful details. It would be easier to identify the error if you could paste the full stack trace, as reported in https://github.com/microsoft/agent-lightning/pull/59#issue-3326064014

From your response, I summarize your expected fix below:

  1. When the current prompt length > data.max_prompt_length, truncate the prompt.
  2. When the current response length > data.max_response_length, stop generation to save training time.
  3. For any bad request, keep a valid trajectory in the training batch to ensure correct advantage estimation (a sketch follows below).
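
For point 3, one possible shape of the fix (only a hedged sketch with hypothetical names, not what PR #59 implements, and its interaction with GRPO's group-wise advantage normalization still needs to be verified):

import torch

def pad_bad_sample(response_ids: torch.Tensor, max_len: int, pad_id: int):
    """Return (padded_ids, loss_mask); an empty response becomes all padding with a zero mask."""
    padded = torch.full((max_len,), pad_id, dtype=torch.long)
    mask = torch.zeros(max_len, dtype=torch.bool)
    n = min(response_ids.numel(), max_len)
    if n > 0:
        padded[:n] = response_ids[:n]
        mask[:n] = True
    # A bad request keeps valid tensor shapes in the batch but contributes
    # nothing to the policy loss because its mask is all zeros.
    return padded, mask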

Feel free to correct the fix target.

I'm working on PR https://github.com/microsoft/agent-lightning/pull/59 for the truncation capability, as this is currently wrapped by the agentlightning server -> verl trainer -> vllm server. I'll fix this issue ASAP, and I'd appreciate any further advice for improving the robustness of agentlightning.

zxgx avatar Aug 27 '25 15:08 zxgx

I got the exact same error with the rag_agent example, but it worked after I modified it according to this PR.

Kwen-Chen avatar Sep 07 '25 11:09 Kwen-Chen