[Rollout timeout] Lost rollout while training

Open XianglongTan opened this issue 2 months ago • 4 comments

The Error traceback:

File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/agentlightning/verl/entrypoint.py", line 152, in run
    trainer.fit()
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/agentlightning/verl/trainer.py", line 318, in fit
    metrics = self._train_step(batch_dict)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/agentlightning/verl/trainer.py", line 95, in _train_step
    batch, agent_metrics = self.agent_mode_daemon.get_train_data_batch(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/agentlightning/verl/daemon.py", line 379, in get_train_data_batch
    original_sample = self._task_id_to_original_sample[rollout_id]
                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^

training log

(Process-11615 agentlightning.server) Requeuing task rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284 after timeout (attempt 1)
(Process-11615 agentlightning.server) Task rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284 timed out after 600.0s, requeued (attempt 1)
(Process-11615 agentlightning.server) Task rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284 re-claimed (attempt 2)
(Process-11615 agentlightning.server) Rollout received and stored: rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284

agent log

[Task 10133 Received] ID: rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284

[Task 10190 Received] ID: rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284

2025-10-27 02:44:52,426 [INFO] (Process-1116 __main__)   [Rollout rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284] Message length details:
2025-10-27 02:44:52,426 [INFO] (Process-1116 __main__)     Message 0: 2633 characters
2025-10-27 02:44:52,426 [INFO] (Process-1116 __main__)     Message 1: 3002 characters
2025-10-27 02:44:52,426 [INFO] (Process-1116 __main__)     Message 2: 176 characters
2025-10-27 02:44:52,426 [INFO] (Process-1116 __main__)     Message 3: 3013 characters
2025-10-27 02:44:52,426 [INFO] (Process-1116 __main__)     Message 4: 323 characters
2025-10-27 02:44:52,426 [INFO] (Process-1116 __main__)     Message 5: 4113 characters
2025-10-27 02:44:52,426 [INFO] (Process-1116 __main__)     Total: 6 messages, 13260 characters

(Process-1116 agentlightning.runner)   [Worker 3 | Rollout rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284] Completed in 25.88s. Triplet length: 4. Reward: 0.0

2025-10-27 02:59:33,022 [INFO] (Process-1113 __main__)   [Rollout rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284] Message length details:
2025-10-27 02:59:33,022 [INFO] (Process-1113 __main__)     Message 0: 2633 characters
2025-10-27 02:59:33,022 [INFO] (Process-1113 __main__)     Message 1: 4985 characters
2025-10-27 02:59:33,022 [INFO] (Process-1113 __main__)     Message 2: 265 characters
2025-10-27 02:59:33,022 [INFO] (Process-1113 __main__)     Message 3: 3013 characters
2025-10-27 02:59:33,022 [INFO] (Process-1113 __main__)     Message 4: 412 characters
2025-10-27 02:59:33,022 [INFO] (Process-1113 __main__)     Message 5: 4444 characters
2025-10-27 02:59:33,022 [INFO] (Process-1113 __main__)     Total: 6 messages, 15752 characters

(Process-1113 agentlightning.runner)   [Worker 0 | Rollout rollout-85c3e463-cf45-4dae-a765-c5bf6cc59284] Completed in 1505.21s. Triplet length: 4. Reward: 0.0
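The two completions above can be lined up against the 600 s timeout with a little arithmetic. This is only a rough sketch: it assumes each "Message length details" timestamp is close to the moment that attempt finished, which the logs do not confirm, and the helper name `approx_start` is made up for illustration.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S,%f"  # timestamp format used in the agent log


def approx_start(logged_at: str, duration_s: float) -> datetime:
    """Estimate when an attempt started from its log time and reported duration."""
    return datetime.strptime(logged_at, FMT) - timedelta(seconds=duration_s)


# Values copied from the agent log above.
slow_start = approx_start("2025-10-27 02:59:33,022", 1505.21)  # Worker 0
fast_start = approx_start("2025-10-27 02:44:52,426", 25.88)    # Worker 3

# The server log reports a timeout after 600.0s, followed by a requeue.
timeout_deadline = slow_start + timedelta(seconds=600.0)

print("slow attempt started  ~", slow_start.time())
print("timeout deadline      ~", timeout_deadline.time())
print("fast attempt started  ~", fast_start.time())
```

If that assumption holds, the attempt on Worker 0 started first, hit the 600 s deadline, and the requeued attempt on Worker 3 began within a couple of seconds of that deadline and finished quickly. The original attempt then kept running and finished about 15 minutes later under the same rollout ID.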

My guess is that the server raises the timeout error because the agent takes too long to finish the task. I suggest that when a rollout times out, it should simply be ignored.
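A minimal sketch of the "just ignore it" idea, assuming the lookup in the traceback fails because the rollout ID is no longer in the mapping (the traceback is cut off before the exception line, so this is a guess). The function `match_rollouts` and its arguments are hypothetical stand-ins, not the agentlightning API.

```python
from typing import Any, Dict, Iterable, List, Tuple


def match_rollouts(
    completed: Iterable[Tuple[str, Any]],
    task_id_to_sample: Dict[str, Any],
) -> Tuple[List[Tuple[Any, Any]], List[str]]:
    """Pair completed rollouts with their original samples.

    Rollouts whose ID is unknown, or that arrive a second time (for example a
    late result from a timed-out attempt), are skipped instead of raising.
    """
    matched: List[Tuple[Any, Any]] = []
    skipped: List[str] = []
    seen = set()
    for rollout_id, payload in completed:
        if rollout_id not in task_id_to_sample or rollout_id in seen:
            skipped.append(rollout_id)  # ignore instead of failing the step
            continue
        seen.add(rollout_id)
        matched.append((task_id_to_sample[rollout_id], payload))
    return matched, skipped


# Hypothetical data: "r1" arrives twice, "r9" was never registered.
samples = {"r1": {"question": "q1"}, "r2": {"question": "q2"}}
results = [("r1", "first"), ("r9", "stale"), ("r1", "late"), ("r2", "ok")]
matched, skipped = match_rollouts(results, samples)
```

Here the first result for each known ID is kept and everything else is reported in `skipped`, so a training step could log the dropped rollouts rather than abort.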

BTW, is there a WeChat group or a RedNote (Xiaohongshu) group?

XianglongTan avatar Oct 27 '25 02:10 XianglongTan

In v0.2, there is a RolloutConfig controlling that behavior.

You can join the Discord group, which is linked on the front page of this project.

ultmaster avatar Oct 27 '25 08:10 ultmaster

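Could I set up an unofficial WeChat or Xiaohongshu (RedNote) group? I'm not very used to Discord. I posted an article discussing Agent Lightning on Xiaohongshu that has over 2K views and more than 500 likes and saves, and I'd like people to be able to discuss it more conveniently on platforms in China.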

XianglongTan avatar Oct 28 '25 09:10 XianglongTan

I'll ask the team if anyone is willing to maintain an official group. Maintaining and refreshing an invitation QR code for a WeChat group would take more effort than a Discord group, and it is less accessible to non-Chinese users.

Thanks a lot for your efforts in promoting Agent-lightning to a broader community. Please feel free to initiate any unofficial discussion group you feel passionate about.

ultmaster avatar Oct 28 '25 15:10 ultmaster

#236 All set up!

XufangLuo avatar Oct 29 '25 01:10 XufangLuo