
verl framework is unusable

Open linzi687 opened this issue 1 month ago • 7 comments

The verl framework in agent-lightning cannot train. Is some configuration missing?

linzi687 avatar Nov 11 '25 15:11 linzi687

Please provide more details.

ultmaster avatar Nov 12 '25 10:11 ultmaster

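The text2sql task can't train. Below is Claude's analysis of the cause and its final conclusion.

Problem location

The vLLM service layer is completely normal; the problem is in the task scheduling layer:

  1. ✅ Model loaded (GPU 1: 29.4GB)
  2. ✅ vLLM service running normally (inference succeeds)
  3. ✅ Tasks enqueued (5 rollout tasks)
  4. ✅ 10 AgentLoopWorker processes exist
  5. ❌ TaskRunner cannot dispatch tasks to the Workers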

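Cause

The TaskRunner -> AgentLoopWorker task dispatch mechanism in the Agent-Lightning framework has a bug.

Symptoms:

  • TaskRunner keeps sending POST /wait_for_rollouts requests
  • AgentLoopWorker processes sit idle (CPU 12% but no actual work)
  • No agent code is invoked (the rollout method in sql_agent.py is never executed)
  • No database access, SQL query generation, or similar operations occur

This is not an LLM inference problem but an architectural problem in the task queue/dispatch system.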


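💡 Suggestions

This is a known issue or a configuration incompatibility in the Agent-Lightning framework. Suggestions:

  1. Report an issue at https://github.com/microsoft/agent-lightning
  2. Or try a different version/branch of the framework
  3. Or wait for a framework update that fixes this scheduling bug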

linzi687 avatar Nov 12 '25 15:11 linzi687

Please provide logs and commands you have run.

ultmaster avatar Nov 12 '25 15:11 ultmaster

2025-11-13 11:58:20,317 [ERROR] (Process-1990274 agentlightning.execution.client_server) Runner 7 crashed; signaling stop event
Traceback (most recent call last):
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentlightning/execution/client_server.py", line 190, in _execute_runner
    await runner(client_store, worker_id, stop_evt)
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentlightning/trainer/trainer.py", line 539, in _runner_bundle
    runner_instance.init_worker(worker_id, store)
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentlightning/runner/agent.py", line 108, in init_worker
    self._tracer.init_worker(worker_id)
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentlightning/tracer/agentops.py", line 131, in init_worker
    agentops.init()  # type: ignore
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentops/__init__.py", line 173, in init
    return client.init(**init_kwargs)
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentops/client/client.py", line 105, in init
    response = self.api.v3.fetch_auth_token(self.config.api_key)
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentops/client/api/versions/v3.py", line 43, in fetch_auth_token
    raise ApiServerException(error_msg)
agentops.exceptions.ApiServerException: Authentication failed: 502

linzi687 avatar Nov 13 '25 06:11 linzi687

Traceback (most recent call last):
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentlightning/execution/client_server.py", line 386, in execute
    asyncio.run(self._execute_algorithm(algorithm, store, stop_evt))
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentlightning/execution/client_server.py", line 154, in _execute_algorithm
    await algorithm(wrapper_store, stop_evt)
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentlightning/trainer/trainer.py", line 513, in _algorithm_bundle
    algorithm.run(
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentlightning/algorithm/verl/interface.py", line 136, in run
    run_ppo(
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentlightning/verl/entrypoint.py", line 50, in run_ppo
    ray.get(
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/ray/_private/worker.py", line 2961, in get
    values, debugger_breakpoint = worker.get_objects(
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/ray/_private/worker.py", line 1028, in get_objects
    raise value
ray.exceptions.ActorUnavailableError: The actor eed8230635b99f3c816caef401000000 is unavailable: The actor is temporarily unavailable: RpcError: RPC Error message: Socket closed; RPC Error details: rpc_code: 14. The task may or may not have been executed on the actor.
2025-11-12 22:48:42,040 [INFO] (Process-1676398 agentlightning.execution.client_server) Shutting down subprocesses
2025-11-12 22:48:42,041 [WARNING] (Process-1676398 agentlightning.execution.client_server) Subprocesses ended abnormally during shutdown: Subprocesses failed: runner-0 (exitcode=1), runner-1 (exitcode=1), runner-2 (exitcode=1), runner-3 (exitcode=1), runner-4 (exitcode=1), runner-5 (exitcode=1), runner-6 (exitcode=1), runner-7 (exitcode=1), runner-8 (exitcode=1), runner-9 (exitcode=1)
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
[state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 0.01ms, total = 0.01ms, Queueing time: mean = 0.01ms, max = 0.01ms, min = 0.01ms, total = 0.01ms
[state-dump] CoreWorkerService.grpc_client.ActorCallArgWaitComplete.OnReplyReceived - 1 total (0 active), Execution time: mean = 0.01ms, total = 0.01ms, Queueing time: mean = 0.04ms, max = 0.04ms, min = 0.04ms, total = 0.04ms
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeAddressAndLiveness - 1 total (0 active), Execution time: mean = 0.69ms, total = 0.69ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] NodeManagerService.grpc_server.CommitBundleResources.HandleRequestImpl - 1 total (0 active), Execution time: mean = 0.16ms, total = 0.16ms, Queueing time: mean = 0.01ms, max = 0.01ms, min = 0.01ms, total = 0.01ms
[state-dump] NodeManagerService.grpc_server.PinObjectIDs.HandleRequestImpl - 1 total (0 active), Execution time: mean = 0.56ms, total = 0.56ms, Queueing time: mean = 0.22ms, max = 0.22ms, min = 0.22ms, total = 0.22ms
[state-dump] NodeManagerService.grpc_server.ReturnWorkerLease.HandleRequestImpl - 1 total (0 active), Execution time: mean = 0.15ms, total = 0.15ms, Queueing time: mean = 0.01ms, max = 0.01ms, min = 0.01ms, total = 0.01ms
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 1.13ms, total = 1.13ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] NodeManagerService.grpc_server.PrepareBundleResources - 1 total (0 active), Execution time: mean = 0.39ms, total = 0.39ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] ray::rpc::JobInfoGcsService.grpc_client.AddJob.OnReplyReceived - 1 total (0 active), Execution time: mean = 0.05ms, total = 0.05ms, Queueing time: mean = 0.02ms, max = 0.02ms, min = 0.02ms, total = 0.02ms
[state-dump] NodeManagerService.grpc_server.PinObjectIDs - 1 total (0 active), Execution time: mean = 1.16ms, total = 1.16ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] CoreWorkerService.grpc_client.ActorCallArgWaitComplete - 1 total (0 active), Execution time: mean = 1.21ms, total = 1.21ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeAddressAndLiveness.OnReplyReceived - 1 total (0 active), Execution time: mean = 0.12ms, total = 0.12ms, Queueing time: mean = 0.01ms, max = 0.01ms, min = 0.01ms, total = 0.01ms
[state-dump] NodeManagerService.grpc_server.CommitBundleResources - 1 total (0 active), Execution time: mean = 0.32ms, total = 0.32ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), Execution time: mean = 0.62ms, total = 0.62ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 1.01ms, total = 1.01ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 38.89ms, total = 38.89ms, Queueing time: mean = 0.01ms, max = 0.01ms, min = 0.01ms, total = 0.01ms
[state-dump] CoreWorkerService.grpc_client.UpdateObjectLocationBatch - 1 total (0 active), Execution time: mean = 4.47ms, total = 4.47ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] DebugString() time ms: 1
[state-dump]
[state-dump]

Traceback (most recent call last):
  File "/root/lindsay/agent-lightning/examples/spider/train_sql_agent.py", line 203, in <module>
  File "/root/lindsay/agent-lightning/examples/spider/train_sql_agent.py", line 199, in main
    if __name__ == "__main__":
  File "/root/lindsay/agent-lightning/examples/spider/train_sql_agent.py", line 167, in train
    def main() -> None:
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentlightning/trainer/trainer.py", line 424, in fit
    self.strategy.execute(algorithm_bundle, runner_bundle, self.store)
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentlightning/execution/client_server.py", line 386, in execute
    asyncio.run(self._execute_algorithm(algorithm, store, stop_evt))
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentlightning/execution/client_server.py", line 154, in _execute_algorithm
    await algorithm(wrapper_store, stop_evt)
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentlightning/trainer/trainer.py", line 513, in _algorithm_bundle
    algorithm.run(
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentlightning/algorithm/verl/interface.py", line 136, in run
    run_ppo(
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/agentlightning/verl/entrypoint.py", line 50, in run_ppo
    ray.get(
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/ray/_private/worker.py", line 2961, in get
    values, debugger_breakpoint = worker.get_objects(
  File "/root/miniconda3/envs/financial_text2sql/lib/python3.10/site-packages/ray/_private/worker.py", line 1028, in get_objects
    raise value
ray.exceptions.ActorUnavailableError: The actor eed8230635b99f3c816caef401000000 is unavailable: The actor is temporarily unavailable: RpcError: RPC Error message: Socket closed; RPC Error details: rpc_code: 14. The task may or may not have been executed on the actor.

linzi687 avatar Nov 13 '25 07:11 linzi687

@linzi687 This is not an Agent Lightning problem; it is a connection problem. The error is agentops.exceptions.ApiServerException: Authentication failed: 502. You do not seem to be handling the authentication properly, and

unavailable: RpcError: RPC Error message: Socket closed; RPC Error details: rpc_code: 14. The task may or may not have been executed on the actor.

This also supports my assessment.
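
For context on the 502 in that error: assuming it is the HTTP status code returned by the AgentOps server, 502 is "Bad Gateway", which is normally reported by a proxy or gateway in front of the service and is distinct from a credentials rejection (401 or 403). A minimal sketch using only the Python standard library; the function name describe_status and its category labels are illustrative and not part of agentops or Agent Lightning:

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Label an HTTP status code as a credentials problem or a server-side one."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    if code in (401, 403):
        kind = "credentials rejected"
    elif 500 <= code < 600:
        kind = "server-side or gateway failure"
    else:
        kind = "other"
    return f"{code} {status.phrase}: {kind}"


if __name__ == "__main__":
    print(describe_status(502))
```

Under that assumption, a 502 points toward the network path or the remote service (for example a proxy between the machine and the API) rather than toward an invalid key.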

JasonHonKL avatar Nov 22 '25 16:11 JasonHonKL

instrument_agentops() might have failed because agentlightning.instrumentation.agentops fails to import. Please verify whether you can import that module.
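
One way to run that check is a short script that attempts the import and prints the full traceback on failure. This is only a sketch: the helper name check_import is hypothetical, and the only name taken from this thread is the module path agentlightning.instrumentation.agentops. Run it with the same Python environment used for training.

```python
import importlib
import traceback


def check_import(module_name: str) -> bool:
    """Try to import a module; print the full traceback if it fails."""
    try:
        importlib.import_module(module_name)
    except Exception:
        # Catch Exception rather than only ImportError: a module can also
        # fail at import time with other errors from its top-level code.
        traceback.print_exc()
        return False
    return True


if __name__ == "__main__":
    ok = check_import("agentlightning.instrumentation.agentops")
    print("import ok" if ok else "import failed")
```

If the import fails, the printed traceback should show which dependency or statement is responsible.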

ultmaster avatar Nov 29 '25 13:11 ultmaster