FATE-LLM
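Hello, when I run ChatGLM with DeepSpeed I get the following error: "resource request gpu count 2 is too big". How can I resolve it?
The error below says the GPU request is too large, but I actually have two GPUs. Also, I submit the job with a Python command rather than from Jupyter.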
[ERROR] [2023-10-07 22:58:01,892] [202310072257522498440] [22816:139678211446592] - [deepspeed_utils._run] [line:67]: failed to call CommandURI(_uri=v1/cluster-manager/job/submitJob) to xxx.xxx.xxx.xxx:4670: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INTERNAL
	details = "xxx.xxx.xxx.xxx:4670: com.webank.eggroll.core.error.ErSessionException: resource request gpu count 2 is too big
	at com.webank.eggroll.core.deepspeed.job.JobServiceHandler$.handleDeepspeedSubmit(JobServiceHandler.scala:237)
	at com.webank.eggroll.core.deepspeed.job.JobServiceHandler$.handleSubmit(JobServiceHandler.scala:226)
	at com.webank.eggroll.core.resourcemanager.ClusterManagerBootstrap$$anonfun$init$1.apply(ClusterManagerBootstrap.scala:131)
	at com.webank.eggroll.core.resourcemanager.ClusterManagerBootstrap$$anonfun$init$1.apply(ClusterManagerBootstrap.scala:131)
	at com.webank.eggroll.core.command.CommandRouter$.dispatch(CommandRouter.scala:139)
	at com.webank.eggroll.core.command.CommandService.com$webank$eggroll$core$command$CommandService$$run$body$1(CommandService.scala:47)
	at com.webank.eggroll.core.command.CommandService$$anonfun$1.run(CommandService.scala:41)
	at com.webank.eggroll.core.grpc.server.GrpcServerWrapper.wrapGrpcServerRunnable(GrpcServerWrapper.java:43)
	at com.webank.eggroll.core.command.CommandService.call(CommandService.scala:41)
	at com.webank.eggroll.core.command.CommandServiceGrpc$MethodHandlers.invoke(CommandServiceGrpc.java:257)
	at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:346)
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:860)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
"
	debug_error_string = "{"created":"@1696690681.818528053","description":"Error received from peer ipv4:xxx.xxx.xxx.xxx:4670","file":"src/core/lib/surface/call.cc","file_line":952,"grpc_message":"xxx.xxx.xxx.xxx:4670: com.webank.eggroll.core.error.ErSessionException: resource request gpu count 2 is too big\n\tat com.webank.eggroll.core.deepspeed.job.JobServiceHandler$.handleDeepspeedSubmit(JobServiceHandler.scala:237)\n\tat com.webank.eggroll.core.deepspeed.job.JobServiceHandler$.handleSubmit(JobServiceHandler.scala:226)\n\tat com.webank.eggroll.core.resourcemanager.ClusterManagerBootstrap$$anonfun$init$1.apply(ClusterManagerBootstrap.scala:131)\n\tat com.webank.eggroll.core.resourcemanager.ClusterManagerBootstrap$$anonfun$init$1.apply(ClusterManagerBootstrap.scala:131)\n\tat com.webank.eggroll.core.command.CommandRouter$.dispatch(CommandRouter.scala:139)\n\tat
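The exception is thrown on the Eggroll cluster-manager side (JobServiceHandler.handleDeepspeedSubmit) while it handles the submitJob call, so one thing worth ruling out first is that the host itself reports fewer GPUs than the 2 being requested. The following is only a minimal sketch for that check, assuming nvidia-smi is installed and on PATH. The helper names count_gpus and visible_gpu_count are hypothetical and are not part of FATE-LLM or Eggroll. It shows what the NVIDIA driver reports on the machine where it is run, not what Eggroll has registered as available resources.

```python
import shutil
import subprocess


def count_gpus(nvidia_smi_output: str) -> int:
    """Count GPUs in the output of
    `nvidia-smi --query-gpu=index --format=csv,noheader` (one line per GPU)."""
    return sum(1 for line in nvidia_smi_output.splitlines() if line.strip())


def visible_gpu_count() -> int:
    """Return how many GPUs the NVIDIA driver reports on this host.

    Returns 0 if nvidia-smi is missing or exits with an error.
    """
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return 0
    return count_gpus(result.stdout)


if __name__ == "__main__":
    requested = 2  # the GPU count named in the error message
    found = visible_gpu_count()
    print(f"requested={requested} visible={found}")
```

If this prints fewer than 2 visible GPUs on the node that is supposed to run the job, the problem is below Eggroll (driver or container visibility). If it prints 2, the mismatch is more likely in how the cluster's resources are registered or allocated, which this sketch cannot inspect.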
You can add me on WeChat (forgive_dengkai) so we can go over the details.