KubeFATE
<_InactiveRpcError of RPC that terminated with: status = StatusCode.INTERNAL
OS: ubuntu-18.04.6
Memory: 16G + 128G swap
CPUs: 8
Disk: 500G
Docker: 20.10.17
Docker-Compose: 2.10.2
FATE: 1.9.0
I followed the steps in https://github.com/FederatedAI/KubeFATE/blob/master/docker-deploy/README_zh.md. When I reached the "Verify Deployment" step, the Host machine reported the following error:
{
"jobId": "202209100304586271510",
"retcode": 103,
"retmsg": "Traceback (most recent call last):
File \"/data/projects/fate/fateflow/python/fate_flow/scheduler/dag_scheduler.py\", line 142, in submit
raise Exception(\"create job failed\", response)
Exception: ('create job failed', {'guest': {10000: {'data': {'components': {'secure_add_example_0': {'need_run': True}}}, 'retcode': 0, 'retmsg': 'success'}}, 'host': {9999: {'retcode': <RetCode.FEDERATED_ERROR: 104>, 'retmsg': 'Federated schedule error, <_InactiveRpcError of RPC that terminated with:\
\\tstatus = StatusCode.INTERNAL\
\\tdetails = \"\
[Roll Site Error TransInfo] \
location msg=INTERNAL: HTTP/2 error code: INTERNAL_ERROR\
Received Goaway\
Could not initialize class io.grpc.InternalServer \
stack info=io.grpc.StatusRuntimeException: INTERNAL: HTTP/2 error code: INTERNAL_ERROR\
Received Goaway\
Could not initialize class io.grpc.InternalServer\
\\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:240)\
\\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:221)\
\\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:140)\
\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)\
\\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)\
\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\
\\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)\
\\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\
\\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\
\\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\
\\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\
\\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\
\\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)\
\\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\
\\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\
\\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\
\\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\
\\tat java.lang.Thread.run(Thread.java:750)\
\
\
exception trans path: rollsite(10000)\"\
\\tdebug_error_string = \"{\"created\":\"@1662779104.941951106\",\"description\":\"Error received from peer ipv4:192.167.0.4:9370\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":952,\"grpc_message\":\"\\\
[Roll Site Error TransInfo] \\\
location msg=INTERNAL: HTTP/2 error code: INTERNAL_ERROR\\\
Received Goaway\\\
Could not initialize class io.grpc.InternalServer \\\
stack info=io.grpc.StatusRuntimeException: INTERNAL: HTTP/2 error code: INTERNAL_ERROR\\\
Received Goaway\\\
Could not initialize class io.grpc.InternalServer\\\
\\\\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:240)\\\
\\\\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:221)\\\
\\\\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:140)\\\
\\\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)\\\
\\\\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)\\\
\\\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\\\
\\\\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)\\\
\\\\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\\\
\\\\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\\\
\\\\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\\\
\\\\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\\\
\\\\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\\\
\\\\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)\\\
\\\\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\\\
\\\\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\\\
\\\\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\\\
\\\\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\\\
\\\\tat java.lang.Thread.run(Thread.java:750)\\\
\\\
\\\
exception trans path: rollsite(10000)\",\"grpc_status\":13}\"\
>'}}})
"
}
parties.conf
Did you run the following commands?
docker exec -it confs-10000_client_1 bash
flow test toy --guest-party-id 10000 --host-party-id 9999
In the notebook container on party 10000, which is 192.168.191.128 right?
I cannot reproduce this issue in my environment; my job succeeded.
yes, the party 10000's ip is 192.168.191.128.
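One quick sanity check worth doing here is to verify from the other party's machine that this party's rollsite port is actually reachable. The sketch below is hypothetical: 192.168.191.128 is party 10000's IP from this thread, and 9370 is the rollsite port that appears in the error's debug string (`peer ipv4:192.167.0.4:9370`); substitute your own values.

```shell
# Sketch: probe the other party's rollsite port from this machine.
# IP and port are taken from this thread's setup; adjust to your deployment.
nc -zv -w 3 192.168.191.128 9370 \
  && echo "rollsite port reachable" \
  || echo "rollsite port unreachable"
```

If this reports the port as unreachable, the problem is in the network path (firewall, routing, Docker port mapping) rather than in FATE itself.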
What factors may have caused this to happen?
I confirmed with the eggroll folks; this could be a network issue. Maybe you can try disabling firewalld on your VM.
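A cautious way to try this is to stop the firewall only if it is actually running. This sketch assumes a systemd-based host; note that on Ubuntu 18.04 the firewall service is usually `ufw` rather than `firewalld`, so check which one applies to your machine.

```shell
# Sketch: stop and disable firewalld only if it is active (assumes systemd).
# On Ubuntu 18.04 the relevant service may be "ufw" instead of "firewalld".
if systemctl is-active --quiet firewalld 2>/dev/null; then
  sudo systemctl stop firewalld
  sudo systemctl disable firewalld
  echo "firewalld stopped and disabled"
else
  echo "firewalld is not running on this machine"
fi
```

Remember to re-enable the firewall (or open only the required ports, e.g. rollsite's 9370) once you have confirmed whether it was the cause.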
I would also suggest using K8s to deploy a 3-party federation, with a FATE exchange as the center.
I also hit this error when deploying the 3-party federation.
The FATE files on my two client machines are in /data/data/projects/fate/ (possibly because /data/projects/ already existed on my machines, so FATE created the data folder under /data).
I suggest you do some cleanup on the 3 machines and then retry:
cd /data/projects/fate/
cd confs-9999
docker-compose down
cd ../serving-9999
docker-compose down
docker volume rm $(docker volume ls -q | grep 9999) && docker network rm confs-9999_fate-network
Do the above steps for all of your party IDs; 9999 here is just an example.
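The per-party cleanup above can be sketched as a loop. This is written as a dry run that only prints the commands; drop the `echo` prefixes to actually execute them. The paths and party IDs are the ones from this thread, so adjust them to your own layout.

```shell
# Dry-run sketch of the cleanup steps, repeated for every party ID.
# Remove the "echo" prefixes to actually run the commands.
for party in 9999 10000; do
  echo "cd /data/projects/fate/confs-${party} && docker-compose down"
  echo "cd /data/projects/fate/serving-${party} && docker-compose down"
  echo "docker volume rm \$(docker volume ls -q | grep ${party})"
  echo "docker network rm confs-${party}_fate-network"
done
```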
Then reinstall the 3 parties.
Did you end up solving this? I ran into the same problem.
No, I never solved it. In the end I redeployed on CentOS-7 and it just worked...
My problem seems to be different from yours; mine is a missing method. Deploying FATE 1.9 with docker-compose, one party's single-sided test succeeded while the other party's failed with: java.lang.NoSuchMethodError: com.google.protobuf.GeneratedMessageV3.isStringEmpty(Ljava/lang/Object;)Z\n\tat com.webank.ai.eggroll.api.networking.proxy.Proxy$Task.getSerializedSize
https://github.com/FederatedAI/KubeFATE/issues/766