<_InactiveRpcError of RPC that terminated with: status = StatusCode.INTERNAL

Open SnakeCN21 opened this issue 2 years ago • 11 comments

OS: ubuntu-18.04.6
Memory: 16G + 128G swap
CPUs: 8
Disk: 500G
Docker: 20.10.17
Docker-Compose: 2.10.2
FATE: 1.9.0

I followed the guide at https://github.com/FederatedAI/KubeFATE/blob/master/docker-deploy/README_zh.md. When I reached the <Verify the deployment> step, the Host machine reported the following error:

{
    "jobId": "202209100304586271510",
    "retcode": 103,
    "retmsg": "Traceback (most recent call last):
      File \"/data/projects/fate/fateflow/python/fate_flow/scheduler/dag_scheduler.py\", line 142, in submit
        raise Exception(\"create job failed\", response)
        Exception: ('create job failed', {'guest': {10000: {'data': {'components': {'secure_add_example_0': {'need_run': True}}}, 'retcode': 0, 'retmsg': 'success'}}, 'host': {9999: {'retcode': <RetCode.FEDERATED_ERROR: 104>, 'retmsg': 'Federated schedule error, <_InactiveRpcError of RPC that terminated with:\
    \\tstatus = StatusCode.INTERNAL\
    \\tdetails = \"\
    [Roll Site Error TransInfo] \
     location msg=INTERNAL: HTTP/2 error code: INTERNAL_ERROR\
     Received Goaway\
    Could not initialize class io.grpc.InternalServer \
     stack info=io.grpc.StatusRuntimeException: INTERNAL: HTTP/2 error code: INTERNAL_ERROR\
     Received Goaway\
    Could not initialize class io.grpc.InternalServer\
    \\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:240)\
    \\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:221)\
    \\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:140)\
    \\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)\
    \\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)\
    \\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\
    \\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)\
    \\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\
    \\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\
    \\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\
    \\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\
    \\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\
    \\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)\
    \\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\
    \\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\
    \\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\
    \\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\
    \\tat java.lang.Thread.run(Thread.java:750)\
     \
     \
    exception trans path: rollsite(10000)\"\
    \\tdebug_error_string = \"{\"created\":\"@1662779104.941951106\",\"description\":\"Error received from peer ipv4:192.167.0.4:9370\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":952,\"grpc_message\":\"\\\
    [Roll Site Error TransInfo] \\\
     location msg=INTERNAL: HTTP/2 error code: INTERNAL_ERROR\\\
     Received Goaway\\\
    Could not initialize class io.grpc.InternalServer \\\
     stack info=io.grpc.StatusRuntimeException: INTERNAL: HTTP/2 error code: INTERNAL_ERROR\\\
     Received Goaway\\\
    Could not initialize class io.grpc.InternalServer\\\
    \\\\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:240)\\\
    \\\\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:221)\\\
    \\\\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:140)\\\
    \\\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)\\\
    \\\\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)\\\
    \\\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\\\
    \\\\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)\\\
    \\\\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\\\
    \\\\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\\\
    \\\\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\\\
    \\\\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\\\
    \\\\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\\\
    \\\\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)\\\
    \\\\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\\\
    \\\\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\\\
    \\\\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\\\
    \\\\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\\\
    \\\\tat java.lang.Thread.run(Thread.java:750)\\\
     \\\
     \\\
    exception trans path: rollsite(10000)\",\"grpc_status\":13}\"\
    >'}}})
    "
}

[image: error]

SnakeCN21, Sep 11 '22 11:09

parties.conf: [image: parties conf]

SnakeCN21, Sep 11 '22 11:09

You ran the following commands:

docker exec -it confs-10000_client_1 bash
flow test toy --guest-party-id 10000 --host-party-id 9999

In the notebook container on party 10000, which is 192.168.191.128, right?

JingChen23, Sep 13 '22 03:09

I cannot reproduce this issue in my environment; my job succeeded.

JingChen23, Sep 13 '22 03:09

In the notebook container on party 10000, which is 192.168.191.128, right?

Yes, party 10000's IP is 192.168.191.128.

SnakeCN21, Sep 13 '22 09:09

I cannot reproduce this issue in my environment; my job succeeded.

What factors may have caused this to happen?

SnakeCN21, Sep 13 '22 09:09

I cannot reproduce this issue in my environment; my job succeeded.

What factors may have caused this to happen?

I checked with the eggroll folks: this could be a network issue. Maybe you can try disabling firewalld on your VM.

I would also suggest using K8s to deploy a 3-party federation, with FATE exchange as the center.
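
Before changing firewall settings, it may help to confirm whether the rollsite port is reachable at all from the other machine. The following is only a minimal sketch, assuming rollsite listens on port 9370 (the port that appears in the traceback) and that `bash` and the coreutils `timeout` command are installed. `probe` is a hypothetical helper name, not part of FATE or KubeFATE.

```shell
#!/bin/sh
# probe HOST PORT: report whether a TCP connection to HOST:PORT can be opened.
# Relies on bash's /dev/tcp pseudo-device, so bash is invoked explicitly.
probe() {
  host="$1"
  port="$2"
  if timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
    echo "reachable $host:$port"
  else
    echo "unreachable $host:$port"
  fi
}

# Example (run on the host-party machine to probe party 10000's rollsite):
# probe 192.168.191.128 9370
```

If the port turns out to be unreachable, stopping the firewall is one thing to try, for example `sudo systemctl stop firewalld` on machines that actually run firewalld. Ubuntu machines typically use `ufw` instead, so `sudo ufw status` may be the more relevant check there.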

JingChen23, Sep 14 '22 03:09

I also hit this error when deploying the 3-party federation. The FATE files on my two client machines are in the folder /data/data/projects/fate/ (this may be because /data/projects/ already existed on my machine, so FATE created the data folder inside data). [image]

cuhkluobo, Sep 14 '22 06:09

I think you should do some cleanup on the 3 machines and then retry.

cd /data/projects/fate/
cd confs-9999
docker-compose down
cd ../serving-9999
docker-compose down
docker volume rm $(docker volume ls -q | grep 9999) && docker network rm confs-9999_fate-network

Repeat the steps above for each of your party IDs; 9999 here is just an example.
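
The same steps could be wrapped in a small loop over party IDs. This is a rough sketch rather than an official KubeFATE script: `cleanup_party`, `run`, `FATE_DIR`, and `DRY_RUN` are names made up for illustration, and the directory layout is assumed to match the commands above. It only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Clean up KubeFATE docker-compose deployments for one or more party IDs.
# Dry-run by default (commands are only printed); set DRY_RUN=0 to execute.
FATE_DIR="${FATE_DIR:-/data/projects/fate}"
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# cleanup_party ID: stop the containers, then remove volumes and the network.
cleanup_party() {
  id="$1"
  run sh -c "cd '$FATE_DIR/confs-$id' && docker-compose down"
  run sh -c "cd '$FATE_DIR/serving-$id' && docker-compose down"
  # xargs -r (GNU) skips 'docker volume rm' when no volume name matches.
  run sh -c "docker volume ls -q | grep '$id' | xargs -r docker volume rm"
  run docker network rm "confs-${id}_fate-network"
}

# Party IDs come from the command line, e.g.: sh cleanup.sh 9999 10000
for id in "$@"; do
  cleanup_party "$id"
done
```

Each machine normally holds only its own party's directories, so the script would be run on each machine with that machine's party ID. If the files ended up somewhere else, such as /data/data/projects/fate/, `FATE_DIR` can be overridden on the command line.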

Then reinstall the 3 parties.

JingChen23, Sep 19 '22 02:09

Has the original poster solved this? I ran into the same problem.

HelloJeremy, Sep 24 '22 08:09

Not solved. In the end I redeployed on CentOS-7 and it worked right away....

SnakeCN21, Sep 27 '22 07:09

Not solved. In the end I redeployed on CentOS-7 and it worked right away....

My problem seems to be different from yours: mine is a method-not-found error. I deployed FATE 1.9 with docker-compose; the single-party test succeeds on one party and fails on the other, with this error: java.lang.NoSuchMethodError: com.google.protobuf.GeneratedMessageV3.isStringEmpty(Ljava/lang/Object;)Z\n\tat com.webank.ai.eggroll.api.networking.proxy.Proxy$Task.getSerializedSize

https://github.com/FederatedAI/KubeFATE/issues/766

HelloJeremy, Sep 27 '22 08:09