MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNAVAILABLE
Hi everyone. I am using MNIST with PyTorch on OpenFL. Training starts, but after one round I get this error on the envoy:
METRIC Round 0, collaborator env_one is sending metric for task aggregated_model_validate: acc 6.595200 collaborator.py:402
[21:12:43] INFO Response code: StatusCode.UNAVAILABLE aggregator_client.py:57
INFO Attempting to connect to aggregator at localhost:59575
While on the director:
METRIC Round 0: saved the best model with score 6.595200 aggregator.py:816
[21:12:42] INFO Saving round 1 model... aggregator.py:850
./start_director.sh: line 4: 31695 Killed
And in the notebook:
_MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1644268363.773899766","description":"Error received from peer ipv4:127.0.0.1:50051","file":"src/core/lib/surface/call.cc","file_line":1063,"grpc_message":"Socket closed","grpc_status":14}"
>
I am using OpenFL via SSH on a cluster. The strange thing is that it works with a simple network (2 conv layers and 2 FC layers), but with a larger network it does not.
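For context on what that notebook traceback means on the client side: StatusCode.UNAVAILABLE with "Socket closed" is what grpcio raises when the server end of the channel goes away mid-call, which is consistent with the Director process being killed. A minimal sketch of how a gRPC client can catch and retry this condition (the function and stub names here are illustrative, not OpenFL's actual aggregator_client API):

```python
import grpc

def call_with_retry(stub_call, request, retries=3):
    """Invoke a unary gRPC call, retrying when the server socket closes.

    `stub_call` and `request` are placeholders for whatever generated stub
    method and protobuf message the client uses.
    """
    for attempt in range(retries):
        try:
            return stub_call(request)
        except grpc.RpcError as err:
            # StatusCode.UNAVAILABLE / "Socket closed" is what the notebook
            # and envoy report when the peer process dies mid-experiment.
            if err.code() == grpc.StatusCode.UNAVAILABLE:
                print(f"Peer unreachable (attempt {attempt + 1}): {err.details()}")
                continue
            raise
    raise RuntimeError("Peer stayed unreachable; check whether the Director process was killed.")
```

If the Director really was killed, retrying will not help, which is why the envoy keeps logging "Attempting to connect to aggregator" without recovering.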
@CasellaJr could you please double-check that the Director was not killed by the OS, for example because there were not enough resources? In that case the OS just kills the Director, and that is why you are seeing a connection issue.
How can I double-check?
Moreover, I was running another, simpler experiment that I know works, and I have seen that the experiment finishes successfully, as stated by the director: [08:39:26] INFO Experiment "tinyimagenet_test_experiment" was finished successfully.
However, the envoy reports the same error:
[08:42:29] ERROR Failed to get experiment: <_MultiThreadedRendezvous of RPC that terminated with: envoy.py:65
status = StatusCode.UNAVAILABLE
details = "Socket closed"
@CasellaJr I also ran into this problem. I have found that when Milvus runs out of memory, the OS kills the process.
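To confirm that the OOM killer is what terminated the Director, the kernel log usually records it. A minimal sketch, assuming a Linux host where `dmesg` is readable without elevated privileges (on some systems you may need sudo, or `journalctl -k` instead); the helper name is made up:

```python
import subprocess

def find_oom_kills():
    """Scan the kernel ring buffer for OOM-killer entries (Linux only)."""
    # -T prints human-readable timestamps; this may require root if
    # kernel.dmesg_restrict=1 on your distribution.
    log = subprocess.run(
        ["dmesg", "-T"], capture_output=True, text=True, check=True
    ).stdout
    return [
        line for line in log.splitlines()
        if "Out of memory" in line or "oom-kill" in line or "Killed process" in line
    ]

if __name__ == "__main__":
    for entry in find_oom_kills():
        print(entry)
```

If the OOM killer was responsible, the PID from "./start_director.sh: line 4: 31695 Killed" should appear in one of these entries.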