MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNAVAILABLE

Open CasellaJr opened this issue 3 years ago • 3 comments

Hi everyone. I am using MNIST with PyTorch on OpenFL. Training starts, but after one round I get this error on the envoy:

METRIC   Round 0, collaborator env_one is sending metric for task aggregated_model_validate: acc 6.595200                         collaborator.py:402
[21:12:43] INFO     Response code: StatusCode.UNAVAILABLE                                                                                aggregator_client.py:57
           INFO     Attempting to connect to aggregator at localhost:59575

While on the director:

METRIC   Round 0: saved the best model with score 6.595200                                                                          aggregator.py:816
[21:12:42] INFO     Saving round 1 model...                                                                                                    aggregator.py:850
./start_director.sh: line 4: 31695 Killed

And in the notebook:

_MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "{"created":"@1644268363.773899766","description":"Error received from peer ipv4:127.0.0.1:50051","file":"src/core/lib/surface/call.cc","file_line":1063,"grpc_message":"Socket closed","grpc_status":14}"
>

I am using OpenFL via ssh on a cluster. The strange thing is that it works with a simple network (2 conv layers and 2 fc layers), but with a larger network it does not.

CasellaJr avatar Feb 07 '22 21:02 CasellaJr

@CasellaJr could you please double-check that the director was not killed by the OS? For example, if there are not enough resources for it, the OS simply kills the Director process, which would explain the connection error you are seeing.
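One way to check this on Linux is to look for an OOM-killer report in the kernel log. The sketch below is only an illustration under stated assumptions: the helper names `find_oom_kills` and `read_kernel_log` are hypothetical, the "Killed process <pid> (<name>)" wording varies between kernel versions, and reading `dmesg` may require elevated privileges on some systems.

```python
import re
import subprocess

# Assumed wording of the kernel's OOM-killer report; it differs between
# kernel versions, so adjust the pattern if nothing matches.
_KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return (pid, process_name) pairs for each kill report found in log_text."""
    kills = []
    for line in log_text.splitlines():
        match = _KILLED.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


def read_kernel_log():
    """Return the output of `dmesg`, or an empty string if it cannot be run."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return ""
    return result.stdout
```

Calling `find_oom_kills(read_kernel_log())` and looking for the PID printed next to "Killed" in the director's shell output (31695 in the log above) would indicate whether the kernel ended the process for lack of memory.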

alexey-gruzdev avatar Feb 08 '22 06:02 alexey-gruzdev

How can I double-check? Also, I ran another, simpler experiment that I know works, and it finished successfully according to the director: [08:39:26] INFO Experiment "tinyimagenet_test_experiment" was finished successfully. However, the envoy reports the same error:

[08:42:29] ERROR    Failed to get experiment: <_MultiThreadedRendezvous of RPC that terminated with:                                                 envoy.py:65
                            status = StatusCode.UNAVAILABLE
                            details = "Socket closed"

CasellaJr avatar Feb 08 '22 08:02 CasellaJr

@CasellaJr I also ran into this problem. I found that when Milvus runs out of memory, the OS kills the process.
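To see whether a process is approaching the memory limit before it gets killed, its resident memory can be sampled while it runs. This is a minimal, Linux-only sketch: it reads the `VmRSS` field from `/proc/<pid>/status`, and the function names `parse_vmrss_kib` and `read_rss_kib` are hypothetical helpers, not part of OpenFL or Milvus.

```python
from typing import Optional


def parse_vmrss_kib(status_text: str) -> Optional[int]:
    """Extract the resident set size (in kB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:    12345 kB"; the number is field 1.
            return int(line.split()[1])
    return None


def read_rss_kib(pid: int) -> Optional[int]:
    """Return the current resident memory of pid in kB, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as status_file:
            return parse_vmrss_kib(status_file.read())
    except OSError:
        # The process has exited, or /proc is not available on this system.
        return None
```

Polling `read_rss_kib(pid)` every few seconds for the director's PID during model saving would show whether memory grows sharply at the point where the process disappears.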

HoHoHUB avatar May 11 '22 12:05 HoHoHUB