FedML: Can client_num_in_total only be set to 4 in FedGraphNN's link prediction?

When client_num_in_total is set to 4, training runs normally and the evaluation metrics look reasonable. When I set it to other values (such as 6), the evaluation metrics become very abnormal: Test MAE = 20714260267008.0, mae = 20714260267008.0, rmse = 20714260267008.0, mse = 4.290806749939617e+26, which is obviously unreasonable. The error message is as follows:

`======== FedML (https://fedml.ai) ========
FedML version: 0.7.286
Execution path:/root/miniconda3/lib/python3.8/site-packages/fedml/__init__.py

======== Running Environment ========
OS: Linux-5.4.0-96-generic-x86_64-with-glibc2.17
Hardware: x86_64
Python version: 3.8.10 (default, Jun 4 2021, 15:09:15) [GCC 7.5.0]
PyTorch version: 1.11.0
MPI4py is installed

======== CPU Configuration ========
The CPU usage is : 6%
Available CPU Memory: 332.5 G / 376.05326080322266G

======== GPU Configuration ========
NVIDIA GPU Info: <pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7f74b73674c0>
Available GPU memory: 10.8 G / 11.0G
[]
args.client_id_list = None
args.client_id_list is not None
Epoch = 0, Iter = 1/1: Test score = 3.0601377487182617 Current best = 0
Epoch = 1, Iter = 1/1: Test score = 14.168010711669922 Current best = 0
Epoch = 2, Iter = 1/1: Test score = 222.92408752441406 Current best = 0
Epoch = 3, Iter = 1/1: Test score = 179197.703125 Current best = 0
Epoch = 4, Iter = 1/1: Test score = 82857024290816.0 Current best = 0
Epoch = 0, Iter = 1/1: Test score = inf Current best = 0
[FedML-Server(0) @device-id-0] [Fri, 05 Aug 2022 13:37:13] [ERROR] [mlops_runtime_log.py:34:handle_exception] Uncaught exception
Traceback (most recent call last):
  File "fedml_subgraph_link_prediction.py", line 84, in <module>
    fedml_runner = FedMLRunner(args, device, dataset, model, trainer, aggregator)
  File "/root/miniconda3/lib/python3.8/site-packages/fedml/runner.py", line 40, in __init__
    self.runner = init_runner_func(
  File "/root/miniconda3/lib/python3.8/site-packages/fedml/runner.py", line 56, in _init_simulation_runner
    runner = SimulatorMPI(
  File "/root/miniconda3/lib/python3.8/site-packages/fedml/simulation/simulator.py", line 81, in __init__
    self.simulator = FedML_FedAvg_distributed(
  File "/root/miniconda3/lib/python3.8/site-packages/fedml/simulation/mpi/fedavg/FedAvgAPI.py", line 54, in FedML_FedAvg_distributed
    init_client(
  File "/root/miniconda3/lib/python3.8/site-packages/fedml/simulation/mpi/fedavg/FedAvgAPI.py", line 139, in init_client
    client_manager.run()
  File "/root/miniconda3/lib/python3.8/site-packages/fedml/simulation/mpi/fedavg/FedAvgClientManager.py", line 26, in run
    super().run()
  File "/root/miniconda3/lib/python3.8/site-packages/fedml/core/distributed/fedml_comm_manager.py", line 28, in run
    self.com_manager.handle_receive_message()
  File "/root/miniconda3/lib/python3.8/site-packages/fedml/core/distributed/communication/mpi/com_manager.py", line 100, in handle_receive_message
    self.notify(msg_params)
  File "/root/miniconda3/lib/python3.8/site-packages/fedml/core/distributed/communication/mpi/com_manager.py", line 122, in notify
    observer.receive_message(msg_type, msg_params)
  File "/root/miniconda3/lib/python3.8/site-packages/fedml/core/distributed/fedml_comm_manager.py", line 40, in receive_message
    handler_callback_func(msg_params)
  File "/root/miniconda3/lib/python3.8/site-packages/fedml/simulation/mpi/fedavg/FedAvgClientManager.py", line 64, in handle_message_receive_model_from_server
    self.__train()
  File "/root/miniconda3/lib/python3.8/site-packages/fedml/simulation/mpi/fedavg/FedAvgClientManager.py", line 81, in __train
    weights, local_sample_num = self.trainer.train(self.round_idx)
  File "/root/miniconda3/lib/python3.8/site-packages/fedml/simulation/mpi/fedavg/FedAVGTrainer.py", line 41, in train
    self.trainer.train(self.train_local, self.device, self.args)
  File "/root/FedML-master/python/app/fedgraphnn/recsys_subgraph_link_pred/trainer/fed_subgraph_lp_trainer.py", line 61, in train
    test_score, _, _, _, _ = self.test(
  File "/root/FedML-master/python/app/fedgraphnn/recsys_subgraph_link_pred/trainer/fed_subgraph_lp_trainer.py", line 98, in test
    score = metric(link_labels.cpu(), link_logits.cpu())
  File "/root/miniconda3/lib/python3.8/site-packages/sklearn/metrics/_regression.py", line 191, in mean_absolute_error
    y_type, y_true, y_pred, multioutput = _check_reg_targets(
  File "/root/miniconda3/lib/python3.8/site-packages/sklearn/metrics/_regression.py", line 96, in _check_reg_targets
    y_pred = check_array(y_pred, ensure_2d=False, dtype=dtype)
  File "/root/miniconda3/lib/python3.8/site-packages/sklearn/utils/validation.py", line 800, in check_array
    _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
  File "/root/miniconda3/lib/python3.8/site-packages/sklearn/utils/validation.py", line 114, in _assert_all_finite
    raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').`
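
The traceback ends in scikit-learn rejecting non-finite predictions, several epochs after the test score had already started to explode. As a rough sketch only (not FedML or scikit-learn code), a guard like the one below could surface the divergence earlier and with a clearer message. The helper names `first_non_finite` and `safe_mae` are hypothetical and use only the Python standard library.

```python
import math


def first_non_finite(values):
    """Return the index of the first NaN/inf entry, or None if all are finite."""
    for index, value in enumerate(values):
        if not math.isfinite(value):
            return index
    return None


def safe_mae(y_true, y_pred):
    """Mean absolute error that fails early on NaN/inf predictions.

    Hypothetical helper for illustration; it is not part of FedML.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute MAE of empty inputs")
    bad = first_non_finite(y_pred)
    if bad is not None:
        raise ValueError(
            f"prediction at index {bad} is non-finite ({y_pred[bad]}); "
            "training has probably diverged"
        )
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    print(safe_mae([1.0, 2.0], [1.5, 2.5]))
    try:
        safe_mae([1.0, 2.0], [1.5, float("inf")])
    except ValueError as err:
        print("caught:", err)
```

A check of this kind only reports the problem sooner; it does not remove the underlying cause of the divergence.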

Aug 05 '22 06:08 Moonriver1126

Hello,

May I ask what your hyperparameters are? Could you try LR = 0.005? I believe the problem may be caused by a learning rate that is too high, such as 0.01.
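
To illustrate why a smaller step can matter, here is a toy sketch of plain gradient descent on a one-dimensional quadratic loss. It is not the FedGraphNN model: the curvature value of 250 is an assumption chosen purely so that a learning rate of 0.01 falls above the stability limit (2 / curvature = 0.008) and 0.005 falls below it. The function name `run_gradient_descent` is hypothetical.

```python
def run_gradient_descent(lr, curvature=250.0, w0=1.0, steps=20):
    """Minimise loss(w) = 0.5 * curvature * w**2 and return the loss per step.

    Each update multiplies w by (1 - lr * curvature), so the iterates grow
    without bound whenever |1 - lr * curvature| > 1, i.e. lr > 2 / curvature.
    """
    w = w0
    losses = []
    for _ in range(steps):
        grad = curvature * w          # derivative of the quadratic loss
        w -= lr * grad                # plain gradient descent update
        losses.append(0.5 * curvature * w * w)
    return losses


if __name__ == "__main__":
    for lr in (0.01, 0.005):
        losses = run_gradient_descent(lr)
        print(f"lr={lr}: first loss={losses[0]:.4g}, last loss={losses[-1]:.4g}")
```

With these assumed values the loss grows every step at 0.01 and shrinks every step at 0.005, which mirrors the pattern of exploding test scores in the log. The real threshold for the link prediction model depends on its data and architecture and is not known from this thread.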

We're investigating it though.

Thanks!

Aug 11 '22 22:08 emirceyani

Hello,

You are right! Now it works!

Thanks!

Aug 13 '22 14:08 Moonriver1126

Hello, I would like to ask two questions about the partitioning of the ciao dataset.

First: there are 28 item categories in the ciao dataset. If I create 28 clients, does each client correspond to one item category? For example, would the items of category 0, together with the users who interact with them, constitute the local subgraph of client 0?

Second: does the client id correspond to the item category? For example, if the current ciao dataset contains only the 2nd and 10th categories (extracted from the original ciao dataset), does client 0 represent the 2nd category and client 1 represent the 10th category? Is my understanding correct?

Hope to hear back, thanks!

Aug 15 '22 09:08 Moonriver1126

Please refer to #464

Aug 22 '22 05:08 YangLiangwei

Issue has been resolved.

Oct 25 '23 01:10 fedml-dimitris