tutorials icon indicating copy to clipboard operation
tutorials copied to clipboard

federated clients finish with execution code: -9

Open unalakunal opened this issue 1 year ago • 2 comments

Describe the bug While running the breast density federated learning challenge code with 3 clients, I get the following error in 2 of the clients (random clients at each new run) that cause them to terminate: ProcessExecutor - INFO - process finished with execution code: -9

I assume this has something to do with my GPU memory not being enough, since initially I see that almost 100% of GPU memory is used for a short amount of time via nvidia-smi, but afterwards it goes down very quickly right after I see this error in the logs.

For this instance, after I ran ./run_all_fl.sh , site-1 and site-3 are down but site-2 is running. Here are the logs for each of them

site-1:

2022-08-05 18:40:44,434 - MammoLearner - INFO - [run=1]: Finished initializing site-1
2022-08-05 18:40:44,440 - ClientRunner - INFO - [run=1]: client runner started
check status from process listener......
2022-08-05 18:40:49,443 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-08-05 18:40:49,444 - FederatedClient - INFO - Starting to fetch execute task.
2022-08-05 18:40:53,537 - ProcessExecutor - INFO - process finished with execution code: -9

site-2

2022-08-05 18:40:44,466 - MammoLearner - INFO - [run=1]: Finished initializing site-2
2022-08-05 18:40:44,477 - ClientRunner - INFO - [run=1]: client runner started
check status from process listener......
2022-08-05 18:40:49,485 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-08-05 18:40:49,505 - FederatedClient - INFO - Starting to fetch execute task.
2022-08-05 18:40:54,296 - Communicator - INFO - Received from fl_project server  (44761571 Bytes). getTask time: 4.676510572433472 seconds
2022-08-05 18:40:54,299 - FederatedClient - INFO - pull_task completed. Task name:train Status:True 
2022-08-05 18:40:54,321 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: got task assignment: name=train, id=2b383c42-22f9-47ef-90df-d05af1c65c92
2022-08-05 18:40:54,322 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: invoking task executor <class 'nvflare.app_common.executors.learner_executor.LearnerExecutor'>
2022-08-05 18:40:54,323 - LearnerExecutor - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: Client trainer got task: train
2022-08-05 18:40:54,324 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: Client identity: site-2

check status from process listener......
check status from process listener......
check status from process listener......
check status from process listener......
check status from process listener......
check status from process listener......
check status from process listener......
check status from process listener......
2022-08-05 18:43:22,774 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: val_acc_global_model (global_model): 0.2982456140350877
2022-08-05 18:43:22,775 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: Current/Total Round: 1/3
2022-08-05 18:43:22,775 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: Client identity: site-2
2022-08-05 18:43:22,819 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: Local steps per epoch: 9
2022-08-05 18:43:22,820 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: Local epoch site-2: 1/1 (lr=0.002)
2022-08-05 18:43:22,956 - FederatedClient - INFO - Starting to send aux messagee.
2022-08-05 18:43:22,962 - Communicator - INFO - Send AuxMessage to fl_project server
2022-08-05 18:43:25,484 - FederatedClient - INFO - Starting to send aux messagee.
2022-08-05 18:43:25,488 - Communicator - INFO - Send AuxMessage to fl_project server
2022-08-05 18:43:25,789 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: val_acc_local_model: 0.3333
2022-08-05 18:43:25,917 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: Local epochs finished. Returning shareable
2022-08-05 18:43:25,918 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: finished processing task
2022-08-05 18:43:25,921 - FederatedClient - INFO - Starting to push execute result.
2022-08-05 18:43:25,990 - Communicator - INFO - Send submitUpdate to fl_project server
2022-08-05 18:43:26,024 - FederatedClient - INFO - Starting to send aux messagee.
2022-08-05 18:43:26,025 - Communicator - INFO - Send AuxMessage to fl_project server
check status from process listener......
2022-08-05 18:43:33,585 - Communicator - INFO - Received comments: fl_project Received from site-2 (44761704 Bytes, 1659725013 seconds). SubmitUpdate time: 7.595135688781738 seconds
2022-08-05 18:43:33,593 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: result sent to server for task: name=train, id=2b383c42-22f9-47ef-90df-d05af1c65c92
2022-08-05 18:43:38,599 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-08-05 18:43:38,609 - FederatedClient - INFO - Starting to fetch execute task.
2022-08-05 18:43:38,620 - Communicator - INFO - Received from fl_project server  (407 Bytes). getTask time: 0.008861303329467773 seconds
2022-08-05 18:43:38,624 - FederatedClient - INFO - pull_task completed. Task name:__try_again__ Status:True 
2022-08-05 18:43:38,625 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: server asked to try again - will try in 2 secs
check status from process listener......
2022-08-05 18:43:40,627 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-08-05 18:43:40,629 - FederatedClient - INFO - Starting to fetch execute task.
2022-08-05 18:43:40,637 - Communicator - INFO - Received from fl_project server  (407 Bytes). getTask time: 0.006967067718505859 seconds
2022-08-05 18:43:40,640 - FederatedClient - INFO - pull_task completed. Task name:__try_again__ Status:True 
2022-08-05 18:43:40,641 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: server asked to try again - will try in 2 secs
2022-08-05 18:43:42,643 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-08-05 18:43:42,646 - FederatedClient - INFO - Starting to fetch execute task.
2022-08-05 18:43:42,655 - Communicator - INFO - Received from fl_project server  (407 Bytes). getTask time: 0.006787538528442383 seconds
2022-08-05 18:43:42,657 - FederatedClient - INFO - pull_task completed. Task name:__try_again__ Status:True 
2022-08-05 18:43:42,657 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: server asked to try again - will try in 2 secs

site-3:

2022-08-05 18:40:44,376 - MammoLearner - INFO - [run=1]: Finished initializing site-3
2022-08-05 18:40:44,377 - ClientRunner - INFO - [run=1]: client runner started
check status from process listener......
2022-08-05 18:40:49,383 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-08-05 18:40:49,391 - FederatedClient - INFO - Starting to fetch execute task.
2022-08-05 18:40:54,150 - Communicator - INFO - Received from fl_project server  (44761571 Bytes). getTask time: 4.738298416137695 seconds
2022-08-05 18:40:54,155 - FederatedClient - INFO - pull_task completed. Task name:train Status:True 
2022-08-05 18:40:54,174 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: got task assignment: name=train, id=79f80ad7-146c-4947-97c6-1676da6e8a0a
2022-08-05 18:40:54,175 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=79f80ad7-146c-4947-97c6-1676da6e8a0a]: invoking task executor <class 'nvflare.app_common.executors.learner_executor.LearnerExecutor'>
2022-08-05 18:40:54,175 - LearnerExecutor - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=79f80ad7-146c-4947-97c6-1676da6e8a0a]: Client trainer got task: train
2022-08-05 18:40:54,175 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=79f80ad7-146c-4947-97c6-1676da6e8a0a]: Client identity: site-3

check status from process listener......
check status from process listener......
check status from process listener......
check status from process listener......
check status from process listener......
check status from process listener......
check status from process listener......
check status from process listener......
2022-08-05 18:43:19,524 - ProcessExecutor - INFO - process finished with execution code: -9

server:

STARTING 
WORKSPACE set to /code/fl_workspace/server/startup/..
start fl because of no pid.fl
new pid 129
2022-08-05 18:39:17,459 - FederatedServer - INFO - starting secure server at localhost:8002
deployed FL server trainer.
2022-08-05 18:39:17,466 - FedAdminServer - INFO - Starting Admin Server localhost on Port 8003
2022-08-05 18:39:17,466 - root - INFO - Server started
PYTHONPATH is :/code:/code
STARTING TRAINING
WAITING FOR 3 CLIENTS TO CONNECT...
2022-08-05 18:39:45,827 - ClientManager - INFO - Client: New client site-1@[ joined. Sent token: 9b70e875-a2f8-4217-9ca8-5058a1bf02b4.  Total clients: 1
2022-08-05 18:39:45,873 - ClientManager - INFO - Client: New client site-3@[ joined. Sent token: 092b3c9c-dbeb-437e-8f38-750e1756c7c0.  Total clients: 2
2022-08-05 18:39:50,706 - ClientManager - INFO - Client: New client site-2@[ joined. Sent token: f6f4de55-f3fb-4dd5-b602-1643bdf04086.  Total clients: 3
{'status': <APIStatus.SUCCESS: 'SUCCESS'>}
MAKING SURE CLIENTS ARE READY...
RUN TRAINING...
api.check_status(TargetType.SERVER)
{'status': <APIStatus.SUCCESS: 'SUCCESS'>, 'details': {<FLDetailKey.APP_NAME: 'app_name'>: '?', <FLDetailKey.SERVER_ENGINE_STATUS: 'server_engine_status'>: 'stopped', <FLDetailKey.REGISTERED_CLIENTS: 'registered_clients'>: 3, <FLDetailKey.STATUS_TABLE: 'status_table'>: [['CLIENT', 'TOKEN', 'LAST CONNECT TIME'], ['site-1', '9b70e875-a2f8-4217-9ca8-5058a1bf02b4', 'Fri Aug  5 18:39:45 2022'], ['site-3', '092b3c9c-dbeb-437e-8f38-750e1756c7c0', 'Fri Aug  5 18:39:45 2022'], ['site-2', 'f6f4de55-f3fb-4dd5-b602-1643bdf04086', 'Fri Aug  5 18:39:50 2022']]}, 'raw': {'time': '2022-08-05 18:40:30.873024', 'data': [{'type': 'string', 'data': 'FL_app name: ?'}, {'type': 'string', 'data': 'Engine status: stopped'}, {'type': 'string', 'data': 'Run number has not been set.'}, {'type': 'string', 'data': 'Registered clients: 3 '}, {'type': 'table', 'rows': [['CLIENT', 'TOKEN', 'LAST CONNECT TIME'], ['site-1', '9b70e875-a2f8-4217-9ca8-5058a1bf02b4', 'Fri Aug  5 18:39:45 2022'], ['site-3', '092b3c9c-dbeb-437e-8f38-750e1756c7c0', 'Fri Aug  5 18:39:45 2022'], ['site-2', 'f6f4de55-f3fb-4dd5-b602-1643bdf04086', 'Fri Aug  5 18:39:50 2022']]}], 'status': <APIStatus.SUCCESS: 'SUCCESS'>}}
api.set_run_number(1)
{'status': <APIStatus.SUCCESS: 'SUCCESS'>, 'details': {'message': 'Created a new run folder: run_1'}}
api.upload_app("/code/configs/mammo_fedavg")
{'status': <APIStatus.SUCCESS: 'SUCCESS'>, 'details': {'message': 'Created folder /code/fl_workspace/server/startup/../transfer/configs/mammo_fedavg'}}
api.deploy_app("/code/configs/mammo_fedavg", TargetType.ALL)
{'status': <APIStatus.SUCCESS: 'SUCCESS'>, 'details': {'message': 'deployed app "/code/configs/mammo_fedavg" to Server'}, 'raw': {'time': '2022-08-05 18:40:31.413084', 'data': [{'type': 'string', 'data': 'deployed app "/code/configs/mammo_fedavg" to Server'}, {'type': 'table', 'rows': [['CLIENT', 'RESPONSE'], ['site-1', 'OK'], ['site-3', 'OK'], ['site-2', 'OK']]}, {'type': 'success', 'data': ''}], 'status': <APIStatus.SUCCESS: 'SUCCESS'>}}
api.check_status(TargetType.CLIENT)
{'status': <APIStatus.SUCCESS: 'SUCCESS'>, 'details': {'client_statuses': [['CLIENT', 'APP_NAME', 'RUN_NUMBER', 'STATUS'], ['site-1', '/code/configs/mammo_fedavg', '1', 'not started'], ['site-3', '/code/configs/mammo_fedavg', '1', 'not started'], ['site-2', '/code/configs/mammo_fedavg', '1', 'not started']]}, 'raw': {'time': '2022-08-05 18:40:31.924231', 'data': [{'type': 'table', 'rows': [['CLIENT', 'APP_NAME', 'RUN_NUMBER', 'STATUS'], ['site-1', '/code/configs/mammo_fedavg', '1', 'not started'], ['site-3', '/code/configs/mammo_fedavg', '1', 'not started'], ['site-2', '/code/configs/mammo_fedavg', '1', 'not started']]}], 'status': <APIStatus.SUCCESS: 'SUCCESS'>}}
api.start_app(TargetType.ALL)
2022-08-05 18:40:32,834 - matplotlib.font_manager - INFO - generated new fontManager
2022-08-05 18:40:35,307 - ServerRunner - INFO - [run=1]: Server runner starting ...
2022-08-05 18:40:35,308 - ServerRunner - INFO - [run=1]: starting workflow scatter_gather_ctl (<class 'nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather'>) ...
2022-08-05 18:40:35,308 - ScatterAndGather - INFO - [run=1]: Initializing ScatterAndGather workflow.
2022-08-05 18:40:35,310 - ServerRunner - INFO - [run=1]: Workflow scatter_gather_ctl (<class 'nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather'>) started
2022-08-05 18:40:35,310 - ScatterAndGather - INFO - [run=1, wf=scatter_gather_ctl]: Beginning ScatterAndGather training phase.
2022-08-05 18:40:35,310 - ScatterAndGather - INFO - [run=1, wf=scatter_gather_ctl]: Round 0 started.
2022-08-05 18:40:35,311 - ScatterAndGather - INFO - [run=1, wf=scatter_gather_ctl]: scheduled task train
{'status': <APIStatus.SUCCESS: 'SUCCESS'>, 'details': {'message': 'Server app is starting....'}, 'raw': {'time': '2022-08-05 18:40:32.434407', 'data': [{'type': 'string', 'data': 'Server app is starting....'}, {'type': 'table', 'rows': [['CLIENT', 'RESPONSE'], ['site-1', 'Start the client app...'], ['site-3', 'Start the client app...'], ['site-2', 'Start the client app...']]}, {'type': 'success', 'data': ''}], 'status': <APIStatus.SUCCESS: 'SUCCESS'>}}
api.check_status(TargetType.SERVER)
{'status': <APIStatus.SUCCESS: 'SUCCESS'>, 'details': {<FLDetailKey.APP_NAME: 'app_name'>: '/code/configs/mammo_fedavg', <FLDetailKey.SERVER_ENGINE_STATUS: 'server_engine_status'>: 'started', <FLDetailKey.RUN_NUMBER: 'run_number'>: '1', <FLDetailKey.REGISTERED_CLIENTS: 'registered_clients'>: 3, <FLDetailKey.STATUS_TABLE: 'status_table'>: [['CLIENT', 'TOKEN', 'LAST CONNECT TIME'], ['site-1', '9b70e875-a2f8-4217-9ca8-5058a1bf02b4', 'Fri Aug  5 18:39:45 2022'], ['site-3', '092b3c9c-dbeb-437e-8f38-750e1756c7c0', 'Fri Aug  5 18:39:45 2022'], ['site-2', 'f6f4de55-f3fb-4dd5-b602-1643bdf04086', 'Fri Aug  5 18:39:50 2022']]}, 'raw': {'time': '2022-08-05 18:40:37.138542', 'data': [{'type': 'string', 'data': 'FL_app name: /code/configs/mammo_fedavg'}, {'type': 'string', 'data': 'Engine status: started'}, {'type': 'string', 'data': 'Current run number: 1'}, {'type': 'string', 'data': 'Registered clients: 3 '}, {'type': 'table', 'rows': [['CLIENT', 'TOKEN', 'LAST CONNECT TIME'], ['site-1', '9b70e875-a2f8-4217-9ca8-5058a1bf02b4', 'Fri Aug  5 18:39:45 2022'], ['site-3', '092b3c9c-dbeb-437e-8f38-750e1756c7c0', 'Fri Aug  5 18:39:45 2022'], ['site-2', 'f6f4de55-f3fb-4dd5-b602-1643bdf04086', 'Fri Aug  5 18:39:50 2022']]}], 'status': <APIStatus.SUCCESS: 'SUCCESS'>}}
api.wait_until_client_status()
2022-08-05 18:40:49,432 - FederatedServer - INFO - Fetch task requested from client: site-3 (092b3c9c-dbeb-437e-8f38-750e1756c7c0)
2022-08-05 18:40:49,446 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-3, peer_run=1]: got task request from client
2022-08-05 18:40:49,454 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-3, peer_run=1]: assigned task to client: name=train, id=79f80ad7-146c-4947-97c6-1676da6e8a0a
2022-08-05 18:40:49,455 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-3, peer_run=1, task_name=train, task_id=79f80ad7-146c-4947-97c6-1676da6e8a0a]: sent task assignment to client
2022-08-05 18:40:49,537 - FederatedServer - INFO - Return task:train to client:site-3 --- (092b3c9c-dbeb-437e-8f38-750e1756c7c0) 
2022-08-05 18:40:49,554 - FederatedServer - INFO - Fetch task requested from client: site-1 (9b70e875-a2f8-4217-9ca8-5058a1bf02b4)
2022-08-05 18:40:49,579 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-1, peer_run=1]: got task request from client
2022-08-05 18:40:49,579 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-1, peer_run=1]: assigned task to client: name=train, id=62ba320d-91e8-412e-9073-91840f6723a2
2022-08-05 18:40:49,579 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-1, peer_run=1, task_name=train, task_id=62ba320d-91e8-412e-9073-91840f6723a2]: sent task assignment to client
2022-08-05 18:40:49,659 - FederatedServer - INFO - Return task:train to client:site-1 --- (9b70e875-a2f8-4217-9ca8-5058a1bf02b4) 
2022-08-05 18:40:49,702 - FederatedServer - INFO - Fetch task requested from client: site-2 (f6f4de55-f3fb-4dd5-b602-1643bdf04086)
2022-08-05 18:40:49,702 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-2, peer_run=1]: got task request from client
2022-08-05 18:40:49,702 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-2, peer_run=1]: assigned task to client: name=train, id=2b383c42-22f9-47ef-90df-d05af1c65c92
2022-08-05 18:40:49,703 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-2, peer_run=1, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: sent task assignment to client
2022-08-05 18:40:49,759 - FederatedServer - INFO - Return task:train to client:site-2 --- (f6f4de55-f3fb-4dd5-b602-1643bdf04086) 
2022-08-05 18:43:33,269 - FederatedServer - INFO - received update from fl_project_site-2_0 (44761704 Bytes, 1659725013 seconds)
2022-08-05 18:43:33,271 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-2, peer_run=1]: got result from client for task: name=train, id=2b383c42-22f9-47ef-90df-d05af1c65c92
2022-08-05 18:43:33,274 - IntimeModelSelectionHandler - INFO - [run=1, wf=scatter_gather_ctl, peer=site-2, peer_run=1, peer_rc=OK, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: validation metric 0.2982456140350877 from client site-2
2022-08-05 18:43:33,528 - ScatterAndGather - INFO - [run=1, wf=scatter_gather_ctl, peer=site-2, peer_run=1, peer_rc=OK, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: invoking result_received_cb ...
2022-08-05 18:43:33,531 - DXOAggregator - WARNING - [run=1, wf=scatter_gather_ctl, peer=site-2, peer_run=1, peer_rc=OK, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: Aggregation_weight missing for site-2 and set to default value, 1.0 This kind of message will show 10 times at most.
2022-08-05 18:43:33,582 - ScatterAndGather - INFO - [run=1, wf=scatter_gather_ctl, peer=site-2, peer_run=1, peer_rc=OK, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: Contribution from site-2 ACCEPTED by the aggregator.
2022-08-05 18:43:33,583 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-2, peer_run=1, peer_rc=OK, task_name=train, task_id=2b383c42-22f9-47ef-90df-d05af1c65c92]: finished processing client result by scatter_gather_ctl
2022-08-05 18:43:38,617 - FederatedServer - INFO - Fetch task requested from client: site-2 (f6f4de55-f3fb-4dd5-b602-1643bdf04086)
2022-08-05 18:43:38,618 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-2, peer_run=1]: got task request from client
2022-08-05 18:43:38,618 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-2, peer_run=1]: no task currently for client - asked client to try again later
2022-08-05 18:43:38,619 - FederatedServer - INFO - Return task:__try_again__ to client:site-2 --- (f6f4de55-f3fb-4dd5-b602-1643bdf04086) 
2022-08-05 18:43:40,636 - FederatedServer - INFO - Fetch task requested from client: site-2 (f6f4de55-f3fb-4dd5-b602-1643bdf04086)
2022-08-05 18:43:40,636 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-2, peer_run=1]: got task request from client
2022-08-05 18:43:40,636 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-2, peer_run=1]: no task currently for client - asked client to try again later
2022-08-05 18:43:40,636 - FederatedServer - INFO - Return task:__try_again__ to client:site-2 --- (f6f4de55-f3fb-4dd5-b602-1643bdf04086) 
2022-08-05 18:43:42,654 - FederatedServer - INFO - Fetch task requested from client: site-2 (f6f4de55-f3fb-4dd5-b602-1643bdf04086)
2022-08-05 18:43:42,654 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-2, peer_run=1]: got task request from client
2022-08-05 18:43:42,654 - ServerRunner - INFO - [run=1, wf=scatter_gather_ctl, peer=site-2, peer_run=1]: no task currently for client - asked client to try again later

The last 5 lines in site-2 and last 4 lines in server will keep on being logged over and over again until I run docker kill server to stop the training.

I should add that ./run_docker_debug.sh script executes without an issue.

Environment (please complete the following information):

  • OS: Windows 11 WSL 2 Ubuntu 20.04.4 LTS
  • CUDA version: 11.6
  • GPU model: RTX 3070 8 GB

unalakunal avatar Aug 05 '22 19:08 unalakunal

Thanks for raising this issue. If gpu memory is the problem, can you try reducing the batch size here and rebuild the container before running again.

holgerroth avatar Aug 06 '22 15:08 holgerroth

@unalakunal, any updates on this issue?

holgerroth avatar Aug 16 '22 17:08 holgerroth

@holgerroth reducing batch size did not help, I still experience the same issue.

unalakunal avatar Aug 19 '22 05:08 unalakunal

I'm I right that you only have one GPU? If so, did you adjust the GPU indices in run_all_fl.sh? They should be all 0 if just one GPU.

I think what is happening is that one of your clients is not using GPU and therefore is very slow. It might hitting the task timeout causing the FL to stall.

FYI, if I set all gpu indices to 0 and use batch size 1, my GPU memory consumption is only about 3.6GB. So, you should be okay running all three clients on your card.

holgerroth avatar Aug 19 '22 15:08 holgerroth

@unalakunal, any updates on this?

holgerroth avatar Aug 26 '22 19:08 holgerroth

No, unfortunately not. I have already set the indices to 0 and batch size to 1, but it still fails with the same code. Any chance this has something to do with me using nvflare on WSL?

unalakunal avatar Sep 01 '22 15:09 unalakunal

WSL shouldn't be the problem. Others have used it as well https://github.com/Project-MONAI/tutorials/issues/870. @siomvas might be able to confirm.

holgerroth avatar Sep 01 '22 20:09 holgerroth

Maybe it's a general issue with limited computing resources (e.g. limited GPU memory). Can you try running with just one client? These are the steps

  1. Set "min_clients" to 1
  2. Change 3 to 1 on line 5 of start_server.sh
  3. [Build](https://github.com/Project-MONAI/tutorials/tree/main/federated_learning/breast_density_challenge#12-build-container the container
  4. Run ./run_all_fl.sh but comment out or remove the last two lines in the script.

holgerroth avatar Sep 01 '22 20:09 holgerroth

Hmm after checking the issue you mentioned I saw some similarities.

I also had this error in one of the sites OMP: Error #15: Initializing libomp.so, but found unknown library already initialized. and solved it via adding ENV KMP_DUPLICATE_LIB_OK='True' inside Dockerfile and os.environ['KMP_DUPLICATE_LIB_OK']='True' in mammo_learner.py .

After setting the steps you mentioned, I got these logs in site-1:

2022-09-02 09:19:21,083 - MammoLearner - INFO - [run=1]: Finished initializing site-1
2022-09-02 09:19:21,083 - ClientRunner - INFO - [run=1]: client runner started
2022-09-02 09:19:26,089 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-09-02 09:19:26,095 - FederatedClient - INFO - Starting to fetch execute task.
2022-09-02 09:19:29,034 - Communicator - INFO - Received from fl_project server  (44761571 Bytes). getTask time: 2.930644989013672 seconds
2022-09-02 09:19:29,037 - FederatedClient - INFO - pull_task completed. Task name:train Status:True 
2022-09-02 09:19:29,059 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: got task assignment: name=train, id=8436b023-3e35-4b54-83ea-0d89e5a9eefd
2022-09-02 09:19:29,059 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=8436b023-3e35-4b54-83ea-0d89e5a9eefd]: invoking task executor <class 'nvflare.app_common.executors.learner_executor.LearnerExecutor'>
2022-09-02 09:19:29,059 - LearnerExecutor - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=8436b023-3e35-4b54-83ea-0d89e5a9eefd]: Client trainer got task: train
2022-09-02 09:19:29,059 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=8436b023-3e35-4b54-83ea-0d89e5a9eefd]: Client identity: site-1

check status from process listener......
2022-09-02 09:19:38,570 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=8436b023-3e35-4b54-83ea-0d89e5a9eefd]: val_acc_global_model (global_model): 0.21311475409836064
2022-09-02 09:19:38,571 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=8436b023-3e35-4b54-83ea-0d89e5a9eefd]: Current/Total Round: 1/3
2022-09-02 09:19:38,571 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=8436b023-3e35-4b54-83ea-0d89e5a9eefd]: Client identity: site-1
2022-09-02 09:19:38,602 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=8436b023-3e35-4b54-83ea-0d89e5a9eefd]: Local steps per epoch: 559
2022-09-02 09:19:38,603 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=8436b023-3e35-4b54-83ea-0d89e5a9eefd]: Local epoch site-1: 1/1 (lr=0.002)
2022-09-02 09:19:38,651 - FederatedClient - INFO - Starting to send aux messagee.
2022-09-02 09:19:38,656 - Communicator - INFO - Send AuxMessage to fl_project server
check status from process listener......
check status from process listener......
check status from process listener......
2022-09-02 09:20:07,203 - FederatedClient - INFO - Starting to send aux messagee.
2022-09-02 09:20:07,206 - Communicator - INFO - Send AuxMessage to fl_project server
2022-09-02 09:20:08,729 - FederatedClient - INFO - Starting to send aux messagee.
2022-09-02 09:20:08,731 - Communicator - INFO - Send AuxMessage to fl_project server
2022-09-02 09:20:08,858 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=8436b023-3e35-4b54-83ea-0d89e5a9eefd]: val_acc_local_model: 0.3934
2022-09-02 09:20:09,104 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=8436b023-3e35-4b54-83ea-0d89e5a9eefd]: Local epochs finished. Returning shareable
2022-09-02 09:20:09,106 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=8436b023-3e35-4b54-83ea-0d89e5a9eefd]: finished processing task
2022-09-02 09:20:09,110 - FederatedClient - INFO - Starting to push execute result.
2022-09-02 09:20:09,231 - Communicator - INFO - Send submitUpdate to fl_project server
2022-09-02 09:20:09,277 - FederatedClient - INFO - Starting to send aux messagee.
2022-09-02 09:20:09,279 - Communicator - INFO - Send AuxMessage to fl_project server
check status from process listener......
2022-09-02 09:20:13,478 - Communicator - INFO - Received comments: fl_project Received from site-1 (44761705 Bytes, 1662110413 seconds). SubmitUpdate time: 4.246363162994385 seconds
2022-09-02 09:20:13,485 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=8436b023-3e35-4b54-83ea-0d89e5a9eefd]: result sent to server for task: name=train, id=8436b023-3e35-4b54-83ea-0d89e5a9eefd
2022-09-02 09:20:18,491 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-09-02 09:20:18,501 - FederatedClient - INFO - Starting to fetch execute task.
2022-09-02 09:20:18,514 - Communicator - INFO - Received from fl_project server  (407 Bytes). getTask time: 0.010430097579956055 seconds
2022-09-02 09:20:18,519 - FederatedClient - INFO - pull_task completed. Task name:__try_again__ Status:True 
2022-09-02 09:20:18,521 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: server asked to try again - will try in 2 secs
2022-09-02 09:20:20,524 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-09-02 09:20:20,526 - FederatedClient - INFO - Starting to fetch execute task.
2022-09-02 09:20:20,538 - Communicator - INFO - Received from fl_project server  (407 Bytes). getTask time: 0.009468317031860352 seconds
2022-09-02 09:20:20,542 - FederatedClient - INFO - pull_task completed. Task name:__try_again__ Status:True 
2022-09-02 09:20:20,542 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: server asked to try again - will try in 2 secs
2022-09-02 09:20:22,545 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-09-02 09:20:22,548 - FederatedClient - INFO - Starting to fetch execute task.
2022-09-02 09:20:22,559 - Communicator - INFO - Received from fl_project server  (407 Bytes). getTask time: 0.00889897346496582 seconds
2022-09-02 09:20:22,562 - FederatedClient - INFO - pull_task completed. Task name:__try_again__ Status:True 
2022-09-02 09:20:22,562 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: server asked to try again - will try in 2 secs
check status from process listener......
2022-09-02 09:20:24,565 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-09-02 09:20:24,570 - FederatedClient - INFO - Starting to fetch execute task.
2022-09-02 09:20:24,584 - Communicator - INFO - Received from fl_project server  (407 Bytes). getTask time: 0.011002063751220703 seconds
2022-09-02 09:20:24,589 - FederatedClient - INFO - pull_task completed. Task name:__try_again__ Status:True 
2022-09-02 09:20:24,589 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: server asked to try again - will try in 2 secs
2022-09-02 09:20:26,592 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-09-02 09:20:26,595 - FederatedClient - INFO - Starting to fetch execute task.
2022-09-02 09:20:29,634 - Communicator - INFO - Received from fl_project server  (44761262 Bytes). getTask time: 3.0362167358398438 seconds
2022-09-02 09:20:29,638 - FederatedClient - INFO - pull_task completed. Task name:train Status:True 
2022-09-02 09:20:29,665 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: got task assignment: name=train, id=c833bbdf-b498-4014-bdfb-f77dccb4d0c0
2022-09-02 09:20:29,666 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=c833bbdf-b498-4014-bdfb-f77dccb4d0c0]: invoking task executor <class 'nvflare.app_common.executors.learner_executor.LearnerExecutor'>
2022-09-02 09:20:29,667 - LearnerExecutor - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=c833bbdf-b498-4014-bdfb-f77dccb4d0c0]: Client trainer got task: train
2022-09-02 09:20:29,667 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=c833bbdf-b498-4014-bdfb-f77dccb4d0c0]: Client identity: site-1
2022-09-02 09:20:32,001 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=c833bbdf-b498-4014-bdfb-f77dccb4d0c0]: val_acc_global_model (global_model): 0.39344262295081966
2022-09-02 09:20:32,001 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=c833bbdf-b498-4014-bdfb-f77dccb4d0c0]: Current/Total Round: 2/3
2022-09-02 09:20:32,001 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=c833bbdf-b498-4014-bdfb-f77dccb4d0c0]: Client identity: site-1
2022-09-02 09:20:32,028 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=c833bbdf-b498-4014-bdfb-f77dccb4d0c0]: Local steps per epoch: 559
2022-09-02 09:20:32,029 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=c833bbdf-b498-4014-bdfb-f77dccb4d0c0]: Local epoch site-1: 1/1 (lr=0.002)
2022-09-02 09:20:32,822 - FederatedClient - INFO - Starting to send aux messagee.
2022-09-02 09:20:32,825 - Communicator - INFO - Send AuxMessage to fl_project server
check status from process listener......
check status from process listener......
check status from process listener......
2022-09-02 09:21:01,370 - FederatedClient - INFO - Starting to send aux messagee.
2022-09-02 09:21:01,372 - Communicator - INFO - Send AuxMessage to fl_project server
2022-09-02 09:21:02,888 - FederatedClient - INFO - Starting to send aux messagee.
2022-09-02 09:21:02,891 - Communicator - INFO - Send AuxMessage to fl_project server
2022-09-02 09:21:03,605 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=c833bbdf-b498-4014-bdfb-f77dccb4d0c0]: val_acc_local_model: 0.3607
2022-09-02 09:21:03,748 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=c833bbdf-b498-4014-bdfb-f77dccb4d0c0]: Local epochs finished. Returning shareable
2022-09-02 09:21:03,749 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=c833bbdf-b498-4014-bdfb-f77dccb4d0c0]: finished processing task
2022-09-02 09:21:03,751 - FederatedClient - INFO - Starting to push execute result.
2022-09-02 09:21:03,853 - Communicator - INFO - Send submitUpdate to fl_project server
2022-09-02 09:21:04,405 - FederatedClient - INFO - Starting to send aux messagee.
2022-09-02 09:21:04,406 - Communicator - INFO - Send AuxMessage to fl_project server
check status from process listener......
2022-09-02 09:21:08,670 - Communicator - INFO - Received comments: fl_project Received from site-1 (44761705 Bytes, 1662110468 seconds). SubmitUpdate time: 4.81675386428833 seconds
2022-09-02 09:21:08,675 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=c833bbdf-b498-4014-bdfb-f77dccb4d0c0]: result sent to server for task: name=train, id=c833bbdf-b498-4014-bdfb-f77dccb4d0c0
2022-09-02 09:21:13,681 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-09-02 09:21:13,684 - FederatedClient - INFO - Starting to fetch execute task.
2022-09-02 09:21:13,693 - Communicator - INFO - Received from fl_project server  (407 Bytes). getTask time: 0.0077211856842041016 seconds
2022-09-02 09:21:13,700 - FederatedClient - INFO - pull_task completed. Task name:__try_again__ Status:True 
2022-09-02 09:21:13,702 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: server asked to try again - will try in 2 secs
check status from process listener......
2022-09-02 09:21:15,705 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-09-02 09:21:15,708 - FederatedClient - INFO - Starting to fetch execute task.
2022-09-02 09:21:15,719 - Communicator - INFO - Received from fl_project server  (407 Bytes). getTask time: 0.009313106536865234 seconds
2022-09-02 09:21:15,722 - FederatedClient - INFO - pull_task completed. Task name:__try_again__ Status:True 
2022-09-02 09:21:15,722 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: server asked to try again - will try in 2 secs
2022-09-02 09:21:17,725 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-09-02 09:21:17,728 - FederatedClient - INFO - Starting to fetch execute task.
2022-09-02 09:21:17,738 - Communicator - INFO - Received from fl_project server  (407 Bytes). getTask time: 0.008515119552612305 seconds
2022-09-02 09:21:17,740 - FederatedClient - INFO - pull_task completed. Task name:__try_again__ Status:True 
2022-09-02 09:21:17,741 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: server asked to try again - will try in 2 secs
2022-09-02 09:21:19,743 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-09-02 09:21:19,746 - FederatedClient - INFO - Starting to fetch execute task.
2022-09-02 09:21:22,630 - Communicator - INFO - Received from fl_project server  (44761262 Bytes). getTask time: 2.882575273513794 seconds
2022-09-02 09:21:22,633 - FederatedClient - INFO - pull_task completed. Task name:train Status:True 
2022-09-02 09:21:22,661 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: got task assignment: name=train, id=0af33632-b562-4ea2-997b-290104055229
2022-09-02 09:21:22,662 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=0af33632-b562-4ea2-997b-290104055229]: invoking task executor <class 'nvflare.app_common.executors.learner_executor.LearnerExecutor'>
2022-09-02 09:21:22,662 - LearnerExecutor - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=0af33632-b562-4ea2-997b-290104055229]: Client trainer got task: train
2022-09-02 09:21:22,662 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=0af33632-b562-4ea2-997b-290104055229]: Client identity: site-1
2022-09-02 09:21:24,810 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=0af33632-b562-4ea2-997b-290104055229]: val_acc_global_model (global_model): 0.36065573770491804
2022-09-02 09:21:24,810 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=0af33632-b562-4ea2-997b-290104055229]: Current/Total Round: 3/3
2022-09-02 09:21:24,810 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=0af33632-b562-4ea2-997b-290104055229]: Client identity: site-1
2022-09-02 09:21:24,839 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=0af33632-b562-4ea2-997b-290104055229]: Local steps per epoch: 559
2022-09-02 09:21:24,840 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=0af33632-b562-4ea2-997b-290104055229]: Local epoch site-1: 1/1 (lr=0.002)
2022-09-02 09:21:24,942 - FederatedClient - INFO - Starting to send aux messagee.
2022-09-02 09:21:24,944 - Communicator - INFO - Send AuxMessage to fl_project server
check status from process listener......
check status from process listener......
check status from process listener......
2022-09-02 09:21:53,489 - FederatedClient - INFO - Starting to send aux messagee.
2022-09-02 09:21:53,492 - Communicator - INFO - Send AuxMessage to fl_project server
2022-09-02 09:21:55,006 - FederatedClient - INFO - Starting to send aux messagee.
2022-09-02 09:21:55,008 - Communicator - INFO - Send AuxMessage to fl_project server
2022-09-02 09:21:55,127 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=0af33632-b562-4ea2-997b-290104055229]: val_acc_local_model: 0.3934
2022-09-02 09:21:55,288 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=0af33632-b562-4ea2-997b-290104055229]: Local epochs finished. Returning shareable
2022-09-02 09:21:55,288 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=0af33632-b562-4ea2-997b-290104055229]: finished processing task
2022-09-02 09:21:55,291 - FederatedClient - INFO - Starting to push execute result.
2022-09-02 09:21:55,350 - Communicator - INFO - Send submitUpdate to fl_project server
2022-09-02 09:21:55,521 - FederatedClient - INFO - Starting to send aux messagee.
2022-09-02 09:21:55,521 - Communicator - INFO - Send AuxMessage to fl_project server
check status from process listener......
2022-09-02 09:22:00,292 - Communicator - INFO - Received comments: fl_project Received from site-1 (44761705 Bytes, 1662110520 seconds). SubmitUpdate time: 4.9421961307525635 seconds
2022-09-02 09:22:00,295 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=train, task_id=0af33632-b562-4ea2-997b-290104055229]: result sent to server for task: name=train, id=0af33632-b562-4ea2-997b-290104055229
2022-09-02 09:22:05,301 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-09-02 09:22:05,304 - FederatedClient - INFO - Starting to fetch execute task.
2022-09-02 09:22:05,314 - Communicator - INFO - Received from fl_project server  (407 Bytes). getTask time: 0.008685588836669922 seconds
2022-09-02 09:22:05,318 - FederatedClient - INFO - pull_task completed. Task name:__try_again__ Status:True 
2022-09-02 09:22:05,319 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: server asked to try again - will try in 2 secs
check status from process listener......
2022-09-02 09:22:07,322 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-09-02 09:22:07,324 - FederatedClient - INFO - Starting to fetch execute task.
2022-09-02 09:22:07,335 - Communicator - INFO - Received from fl_project server  (407 Bytes). getTask time: 0.008904457092285156 seconds
2022-09-02 09:22:07,338 - FederatedClient - INFO - pull_task completed. Task name:__try_again__ Status:True 
2022-09-02 09:22:07,338 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: server asked to try again - will try in 2 secs
2022-09-02 09:22:09,340 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-09-02 09:22:09,343 - FederatedClient - INFO - Starting to fetch execute task.
2022-09-02 09:22:09,353 - Communicator - INFO - Received from fl_project server  (407 Bytes). getTask time: 0.009389400482177734 seconds
2022-09-02 09:22:09,357 - FederatedClient - INFO - pull_task completed. Task name:__try_again__ Status:True 
2022-09-02 09:22:09,357 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: server asked to try again - will try in 2 secs
2022-09-02 09:22:11,360 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-09-02 09:22:11,363 - FederatedClient - INFO - Starting to fetch execute task.
2022-09-02 09:22:14,318 - Communicator - INFO - Received from fl_project server  (44761565 Bytes). getTask time: 2.950767755508423 seconds
2022-09-02 09:22:14,320 - FederatedClient - INFO - pull_task completed. Task name:validate Status:True 
2022-09-02 09:22:14,344 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: got task assignment: name=validate, id=c6185ff1-a57a-47d1-bc38-2f57a58a23be
2022-09-02 09:22:14,345 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=c6185ff1-a57a-47d1-bc38-2f57a58a23be]: invoking task executor <class 'nvflare.app_common.executors.learner_executor.LearnerExecutor'>
2022-09-02 09:22:14,345 - LearnerExecutor - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=c6185ff1-a57a-47d1-bc38-2f57a58a23be]: Client trainer got task: validate
2022-09-02 09:22:14,346 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=c6185ff1-a57a-47d1-bc38-2f57a58a23be]: Client identity: site-1
2022-09-02 09:22:14,346 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=c6185ff1-a57a-47d1-bc38-2f57a58a23be]: Evaluating model from SRV_FL_global_model.pt on site-1
check status from process listener......
check status from process listener......
2022-09-02 09:22:30,872 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=c6185ff1-a57a-47d1-bc38-2f57a58a23be]: training acc (SRV_FL_global_model.pt): 0.3953488372093023
2022-09-02 09:22:33,380 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=c6185ff1-a57a-47d1-bc38-2f57a58a23be]: validation acc (SRV_FL_global_model.pt): 0.39344262295081966
2022-09-02 09:22:33,380 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=c6185ff1-a57a-47d1-bc38-2f57a58a23be]: Evaluation finished. Returning shareable
2022-09-02 09:22:33,381 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=c6185ff1-a57a-47d1-bc38-2f57a58a23be]: finished processing task
2022-09-02 09:22:33,384 - FederatedClient - INFO - Starting to push execute result.
2022-09-02 09:22:33,385 - Communicator - INFO - Send submitUpdate to fl_project server
2022-09-02 09:22:33,394 - Communicator - INFO - Received comments: fl_project Received from site-1 (9131 Bytes, 1662110553 seconds). SubmitUpdate time: 0.008978605270385742 seconds
2022-09-02 09:22:33,396 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=c6185ff1-a57a-47d1-bc38-2f57a58a23be]: result sent to server for task: name=validate, id=c6185ff1-a57a-47d1-bc38-2f57a58a23be
check status from process listener......
2022-09-02 09:22:38,402 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-09-02 09:22:38,405 - FederatedClient - INFO - Starting to fetch execute task.
2022-09-02 09:22:41,398 - Communicator - INFO - Received from fl_project server  (44761570 Bytes). getTask time: 2.990950107574463 seconds
2022-09-02 09:22:41,403 - FederatedClient - INFO - pull_task completed. Task name:validate Status:True 
2022-09-02 09:22:41,436 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: got task assignment: name=validate, id=d25dbf58-3b8b-467a-bf3b-e7dcc7624414
2022-09-02 09:22:41,437 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=d25dbf58-3b8b-467a-bf3b-e7dcc7624414]: invoking task executor <class 'nvflare.app_common.executors.learner_executor.LearnerExecutor'>
2022-09-02 09:22:41,437 - LearnerExecutor - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=d25dbf58-3b8b-467a-bf3b-e7dcc7624414]: Client trainer got task: validate
2022-09-02 09:22:41,437 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=d25dbf58-3b8b-467a-bf3b-e7dcc7624414]: Client identity: site-1
2022-09-02 09:22:41,437 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=d25dbf58-3b8b-467a-bf3b-e7dcc7624414]: Evaluating model from SRV_best_FL_global_model.pt on site-1
check status from process listener......
2022-09-02 09:22:56,763 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=d25dbf58-3b8b-467a-bf3b-e7dcc7624414]: training acc (SRV_best_FL_global_model.pt): 0.3953488372093023
check status from process listener......
2022-09-02 09:22:59,221 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=d25dbf58-3b8b-467a-bf3b-e7dcc7624414]: validation acc (SRV_best_FL_global_model.pt): 0.39344262295081966
2022-09-02 09:22:59,221 - MammoLearner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=d25dbf58-3b8b-467a-bf3b-e7dcc7624414]: Evaluation finished. Returning shareable
2022-09-02 09:22:59,223 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=d25dbf58-3b8b-467a-bf3b-e7dcc7624414]: finished processing task
2022-09-02 09:22:59,225 - FederatedClient - INFO - Starting to push execute result.
2022-09-02 09:22:59,227 - Communicator - INFO - Send submitUpdate to fl_project server
2022-09-02 09:22:59,236 - Communicator - INFO - Received comments: fl_project Received from site-1 (9136 Bytes, 1662110579 seconds). SubmitUpdate time: 0.009012460708618164 seconds
2022-09-02 09:22:59,238 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1, task_name=validate, task_id=d25dbf58-3b8b-467a-bf3b-e7dcc7624414]: result sent to server for task: name=validate, id=d25dbf58-3b8b-467a-bf3b-e7dcc7624414
2022-09-02 09:23:04,244 - ClientRunner - INFO - [run=1]: fetching task from server ...
2022-09-02 09:23:04,247 - FederatedClient - INFO - Starting to fetch execute task.
2022-09-02 09:23:04,258 - Communicator - INFO - Received from fl_project server  (383 Bytes). getTask time: 0.008490562438964844 seconds
2022-09-02 09:23:04,262 - FederatedClient - INFO - pull_task completed. Task name:__end_run__ Status:True 
2022-09-02 09:23:04,262 - ClientRunner - INFO - [run=1, peer=fl_project, peer_run=1]: server asked to end the run
2022-09-02 09:23:04,268 - ClientRunner - INFO - [run=1]: run method requests end run events sequence
2022-09-02 09:23:06,271 - ClientRunner - INFO - [run=1]: ABOUT_TO_END_RUN fired
check status from process listener......
2022-09-02 09:23:09,957 - ClientRunner - INFO - [run=1]: received aux request from Server to end current RUN
2022-09-02 09:23:09,958 - ClientRunner - INFO - [run=1]: ABORT (RUN) command received
2022-09-02 09:23:09,958 - ClientRunner - INFO - [run=1]: ABORT (RUN) requests end run events sequence
2022-09-02 09:23:11,441 - ClientRunner - INFO - [run=1]: END_RUN fired
2022-09-02 09:23:19,177 - ProcessExecutor - ERROR - Check_status execution exception.
2022-09-02 09:23:22,725 - ProcessExecutor - INFO - process finished with execution code: 0
2022-09-02 09:23:42,007 - ClientEngine - INFO - Client shutdown...
About to shutdown the client...
2022-09-02 09:23:45,038 - FederatedClient - INFO - Shutting down client
About to shutdown.

Note that I run it for a couple of times to make sure that logs are the same every time.

unalakunal avatar Sep 02 '22 09:09 unalakunal

Okay. So, this points to a resource issue when running all three clients. 1 client seems to be running fine. What's the GPU memory usage when you run just one client?

holgerroth avatar Sep 02 '22 15:09 holgerroth

Are you using CacheDataset or SmartCacheDataset? I found the -9 error when trying to cache the dataset and did not have enough RAM memory to do it.

Hope this helps.

diazandr3s avatar Sep 06 '22 21:09 diazandr3s

That's a very good point. We are using CacheData. @unalakunal, you can try reducing the cache_rate here.

holgerroth avatar Sep 07 '22 00:09 holgerroth