Inconsistent final models trained by Swarm Learning
Issue description
- issue description: We observed inconsistency in the final models trained by Swarm Learning. Two nodes are involved in the swarm training, yet whether we pick the last or the best checkpoint for prediction, the results produced by the two nodes differ significantly from each other (a diagnostic sketch for comparing the two nodes' checkpoints follows below).
- occurrence - consistent or rare: Consistent
- error messages: None
- commands used for starting containers:
- docker logs [APLS, SPIRE, SN, SL, SWCI]: swop_u.log swci_u.log sn_u.log sl_u.log ml_u.log
Python scripts used to reproduce this problem: base_model.txt main.txt
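A minimal sketch of how the two nodes' final checkpoints could be compared, to help narrow down where the divergence appears. The class name `LitModel`, the checkpoint paths, and `val_loader` are hypothetical placeholders (the real names live in base_model.txt / main.txt), and the snippet assumes standard PyTorch Lightning checkpoints:

```python
# Diagnostic sketch (assumptions: Lightning-style checkpoints, placeholder paths and names).
import torch
import pytorch_lightning as pl


def weights_match(ckpt_a: str, ckpt_b: str, atol: float = 1e-6) -> bool:
    """Return True if two Lightning checkpoints hold numerically identical weights."""
    sd_a = torch.load(ckpt_a, map_location="cpu")["state_dict"]
    sd_b = torch.load(ckpt_b, map_location="cpu")["state_dict"]
    if sd_a.keys() != sd_b.keys():
        return False
    return all(torch.allclose(sd_a[k], sd_b[k], atol=atol) for k in sd_a)


def evaluate_checkpoint(ckpt_path, model_cls, dataloader):
    """Load a checkpoint and return its validation metrics on a shared loader."""
    model = model_cls.load_from_checkpoint(ckpt_path)
    trainer = pl.Trainer(accelerator="auto", devices=1, logger=False)
    return trainer.validate(model, dataloaders=dataloader, verbose=False)[0]


if __name__ == "__main__":
    # Hypothetical paths to the "last"/"best" checkpoints saved on each SL node.
    checkpoints = {
        "node1_last": "node1/checkpoints/last.ckpt",
        "node2_last": "node2/checkpoints/last.ckpt",
        "node1_best": "node1/checkpoints/best.ckpt",
        "node2_best": "node2/checkpoints/best.ckpt",
    }
    print("last checkpoints identical:",
          weights_match(checkpoints["node1_last"], checkpoints["node2_last"]))
    # Evaluate all four checkpoints on the same validation loader, e.g.:
    # metrics = {name: evaluate_checkpoint(path, LitModel, val_loader)
    #            for name, path in checkpoints.items()}
```

If the post-merge "last" checkpoints do not match weight-for-weight across the two nodes, that would point at the merge/sync step rather than at local evaluation; if they do match, the divergence is more likely in how each node selects or evaluates its "best" checkpoint.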
Swarm Learning Version:
2.2.0
- Docker tag of the Swarm images ( $ docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning ): 2.2.0
OS and ML Platform
- details of host OS:
- details of ML platform used: PyTorch Lightning
- details of Swarm Learning cluster (number of machines, SL nodes, SN nodes): 2 machines, 2 SL-ML node pairs
Quick Checklist: Respond [Yes/No]
- APLS server web GUI shows available Licenses?
- If Multiple systems are used, can each system access every other system?
- Is Password-less SSH configuration setup for all the systems?
- If GPU or other protected resources are used, does the account have sufficient privileges to access and use them?
- Is the user id a member of the docker group?
Additional notes
- Are you running documented example without any modification?
- Add any additional information about use case or any notes which supports for issue investigation:
NOTE: Create an archive with supporting artifacts and attach to issue, whenever applicable.
@Ultimate-Storm Can you please re-upload the logs? I am getting a 404 when downloading the attached logs.