SN failed to start with swarm-learning version 1.1.0
Issue description
- issue description: SN starts with the old version of swarm-learning but fails to start after upgrading to version 1.1.0.
- occurrence - consistent or rare: Consistent (recurring)
- error messages: The error message is attached below.
- commands used for starting containers: The command is shown in the screenshot above.
- docker logs [APLS, SPIRE, SN, SL, SWCI]: Not available; the Docker container did not start.
Swarm Learning Version:
- Find the docker tag of the Swarm images ( $ docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning )
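
A minimal sketch for reporting the tag, assuming the images were pulled from the registry path quoted above. The function name `list_swarm_tags` is illustrative, not part of Swarm Learning; `--format` is a standard `docker images` option.

```shell
# Print repository:tag for every local image under the Swarm Learning
# registry path quoted in the template line above.
# list_swarm_tags is a hypothetical helper name.
list_swarm_tags() {
    docker images --format '{{.Repository}}:{{.Tag}}' \
        | grep -F 'hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning' \
        | sort -u
}
```

Running `list_swarm_tags` on the host prints one line per image, so the tag can be pasted directly into this section.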
OS and ML Platform
- details of host OS: Ubuntu 20.04
- details of ML platform used:
- details of Swarm Learning cluster (number of machines, SL nodes, SN nodes): Default, as given in the example.
Quick Checklist: Respond [Yes/No]
- APLS server web GUI shows available Licenses? Yes
- If Multiple systems are used, can each system access every other system? Yes
- Is Password-less SSH configuration setup for all the systems? Yes
- If GPU or other protected resources are used, does the account have sufficient privileges to access and use them? Yes
- Is the user id a member of the docker group? Yes
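
The docker group answer can be double-checked from a shell. This is a small sketch using only the standard `id`, `tr`, and `grep` utilities; the function name `in_docker_group` is illustrative.

```shell
# Succeed (exit 0) if the given user, or the current user when no
# argument is passed, belongs to the docker group.
# in_docker_group is a hypothetical helper name.
in_docker_group() {
    id -nG "${1:-$(id -un)}" | tr ' ' '\n' | grep -qx 'docker'
}
```

For example, `in_docker_group && echo Yes || echo No` prints the checklist answer for the current user.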
Additional notes
- Are you running documented example without any modification? Yes.
- Everything uses the default settings.
- The HPE license server is working.
- SN starts when I replace the new 'swarm-learning' folder with the old one, but it does not start with the new 'swarm-learning' folder in the directory '/opt/hpe/swarm-learning'.
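
Since only the folder contents differ between the working and failing setups, a recursive comparison may help narrow down which scripts or files changed. This is a sketch only: `compare_trees` is an illustrative name, and it assumes the old folder is still available at some path of your choosing.

```shell
# List files that differ between two directory trees, plus files that
# exist in only one of them. Like diff itself, this returns exit
# status 1 when differences are found.
# compare_trees is a hypothetical helper name.
compare_trees() {
    diff -rq "$1" "$2"
}
```

Usage, where `/path/to/old/swarm-learning` is a placeholder for wherever the old folder was kept: `compare_trees /path/to/old/swarm-learning /opt/hpe/swarm-learning`.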