SN failed to start with swarm-learning version 1.1.0

Open h-ahmad opened this issue 2 years ago • 0 comments

Issue description

  • issue description: SN starts with the old version of swarm-learning but fails to start after upgrading to version 1.1.0.
  • occurrence - consistent or rare: Recurring
  • error messages: The error message is attached below.

image

  • commands used for starting containers: The command is shown in the screenshot above.
  • docker logs [APLS, SPIRE, SN, SL, SWCI]: The Docker container did not start.

Swarm Learning Version:

  • Find the docker tag of the Swarm images ( $ docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning )

OS and ML Platform

  • details of host OS: Ubuntu 20.04
  • details of ML platform used:
  • details of Swarm learning Cluster (Number of machines, SL nodes, SN nodes): Default, as given in the example.

Quick Checklist: Respond [Yes/No]

  • APLS server web GUI shows available Licenses? Yes
  • If Multiple systems are used, can each system access every other system? Yes
  • Is Password-less SSH configuration setup for all the systems? Yes
  • If GPU or other protected resources are used, does the account have sufficient privileges to access and use them? Yes
  • Is the user id a member of the docker group? Yes

Additional notes

  • Are you running the documented example without any modification? Yes.
  • Everything is left at its defaults.
  • The HPE license server is working.
  • It works when I replace the new 'swarm-learning' folder with the old one, but the SN node does not start with the new swarm-learning folder in the directory '/opt/hpe/swarm-learning'.

NOTE: Create an archive with supporting artifacts and attach to issue, whenever applicable.

h-ahmad Sep 08 '22 12:09