
Error: Unable to extract container ID

Open XiaoranSU opened this issue 2 years ago • 7 comments

Issue description

  • issue description: The SWCL node task reports errors while running the MNIST example. SWOP reports the error "Unable to extract container ID".
  • occurrence: consistent
  • error messages: Unable to extract container ID

Swarm Learning Version: 1.00

OS and ML Platform

  • details of host OS: Ubuntu 22.04 LTS

Additional notes

  • Are you running a documented example without any modification? I only modified the IP in the profile files under folder.

XiaoranSU avatar Jun 22 '22 11:06 XiaoranSU

I believe Ubuntu 22.04 uses cgroup v2 by default.

That may have affected it.
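
One way to confirm which cgroup version a host is using is to look at the filesystem type mounted at /sys/fs/cgroup: with GNU coreutils, stat -fc %T reports cgroup2fs on a cgroup v2 (unified) host and typically tmpfs on a cgroup v1 host. The helper names below are illustrative only and are not part of Swarm Learning; this is a minimal sketch, not a definitive check.

```shell
# Map the filesystem type of /sys/fs/cgroup to a cgroup version label.
# (Hypothetical helper; not part of the Swarm Learning scripts.)
cgroup_version_from_fstype() {
    case "$1" in
        cgroup2fs) echo "v2" ;;       # unified hierarchy
        tmpfs)     echo "v1" ;;       # legacy or hybrid hierarchy
        *)         echo "unknown" ;;  # missing path, non-GNU stat, etc.
    esac
}

# Query the local host; assumes GNU stat is available.
detect_cgroup_version() {
    cgroup_version_from_fstype "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
}

detect_cgroup_version
```

If this prints v2 on the machine running SWOP, the cgroup explanation above is at least plausible for that host.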

IMOKURI avatar Jun 23 '22 00:06 IMOKURI

How about the following as a tentative work-around?

I modified the run-swop script as follows:

diff --git scripts/bin/run-swop scripts/bin/run-swop
index 1987e15..e8a30ba 100755
--- scripts/bin/run-swop
+++ scripts/bin/run-swop
@@ -97,6 +97,8 @@ processComponentBatchOpt()
         --swop-uid) checkAndAssign "${opt}" "${optarg}"
             re='^[0-9]+$'
             [[ ! "${optarg}" =~ ${re} ]] && error "${opt}: ${optarg}: bad user id";;
+        --cgroupns) checkAndAssign "${opt}" "${optarg}"
+            ;;
         *) unprocessedOpts+=("${origParam}"); nShift=1;;
     esac

@@ -126,6 +128,8 @@ onTrainEnd()
     envvar+=(-e "SWOP_GID=${dockerGroupId}")
     envvar+=(-e "SWOP_PROFILE=${profileFileName}")

+    [[ -n "${cgroupns}" ]] && miscDockerRunParams+=(--cgroupns ${cgroupns})
+
     cmd+=("${unprocessedOpts[@]}")
     unprocessedOpts=()

I then added --cgroupns host to the command-line options of the run-swop script, and it works on Ubuntu 22.04.
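
The second hunk of the patch boils down to appending an optional flag to a bash array only when a value was supplied. The stand-alone sketch below mirrors that logic outside of run-swop so it can be tried without Docker. build_docker_params is a hypothetical name; only the miscDockerRunParams and cgroupns variable names come from the patch. --cgroupns itself is a standard docker run option (Docker 20.10 or later) that accepts host or private.

```shell
#!/usr/bin/env bash
# Sketch of the conditional-flag pattern used in the patch above.
# build_docker_params is a hypothetical helper, not part of run-swop.
build_docker_params() {
    local cgroupns="$1"
    local -a miscDockerRunParams=()

    # Only add the flag when a value was given; quote the expansion so an
    # unexpected value containing spaces stays a single argument.
    if [[ -n "${cgroupns}" ]]; then
        miscDockerRunParams+=(--cgroupns "${cgroupns}")
    fi

    # Print the resulting parameters on one line (empty when no value was set).
    printf '%s\n' "${miscDockerRunParams[*]}"
}

build_docker_params host   # flag and value are emitted
build_docker_params ""     # nothing is added
```

With the patch applied, passing --cgroupns host to run-swop should result in the same two arguments being forwarded to docker run.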

IMOKURI avatar Jun 23 '22 07:06 IMOKURI

Thank you @IMOKURI for the workaround. If it works, other users can use it as well. We are working on a fix, which should be available in our upcoming release.

iArpanPatel avatar Jun 23 '22 07:06 iArpanPatel

@IMOKURI Thank you for your help. It is indeed a problem with the Ubuntu version. Your changed script works well on Ubuntu 22.04.

XiaoranSU avatar Jun 23 '22 08:06 XiaoranSU

I have the same issue: the container ID cannot be extracted.

The changes mentioned above do not work for me; I still get the same error.

On my CentOS 8 system, both versions of cgroups are supported.

$ ~/git/swarm-learning > grep cgroup /proc/filesystems
nodev	cgroup
nodev	cgroup2

Any ideas?

maestro4 avatar Jul 28 '22 10:07 maestro4

Since CentOS 8 uses cgroup v1 by default, I don't think this workaround is necessary.

If you still get the Unable to extract container ID error, there may be another problem. It is better to submit a new issue according to the issue template.
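
Note that /proc/filesystems only lists the filesystem types the kernel supports, so seeing both cgroup and cgroup2 there does not show which hierarchy is actually in use. A rough sketch for counting the cgroup mounts of each type from /proc/mounts follows. cgroup_mounts_summary is a hypothetical helper name, and the interpretation in the comments is a general rule of thumb rather than anything specific to Swarm Learning.

```shell
# Count mounted cgroup v1 and v2 filesystems from mount-table lines on stdin.
# Field 3 of each /proc/mounts line is the filesystem type.
cgroup_mounts_summary() {
    awk '
        $3 == "cgroup"  { v1++ }
        $3 == "cgroup2" { v2++ }
        END { printf "v1=%d v2=%d\n", v1, v2 }
    '
}

# Rule of thumb: only v2 mounts -> unified (cgroup v2) host;
# only v1 mounts -> legacy host; both -> hybrid setup.
if [ -r /proc/mounts ]; then
    cgroup_mounts_summary < /proc/mounts
fi
```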

IMOKURI avatar Jul 28 '22 12:07 IMOKURI

As of the v1.1.0 release (https://github.com/HewlettPackard/swarm-learning/releases/tag/v1.1.0), the workaround in https://github.com/HewlettPackard/swarm-learning/issues/103#issuecomment-1164030572 is no longer required.

IMOKURI avatar Aug 05 '22 03:08 IMOKURI

Closing this issue, as it is addressed in the latest 1.1.0 release.

RadhakrishnaJ avatar Aug 26 '22 13:08 RadhakrishnaJ