swarm-learning
errors : Unable to extract container id
Issue description
- issue description: The SWCL node task reports errors while running the MNIST example. SWOP reports the error "Unable to extract container ID".
- occurrence - consistent :
- error messages: Unable to extract container ID
Swarm Learning Version: 1.00
OS and ML Platform
- details of host OS: Ubuntu 22.04 LTS
Additional notes
- Are you running the documented example without any modification? I only modified the IP in the profile files under folder.
I believe Ubuntu 22.04 uses cgroup v2 by default.
That may be what causes this error.
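To confirm which cgroup layout a host actually uses, one option is to look at the filesystem type mounted at /sys/fs/cgroup. The following is a minimal sketch, not part of Swarm Learning; the function name cgroup_version_from_fstype is hypothetical. It assumes the usual convention that cgroup2fs indicates the unified v2 hierarchy and tmpfs indicates a v1 (or hybrid) layout.

```shell
#!/bin/sh
# Hypothetical helper: map the filesystem type of /sys/fs/cgroup to a cgroup version.
# cgroup2fs -> unified hierarchy (v2); tmpfs -> legacy or hybrid layout (v1).
cgroup_version_from_fstype() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1" ;;
        *)         echo "unknown" ;;
    esac
}

# Ask the kernel which filesystem is mounted there; empty if the path is missing.
fstype="$(stat -fc %T /sys/fs/cgroup 2>/dev/null || true)"
echo "cgroup version on this host: $(cgroup_version_from_fstype "${fstype}")"
```

Note that this reports what is mounted, which is a stronger statement than what the kernel merely supports.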
How about the following as a tentative work-around?
I modified the run-swop script as follows:
diff --git scripts/bin/run-swop scripts/bin/run-swop
index 1987e15..e8a30ba 100755
--- scripts/bin/run-swop
+++ scripts/bin/run-swop
@@ -97,6 +97,8 @@ processComponentBatchOpt()
--swop-uid) checkAndAssign "${opt}" "${optarg}"
re='^[0-9]+$'
[[ ! "${optarg}" =~ ${re} ]] && error "${opt}: ${optarg}: bad user id";;
+ --cgroupns) checkAndAssign "${opt}" "${optarg}"
+ ;;
*) unprocessedOpts+=("${origParam}"); nShift=1;;
esac
@@ -126,6 +128,8 @@ onTrainEnd()
envvar+=(-e "SWOP_GID=${dockerGroupId}")
envvar+=(-e "SWOP_PROFILE=${profileFileName}")
+ [[ -n "${cgroupns}" ]] && miscDockerRunParams+=(--cgroupns ${cgroupns})
+
cmd+=("${unprocessedOpts[@]}")
unprocessedOpts=()
And I added --cgroupns host to the command-line options of the run-swop script, and it works on Ubuntu 22.04.
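A possible explanation for why --cgroupns host helps can be sketched as follows. This thread does not show how SWOP extracts the container ID, so the sketch assumes it is read from a cgroup path such as the ones in /proc/self/cgroup. On a cgroup v2 host, Docker defaults to a private cgroup namespace, in which that path is reduced to "/" and no ID is visible. The function name and the sample lines below are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: pull a 64-character hex container ID out of one
# /proc/self/cgroup line. Prints the ID, or returns non-zero if none is present.
container_id_from_cgroup_line() {
    match="$(printf '%s\n' "$1" | grep -oE '[0-9a-f]{64}' | head -n 1)"
    [ -n "${match}" ] && printf '%s\n' "${match}"
}

# Made-up 64-character ID used only for the sample lines below.
sample_id="$(printf '%064d' 0 | tr 0 a)"

# Illustrative lines: cgroup v1, cgroup v2 with the host namespace,
# and cgroup v2 with a private namespace.
for line in \
    "12:memory:/docker/${sample_id}" \
    "0::/system.slice/docker-${sample_id}.scope" \
    "0::/"
do
    if found_id="$(container_id_from_cgroup_line "${line}")"; then
        echo "${line} -> ${found_id}"
    else
        echo "${line} -> no container ID found"
    fi
done
```

Under this assumption, only the last line, the private-namespace case, yields no ID, which would match the behaviour reported on Ubuntu 22.04.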
Thank you @IMOKURI for the workaround. If it works, other users can use it too. We are working on a fix, which should be available in our upcoming release.
@IMOKURI Thank you for your help. It is indeed a problem with the Ubuntu version. Your modified script works well on Ubuntu 22.04.
I have the same issue: the container ID cannot be extracted.
The changes mentioned above do not work for me. I still get the same error.
On my CentOS 8 system, both versions of cgroups are supported.
$ ~/git/swarm-learning > grep cgroup /proc/filesystems
nodev cgroup
nodev cgroup2
Any ideas?
Since CentOS 8 uses cgroup v1 by default, I don't think this workaround is necessary.
If you still get the "Unable to extract container ID" error, there may be another problem.
It is better to submit a new issue according to the issue template.
In the released version https://github.com/HewlettPackard/swarm-learning/releases/tag/v1.1.0, the workaround https://github.com/HewlettPackard/swarm-learning/issues/103#issuecomment-1164030572 is no longer required.
Closing this issue, as it is addressed in the latest 1.1.0 release.