clustershell
Node names get mangled when a node name contains one or more 0 characters
We have two nodes, listserv01 and listserv3. When both are passed to clush together, the second node name, listserv3, gets mangled into listserv03:
$ clush -w listserv01,listserv3 uname -r
listserv03: 2.6.32-754.30.2.el6.x86_64
listserv01: 4.18.0-193.6.3.el8_2.x86_64
When taken alone there is no issue with the listserv3 node name:
$ clush -w listserv3 uname -r
listserv3: 2.6.32-754.30.2.el6.x86_64
(Note: I had to define a DNS CNAME alias listserv03 pointing to listserv3 as a workaround).
The incorrect leading zero gets added to all subsequent names starting with "listserv":
$ clush -w listserv01,listserv3,listserv2 uname -r
listserv02: ssh: Could not resolve hostname listserv02: Name or service not known
clush: listserv02: exited with exit code 255
listserv03: 2.6.32-754.30.2.el6.x86_64
listserv01: 4.18.0-193.6.3.el8_2.x86_64
Yet another example with "00":
$ clush -w a001,a2 uname -r
a001: 3.10.0-1127.8.2.el7.x86_64
a002: 3.10.0-1127.8.2.el7.x86_64
So it seems that the presence of a "0" character in one node name triggers the bug: leading zeroes are then incorrectly added to the other node names in the list.
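The outputs above are consistent with a single zero-padding width being applied to every node name that shares a prefix. The following is only a minimal Python sketch of that hypothesis, written to reproduce the observed names. It is not ClusterShell's code, and the helper names split_name, expand_shared_padding and mangled_names are made up for illustration.

```python
import re
from collections import defaultdict


def split_name(name):
    """Split 'listserv01' into ('listserv', '01').

    Names without a trailing number yield (name, '').
    """
    match = re.fullmatch(r"(.*?)(\d*)", name)
    return match.group(1), match.group(2)


def expand_shared_padding(names):
    """Hypothetical model of the bug (an assumption, not ClusterShell code).

    One padding width is kept per prefix, taken from the widest
    zero-padded suffix, and then applied to every name with that prefix.
    """
    width = defaultdict(int)
    for name in names:
        prefix, digits = split_name(name)
        # Only a suffix such as '01' or '001' counts as zero-padded.
        if len(digits) > 1 and digits.startswith("0"):
            width[prefix] = max(width[prefix], len(digits))

    result = []
    for name in names:
        prefix, digits = split_name(name)
        if digits:
            # zfill(0) leaves the number unchanged when no padding was seen.
            result.append(prefix + str(int(digits)).zfill(width[prefix]))
        else:
            result.append(name)
    return result


def mangled_names(names):
    """Return {original: altered} for names the model would change."""
    return {
        old: new
        for old, new in zip(names, expand_shared_padding(names))
        if old != new
    }
```

Under this model, mangled_names(["listserv01", "listserv3", "listserv2"]) reports listserv3 and listserv2 as changed to listserv03 and listserv02, while mangled_names(["listserv3"]) is empty, which matches the clush outputs shown above. A check like this could be run over a node list before calling clush on an affected version.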
We use the following packages:
EPEL6: clustershell-1.8.3-1.el6_10.noarch
EPEL7: clustershell-1.8.3-1.el7.noarch
Fedora FC32: clustershell-1.8.3-2.fc32.noarch
I think your ticket is a duplicate of #293.
Thanks! This is a pretty surprising bug.
I've added references to this issue in my Slurm Wiki page https://wiki.fysik.dtu.dk/niflheim/SLURM#clustershell
Fixed in https://github.com/cea-hpc/clustershell/commit/5a41bc09f70309600c1a407d2bb3dd08f5d1ba65