
2.16.4: update command non operational for sles15.3 clients

Open wrussian opened this issue 3 years ago • 3 comments

I updated from version 2.16.3 to 2.16.4 for RH7 and installed sles15.3 using the workaround described in #7230. The installation completed with errors; most importantly, 'otherpkgs' did not run successfully. I then tried to re-run the postscript with updatenode, but received this error:

updatenode pnsd12 -V -P otherpkgs
Running command on vlxmn02: ip -4 --oneline addr show |awk -F ' ' '{print $4}'|awk -F '/' '{print $1}' 2>&1

Running command on vlxmn02: chmod -R a+r /install/postscripts 2>&1

vlxmn02: Internal call command: xdsh pnsd12 --nodestatus -s -v -e /install/postscripts/xcatdsklspost 1 -m 10.232.121.101:80 'otherpkgs' --tftp /tftpboot --installdir /install --nfsv4 no -c -V
Running command on vlxmn02: ip -4 --oneline addr show |awk -F ' ' '{print $4}'|awk -F '/' '{print $1}' 2>&1
Running command on vlxmn02: hostname 2>&1
Running command on vlxmn02: /opt/xcat/bin/pping pnsd12 2>&1
pnsd12: bash: /tmp/m9fr956M6M.dsh: No such file or directory
pnsd12: rm: cannot remove '/tmp/m9fr956M6M.dsh': No such file or directory
Error: [vlxmn02]: pnsd12 remote shell had exit code 1.

This error appears for every postscript triggered via the update command.

All ssh, xdsh commands started on the manager node execute without problems on the same sles15.3 client.

Older sles15.2 can execute any updatenode commands without problems.

wrussian avatar Aug 09 '22 07:08 wrussian

I forgot to mention explicitly that the errors are restricted to the postbootscript phase. There, otherpkgs and some bespoke scripts that configure firewalld, the IP for the HCA, and postfix did not work.

wrussian avatar Aug 09 '22 07:08 wrussian

@wrussian Can you show the contents of the otherpkgs list? Also, after the node installation completes with errors, is there anything interesting in the /var/log/xcat/xcat.log file on the installed compute node? For example, which one of the postscripts fails first?

gurevichmark avatar Aug 09 '22 13:08 gurevichmark

The first one can be displayed here:

#INCLUDE:/install/HPC-Image/buildDef/compute-sles15.3-x86_64/common/otherpkg.list#

Docker stuff

update-SLE-Module-Containers/noarch/docker-bash-completion-20.10.9_ce-156.1.noarch
update-SLE-Module-Containers/x86_64/containerd-1.4.11-56.1.x86_64
update-SLE-Module-Containers/x86_64/docker-20.10.9_ce-156.1.x86_64
update-SLE-Module-Containers/x86_64/runc-1.0.2-23.1.x86_64

I attached the included file. I use explicit version names, although the documentation states otherwise, to select exactly the right version: for some RPMs several versions are available in the repo, and I can switch between image definitions by means of different (git) versions of the underlying otherpkgs file. (I will try to repeat the installation with a version containing only the base RPM names.)

I grepped the xcat.log with -i -E 'error|fail|can't,|fatal|con not|return with [1-9]+' and also scanned it carefully by hand, but could not find an obvious error. I am not 100% sure about the large number of getcredentials.awk calls (all seem to succeed in the end, although the file is printed with cat and removed afterwards, because a criterion is not met beforehand(?)). I added an example here:

+ RETRY=9
+ '[' 9 -eq 10 ']'
+ '[' 0 = 1 ']'
+ getcredentials.awk ssh_dsa_hostkey
+ grep -v '<'
+ sed -e 's/</</' -e 's/>/>/' -e 's/&/&/' -e 's/&quot/"/' -e 's/'/'''/'
++ cat /etc/ssh/ssh_host_dsa_key
+ MYCONT=
+ '[' -z '' ']'
+ '[' 0 = 0 ']'
+ let SLI=11096%10
+ let SLI=SLI+10
+ sleep 16
+ RETRY=10
+ '[' 10 -eq 10 ']'
+ break
+ egrep -i '^ssh_keys:' /etc/group
+ grep 'PRIVATE KEY' /etc/ssh/ssh_host_dsa_key
+ rm /etc/ssh/ssh_host_dsa_key
+ rm /tmp/ssh_dsa_hostkey

grep '^+ getcredentials' /var/log/xcat/xcat.log | wc -l
69

I can send you the xcat.log via e-mail. I wouldn't like to post it here, because of the IP addresses and names included.

The postbootscript execution log reads as:

. . .
Tue 09 Aug 2022 09:05:36 PM CEST [info]: xcat.deployment: Running /xcatpost/mypostscript.post
Tue 09 Aug 2022 09:05:36 PM CEST [info]: xcat.deployment.postbootscript: postbootscript start..: otherpkgs
Tue 09 Aug 2022 09:13:50 PM CEST [info]: xcat.deployment.postbootscript: postbootscript end...: otherpkgs return with 9
Tue 09 Aug 2022 09:13:50 PM CEST [info]: xcat.deployment.postbootscript: postbootscript start..: syncfiles
Tue 09 Aug 2022 09:13:56 PM CEST [info]: xcat.deployment.postbootscript: postbootscript end...: syncfiles return with 0
Tue 09 Aug 2022 09:13:56 PM CEST [info]: xcat.deployment.postbootscript: postbootscript start..: sles12/all_linksCrt.sh
Tue 09 Aug 2022 09:13:56 PM CEST [info]: xcat.deployment.postbootscript: postbootscript end...: sles12/all_linksCrt.sh return with 0
Tue 09 Aug 2022 09:13:56 PM CEST [info]: xcat.deployment.postbootscript: postbootscript start..: sles12/all_postfix.sh
Tue 09 Aug 2022 09:13:59 PM CEST [info]: xcat.deployment.postbootscript: postbootscript end...: sles12/all_postfix.sh return with 0
Tue 09 Aug 2022 09:13:59 PM CEST [info]: xcat.deployment.postbootscript: postbootscript start..: sles12/cleanUp.sh
Tue 09 Aug 2022 09:13:59 PM CEST [info]: xcat.deployment.postbootscript: postbootscript end...: sles12/cleanUp.sh return with 0
Tue 09 Aug 2022 09:13:59 PM CEST [info]: xcat.deployment.postbootscript: postbootscript start..: sles12/compute_servicesEnable.sh
Tue 09 Aug 2022 09:13:59 PM CEST [info]: xcat.deployment.postbootscript: postbootscript end...: sles12/compute_servicesEnable.sh return with 1
Tue 09 Aug 2022 09:13:59 PM CEST [info]: xcat.deployment.postbootscript: postbootscript start..: sles15/cfgChronyd.sh
Tue 09 Aug 2022 09:14:00 PM CEST [info]: xcat.deployment.postbootscript: postbootscript end...: sles15/cfgChronyd.sh return with 1
Tue 09 Aug 2022 09:14:00 PM CEST [info]: xcat.deployment.postbootscript: postbootscript start..: sles15/cfgFirewalld.sh
Tue 09 Aug 2022 09:14:04 PM CEST [info]: xcat.deployment.postbootscript: postbootscript end...: sles15/cfgFirewalld.sh return with 0
Tue 09 Aug 2022 09:14:04 PM CEST [debug]: xcat.deployment.postbootscript: node boot failed, reporting status...
Tue 09 Aug 2022 09:14:04 PM CEST [error]: xcat.deployment.postbootscript: provision completed with error.(pnsd12)
Tue 09 Aug 2022 09:14:04 PM CEST [info]: xcat.deployment: /xcatpost/mypostscript.post return
Tue 09 Aug 2022 09:14:04 PM CEST [info]: xcat.deployment: =============deployment ending====================

otherpkg.zip
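
To pick the failing postscripts out of a log like the one above, a filter along these lines may help. It is only a minimal sketch: it assumes one log entry per line in the "postbootscript end...: NAME return with RC" form shown here, and the function name failed_postscripts is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of xCAT): print "NAME RC" for every
# postbootscript whose "end" line reports a non-zero return code.
# Assumes lines of the form
#   ... postbootscript end...: NAME return with RC
failed_postscripts() {
    # $1 = path to the log file, e.g. /var/log/xcat/xcat.log on the node
    sed -n 's/.*postbootscript end\.\.\.: \(.*\) return with \([0-9][0-9]*\) *$/\1 \2/p' "$1" |
        awk '$NF != 0'
}
```

Called as failed_postscripts /var/log/xcat/xcat.log, it would print only the scripts that returned something other than 0, in the order they ran, so the first line is the first failure.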

wrussian avatar Aug 09 '22 15:08 wrussian

Some more debugging steps you can try:

  • reduce otherpkgs list to a single entry
  • remove blank lines from otherpkg.list
  • uncomment set -x at the beginning of /install/postscripts/otherpkgs and then provision the node. A trace output in the /var/log/xcat/xcat.log file on the installed compute node might point to the reason why it exited with RC 9

gurevichmark avatar Aug 15 '22 18:08 gurevichmark

Many thanks for the additional hints! I think I found the reason why none of the update commands worked. There was an unnoticed regression in our Git repository, so our production package list no longer included these packages, which are present in the xCAT pkglist template file:

gzip binutils open-iscsi curl mdadm multipath-tools open-lldp fcoe-utils util-linux-systemd adaptec-firmware insserv-compat

After I updated the package list, the re-installation worked smoothly and all update commands are operational again.
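
A check like the following could catch this kind of drift between a reference pkglist and a site-specific one. It is a minimal sketch: the function name missing_pkgs is hypothetical and both file paths are supplied by the caller. It lists entries that appear in the first file but not in the second, ignoring blank lines and lines starting with #. Note that this also skips #INCLUDE:...# directives rather than expanding them.

```shell
#!/bin/sh
# Hypothetical helper: list entries present in a reference pkglist ($1)
# but missing from another pkglist ($2).
# Lines starting with '#' (comments and #INCLUDE directives) and blank
# lines are ignored; included files are NOT followed.
missing_pkgs() {
    ref=$(mktemp) && site=$(mktemp) || return 1
    # Normalise both lists: drop comments/blank lines, sort, deduplicate.
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1" | LC_ALL=C sort -u > "$ref"
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$2" | LC_ALL=C sort -u > "$site"
    # comm -23 keeps only the lines unique to the first (reference) file.
    LC_ALL=C comm -23 "$ref" "$site"
    rm -f "$ref" "$site"
}
```

Running it with the xCAT template pkglist as the first argument and the production pkglist as the second would print one missing package name per line; empty output means nothing from the template is missing.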

The otherpkgs exit with code 9 was caused by a missing tag inside the autoyast XML file. We like to keep the image as small as possible and always run zypper with option --no-recommends that maps to the tag

<software>
  <install_recommended config:type="boolean">false</install_recommended>
  . . .
</software>

I added that (at my own risk) to the compute.sle15.tmpl. Because the installation repositories were created with this option, the installation had failed without it. The workaround solved our issue, which is 'customer' specific.
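
To confirm whether a given AutoYaST template carries this setting, a plain text check may be enough. This is a sketch only: the function name has_no_recommends is hypothetical, and it assumes the element is written as in the snippet above, possibly with spaces, tabs, or line breaks around the value. It does not parse the XML.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the template file $1 contains
#   <install_recommended config:type="boolean">false</install_recommended>
# Newlines and tabs are turned into spaces first, so the match also works
# when the value and the closing tag sit on separate lines.
has_no_recommends() {
    tr '\n\t' '  ' < "$1" |
        grep -q '<install_recommended config:type="boolean"> *false *</install_recommended>'
}
```

For example, has_no_recommends compute.sle15.tmpl && echo "tag present" would report the tag only if it is set to false in that file.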

Many thanks for your support, and apologies for the noise caused by an erroneous pkglist file. I think you can close the ticket.

wrussian avatar Aug 16 '22 10:08 wrussian