xcat-core
postscript remoteshell loops for a long time
This fault is probably on my side, but I cannot figure out what the cause is. Hopefully you can point me in the right direction. When a node boots, it eventually reaches the state "xcat.deployment.postbootscript: postbootscript start..: remoteshell". While running this script, it falls into a loop because it cannot retrieve the SSH keys from the xCAT server. I enabled xcatdebug to shed some light on what is going on. This is what I see in /var/log/xcat/xcat.log on the client compute node:
Mon Mar 25 08:30:47 CET 2024 [info]: xcat.deployment.postbootscript: postbootscript start..: remoteshell
+ '[' -n xcat.deployment.postbootscript ']'
+ log_label=xcat.deployment.postbootscript
+ umask 0077
+ '[' -f /etc/os-release ']'
+ cat /etc/os-release
+ grep -i -e '^NAME=[ "'\'']*Cumulus Linux[ "'\'']*$'
++ uname -s
++ tr A-Z a-z
+ '[' linux = linux ']'
++ dirname ./remoteshell
+ str_dir_name=.
+ . ./xcatlib.sh
++ declare -a array_nic_params
++ declare -a array_extra_param_names
++ declare -a array_extra_param_values
+ '[' -e /etc/xCATMN ']'
+ '[' -n '' ']'
++ uname -s
+ '[' Linux = AIX ']'
+ master=10.10.0.5
+ useflowcontrol=0
+ '[' '' = YES ']'
+ '[' '' = yes ']'
+ '[' '' = 1 ']'
+ '[' -r /etc/ssh/sshd_config ']'
+ logger -t xcat.deployment.postbootscript -p local4.info 'remoteshell: setup /etc/ssh/sshd_config and ssh_config'
+ cp /etc/ssh/sshd_config /etc/ssh/sshd_config.ORIG
+ sed -i '/X11Forwarding /d' /etc/ssh/sshd_config
+ echo 'X11Forwarding yes'
+ sed -i '/MaxStartups /d' /etc/ssh/sshd_config
+ echo 'MaxStartups 1024'
+ '[' '' = 1 ']'
+ '[' -r /etc/ssh/ssh_config ']'
+ sed -i '/StrictHostKeyChecking /d' /etc/ssh/ssh_config
+ echo 'StrictHostKeyChecking no'
+ xcatpost=xcatpost
+ '[' -d /xcatpost/_ssh ']'
+ logger -p local4.info -t xcat.deployment.postbootscript 'Install: setup root .ssh'
+ cd /xcatpost/_ssh
+ mkdir -p /root/.ssh
+ cp -f authorized_keys copy.sh /root/.ssh
+ cd -
+ chmod 700 /root/.ssh
+ chmod 600 /root/.ssh/authorized_keys /root/.ssh/copy.sh
+ '[' '!' -x /usr/bin/openssl ']'
+ CREDPID=3021
+ sleep 1
+ allowcred.awk
+ '[' 0 = 1 ']'
+ getcredentials.awk ssh_dsa_hostkey
+ grep -E -v '</{0,1}xcatresponse>|</{0,1}serverdone>'
+ sed -e 's/&lt;/</' -e 's/&gt;/>/' -e 's/&amp;/&/' -e 's/&quot;/"/' -e 's/&apos;/'\''/'
+ grep -E '<error>' /tmp/ssh_dsa_hostkey
+ '[' 1 -ne 0 ']'
+ cat /tmp/ssh_dsa_hostkey
+ grep -E -v '</{0,1}errorcode>|/{0,1}data>|</{0,1}content>|</{0,1}desc>'
+ logger -t xcat.deployment.postbootscript -p local4.info 'remoteshell: getting ssh_host_dsa_key'
+ MAX_RETRIES=10
+ RETRY=0
++ cat /etc/ssh/ssh_host_dsa_key
+ MYCONT=
+ '[' -z '' ']'
+ '[' 0 = 0 ']'
+ let SLI=31275%10
+ let SLI=SLI+10
+ sleep 15
+ RETRY=1
+ '[' 1 -eq 10 ']'
+ '[' 0 = 1 ']'
+ getcredentials.awk ssh_dsa_hostkey
+ grep -v '<'
+ sed -e 's/&lt;/</' -e 's/&gt;/>/' -e 's/&amp;/&/' -e 's/&quot;/"/' -e 's/&apos;/'\''/'
++ cat /etc/ssh/ssh_host_dsa_key
+ MYCONT=
+ '[' -z '' ']'
+ '[' 0 = 0 ']'
+ let SLI=13752%10
+ let SLI=SLI+10
+ sleep 12
+ RETRY=2
+ '[' 2 -eq 10 ']'
+ '[' 0 = 1 ']'
+ getcredentials.awk ssh_dsa_hostkey
+ grep -v '<'
+ sed -e 's/&lt;/</' -e 's/&gt;/>/' -e 's/&amp;/&/' -e 's/&quot;/"/' -e 's/&apos;/'\''/'
++ cat /etc/ssh/ssh_host_dsa_key
+ MYCONT=
+ '[' -z '' ']'
+ '[' 0 = 0 ']'
+ let SLI=341%10
+ let SLI=SLI+10
+ sleep 11
+ RETRY=3
+ '[' 3 -eq 10 ']'
+ '[' 0 = 1 ']'
+ getcredentials.awk ssh_dsa_hostkey
+ grep -v '<'
+ sed -e 's/&lt;/</' -e 's/&gt;/>/' -e 's/&amp;/&/' -e 's/&quot;/"/' -e 's/&apos;/'\''/'
++ cat /etc/ssh/ssh_host_dsa_key
+ MYCONT=
+ '[' -z '' ']'
+ '[' 0 = 0 ']'
+ let SLI=29739%10
+ let SLI=SLI+10
+ sleep 19
+ RETRY=4
+ '[' 4 -eq 10 ']'
...
Meanwhile the server logs:
Mar 25 08:40:04 mgmtnode xcat[263816]: DEBUG xcatd: connection from node035
Mar 25 08:40:04 mgmtnode xcat[263816]: DEBUG xcatd: open new process : xcatd SSL: getcredentials for node035
Mar 25 08:40:04 mgmtnode xcat[263816]: INFO xCAT: Allowing getcredentials ssh_host_dsa_key from node035
Mar 25 08:40:04 mgmtnode xcat[263817]: DEBUG xcatd: dispatch request 'getcredentials ssh_host_dsa_key' to plugin 'credentials'
Mar 25 08:40:04 mgmtnode xcat[263817]: DEBUG xcatd: handle request 'getcredentials' by plugin 'credentials''s process_request
Mar 25 08:40:04 mgmtnode xcat[263817]: ERR The node (node035) is not ready, ignore it.
Mar 25 08:40:05 mgmtnode xcat[263816]: DEBUG xcatd: close connection with node035
Mar 25 08:40:15 mgmtnode xcat[263824]: DEBUG xcatd: connection from node035
Mar 25 08:40:15 mgmtnode xcat[263824]: DEBUG xcatd: open new process : xcatd SSL: getcredentials for node035
Mar 25 08:40:15 mgmtnode xcat[263824]: INFO xCAT: Allowing getcredentials ssh_host_dsa_key from node035
Mar 25 08:40:15 mgmtnode xcat[263825]: DEBUG xcatd: dispatch request 'getcredentials ssh_host_dsa_key' to plugin 'credentials'
Mar 25 08:40:15 mgmtnode xcat[263825]: DEBUG xcatd: handle request 'getcredentials' by plugin 'credentials''s process_request
Mar 25 08:40:15 mgmtnode xcat[263825]: ERR The node (node035) is not ready, ignore it.
Mar 25 08:40:15 mgmtnode xcat[263824]: DEBUG xcatd: close connection with node035
Searching the web, I found that this is the command run by the remoteshell script. Unfortunately, running it manually gives an empty result:
USEOPENSSLFORXCAT=yes XCATSERVER=10.10.0.5:3001 /xcatpost/getcredentials.awk ssh_dsa_hostkey
<xcatresponse>
<serverdone></serverdone>
</xcatresponse>
While the loop runs, I can check the /tmp directory: the key file is there, but empty (probably because empty data was redirected to that file):
[root@node035 ~]# ll /tmp/
-rwxr-xr-x 1 root root 39609 Mar 25 09:30 jjFPgeyQIO.dsh
drwxr-xr-x 2 root root 40 Mar 25 08:30 postage
-rw------- 1 root root 0 Mar 25 09:30 ssh_dsa_hostkey
drwx------ 3 root root 60 Mar 25 09:30 systemd-private-4eea0db9428746cc9942ea2c6e404a84-chronyd.service-XKbOBQ
-rw-r--r-- 1 root root 101021 Mar 25 09:30 wget.log
So I tried to predeploy the keys into the image via syncfiles. This worked, because I can ssh into the node while it boots, but the loop still persists, so I guess the problem is not the key itself. The correct keys are in fact still there when the node has finally finished booting. I guess this is because a final "syncfiles" run at boot time overwrites the keys that were freshly generated because the remoteshell script failed.
What additional information could I provide to help fix this issue?
Thank you in advance!
I'm seeing the same or a similar issue and am trying to debug it now. For me it looks like an issue with getcredentials.awk, but I am not getting much debug info. For now, I have edited the remoteshell script and set MAX_RETRIES from 10 to 1. This decreases the wait to a more reasonable amount. Not sure if you've come up with a different workaround or if you've figured out what's going on.
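To illustrate why lowering MAX_RETRIES shortens the wait, here is a minimal sketch of the retry pattern visible in the xtrace output above. It is not the actual remoteshell source: `fetch_key` and `retry_fetch` are hypothetical names, and `fetch_key` is a stand-in for the `getcredentials.awk` call that always returns nothing, as it did on the affected node.

```shell
#!/bin/bash
# Sketch of the retry pattern seen in the trace; NOT the real remoteshell code.

# Hypothetical stand-in for "getcredentials.awk ssh_dsa_hostkey".
# It prints nothing, mimicking the empty response the node received.
fetch_key() { :; }

# retry_fetch MAX_RETRIES [dry]
# Prints "<retries> <total seconds waited>". With "dry" as the second
# argument the delays are only added up, not actually slept.
retry_fetch() {
    local max=$1 mode=${2:-} retry=0 total=0 key delay
    while :; do
        key=$(fetch_key)
        [ -n "$key" ] && break
        # Random delay of 10..19 s, as in "let SLI=...%10; let SLI=SLI+10"
        delay=$(( RANDOM % 10 + 10 ))
        total=$(( total + delay ))
        [ "$mode" = dry ] || sleep "$delay"
        retry=$(( retry + 1 ))
        [ "$retry" -eq "$max" ] && break
    done
    echo "$retry $total"
}

retry_fetch 10 dry   # default seen in the trace
retry_fetch 1 dry    # the workaround described above
```

With 10 retries the loop waits between 100 and 190 seconds for each key request that never succeeds; with 1 retry it waits between 10 and 19 seconds. The workaround only shortens the delay and does not make the key retrieval succeed.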
Would you add the output of the following to the initial post, for additional info:
lsdef node035
nodestat node035
Thank you for coming back to us, here is the requested output:
lsdef node035
Object name: node035
appstatus=xend=down,sshd=up,rdp=down,https=down,pbs=up,msrpc=down
appstatustime=07-09-2020 06:33:56
arch=x86_64
bmc=node035.ipmi
bmcport=3
chain=runcmd=bmcsetup,shell
chassis=MyChassisID-123
consoleenabled=1
currchain=shell
currstate=netboot rocky8.8-x86_64-compute
groups=compute_192,compute,intel,rack02,all,compute_40c
height=1
ip=10.10.1.35
mac=a4:bf:01:47:8b:c1!node035.cluster|a4:bf:01:47:8b:c5!node035.ipmi
mgt=ipmi
netboot=xnba
nicaliases.ib0=node035
nicaliases.ipmi=node035
nicips.eno1=10.10.1.35
nicips.ipmi=10.11.1.35
nicips.ib0=10.12.1.35
nicnetworks.eno1=10_10_0_0-255_255_0_0
nicnetworks.ipmi=10_11_0_0-255_255_0_0
nicnetworks.ib0=10_12_0_0-255_255_0_0
nictypes.eno1=Ethernet
nictypes.ipmi=bmc
nictypes.ib0=Infiniband
os=rocky8.8
postbootscripts=otherpkgs,setroute
postscripts=setupntp,syslog,remoteshell,syncfiles,org_final,confignetwork,setroute
profile=compute
provmethod=rocky8.8-x86_64-netboot-compute
rack=rack02
routenames=defgw10_10_0_5
serial=253089-1
serialport=0
serialspeed=115200
slot=1
status=booted
statustime=03-26-2024 16:03:39
unit=35
updatestatus=failed
updatestatustime=03-26-2024 15:38:18
nodestat node035
node035: sshd
I was able to resolve this issue last week. The root cause for me was that the booting node had firewalld enabled and the internal interface was assigned to the public zone, so no incoming communication was possible. This is new behavior with Rocky Linux 8, or possibly with all RHEL 8 derivatives. I was coming from CentOS 7, which had firewalld disabled, afaik. If this fixes the issue for @rlcto, then this issue can be closed. Sorry for the trouble!
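For anyone checking for the same cause, a minimal sketch of how the zone assignment could be inspected on the node. It assumes firewalld is the active firewall and that eno1 is the provisioning interface (taken from nicips.eno1 in the lsdef output above). The helper names `iface_zone` and `trust_iface_cmd` and the `FWCMD` variable are made up for this example, and choosing the trusted zone is just one option.

```shell
#!/bin/bash
# Assumption: firewalld is in use. FWCMD is a hypothetical override so the
# helpers can be exercised on a machine without firewalld installed.
FWCMD=${FWCMD:-firewall-cmd}

# Print the zone an interface is currently bound to.
iface_zone() {
    "$FWCMD" --get-zone-of-interface="$1"
}

# Print (do not run) the command that would move an interface to the
# "trusted" zone permanently, so it can be reviewed before use.
trust_iface_cmd() {
    printf '%s --permanent --zone=trusted --change-interface=%s\n' "$FWCMD" "$1"
}

# Example, on the compute node:
#   iface_zone eno1          # reported "public" in the case described above
#   trust_iface_cmd eno1     # review the output, run it, then: firewall-cmd --reload
# Alternative: disable firewalld entirely with
#   systemctl disable --now firewalld
```

On a stateless netboot image, the change would presumably have to be made in the image itself (or in a postscript that runs before remoteshell) rather than on the running node, since changes made on the node are lost on reboot.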