xcat-core
postscript syncfiles return with 1
Running xCAT 2.16.1 on CentOS 8.3
Compute node installs, but postscript syncfiles return with 1.
Other postscripts complete correctly. Mellanox OFED, etc.
Also, it's getting some files from somewhere: /etc/passwd is current, but the slurm.conf and munge.key files defined in synclists=/install/post/otherpkgs/slurm.synclist are not synced.
[root@demo~]# cat /install/post/otherpkgs/slurm.synclist
/etc/slurm/slurm.conf -> /etc/slurm/slurm.conf
/etc/munge/munge.key -> /etc/munge/munge.key
If I try to run it manually on the node, I get this:
[root@c001 ~]# /xcatpost/syncfiles
awk: /xcatpost/startsyncfiles.awk:25: fatal: can't open two way pipe `/inet/tcp/0/127.0.0.1/400' for input/output (Address family not supported by protocol)
awk: /xcatpost/startsyncfiles.awk:25: fatal: can't open two way pipe `/inet/tcp/0/127.0.0.1/400' for input/output (Address family not supported by protocol)
If I run this on the master, it works: xdcp compute -F /install/post/otherpkgs/slurm.synclist
I never had this issue with CentOS 7.x
I see the same behaviour on 2.16.1 and RHELS8.3. I investigated it briefly and found that:
- For whatever reason, the test that checks for /usr/bin/openssl in runxcatpost fails (race condition?)
- This causes startsyncfiles.awk to fall back to using /inet/tcp
- /inet/tcp fails for some reason
My workaround was to move the syncfiles script from the postscripts to postbootscripts .
Perhaps, it takes longer to get OpenSSL installed on CentOS 8.3.
Could you test whether it is helpful to add the following loop before if (!system("test -f openssl")) in startsyncfiles.awk?
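count = 0
# Poll once per second, up to 10 times, until /usr/bin/openssl exists.
# system() returns the exit status of test: 0 once the file is found.
while (system("test -f /usr/bin/openssl") && count < 10) {
    system("sleep 1")
    count += 1
}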
The number of iterations can be adjusted for testing purposes.
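To test the timing theory from a shell prompt rather than inside the awk script, the same polling idea can be sketched as a shell function. This is only an illustration: wait_for_file is a hypothetical helper name, not something shipped with xCAT.

```shell
#!/bin/sh
# wait_for_file PATH [TRIES]
# Poll once per second until PATH exists or TRIES attempts are used up.
# Returns 0 if the file exists at the end, 1 otherwise.
# (Hypothetical helper for testing; not part of xCAT.)
wait_for_file() {
    path=$1
    tries=${2:-10}
    count=0
    while [ ! -f "$path" ] && [ "$count" -lt "$tries" ]; do
        sleep 1
        count=$((count + 1))
    done
    [ -f "$path" ]
}
```

For example, wait_for_file /usr/bin/openssl 10 || echo "openssl still missing" >&2 reports whether the binary appeared within roughly ten seconds.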
In addition, if (!system("test -f openssl")) should be replaced by if (system("test -f /usr/bin/openssl")). That is two changes:
(1) Without the full path, test -f openssl looks for a file named openssl in the current directory, so in practice it returns 1 (not found). The negation of 1 is 0 (false), so the error message is never printed.
(2) If /usr/bin/openssl is NOT there, test -f /usr/bin/openssl returns 1, which awk treats as true. No negation is needed, and the error message is printed accordingly.
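The return values described in (1) and (2) can be checked directly on a node. The sketch below prints what awk's system() returns for each form of the test; awk_system_rc is a hypothetical helper name used only for this illustration.

```shell
#!/bin/sh
# Print the value awk's system() returns for a given shell command.
# (Hypothetical helper for illustration only.)
awk_system_rc() {
    awk -v cmd="$1" 'BEGIN { rc = system(cmd); print rc }'
}

# Relative path: resolved against the current directory, so normally "not found".
awk_system_rc "test -f openssl"
# Absolute path: 0 if openssl is installed there, 1 if it is not.
awk_system_rc "test -f /usr/bin/openssl"
```

A result of 1 is "true" to awk, which is why the corrected condition needs no negation to reach the error branch.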
@aaronhcarr and @buzh: Do you have CentOS 8.3 or RHEL 8.3 on both MN and CN? Do you see this problem with diskfull and/or diskless provisioning?
@buzh could you describe more about how runxcatpost is run? Is it part of the postscript processing during node provisioning? Or do you run it with updatenode after the node is booted?
I don't see /inet/tcp on my CentOS 8.3, RHEL 7.6 and RHEL8.4 nodes. Do you have it on yours? If not, /inet/tcp is likely no longer available.
I was installing CentOS 8.3 on both MN and computes.
The installs on computes were stateful.
@aaronhcarr: Is the suggested while loop above helpful to your provisioning?
/usr/bin/openssl is called by runxcatpost directly and by syncfiles indirectly through startsyncfiles.awk.
@aaronhcarr and @buzh : in order to understand the syncfile problem and runxcatpost better, it would be useful if you can provide additional log information by adding "set -x" in your bash file, and by going through your process of reproducing the problem.
In the case of startsyncfiles.awk, an additional print statement for testing openssl can tell whether openssl is available when the awk script runs.
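An alternative to editing the script to add "set -x" is to run it under bash -x and capture the trace in a file, which gives the same output. This is a minimal sketch, assuming the postscript is a bash script; trace_script and the log path in the usage note are hypothetical names.

```shell
#!/bin/sh
# trace_script LOGFILE SCRIPT [ARGS...]
# Run SCRIPT under "bash -x" and write the xtrace (and any other stderr
# output) to LOGFILE. Returns the exit status of SCRIPT.
# (Hypothetical helper; the names are illustrative.)
trace_script() {
    log=$1
    shift
    bash -x "$@" 2>"$log"
}
```

On the compute node this could be invoked as trace_script /tmp/syncfiles.trace /xcatpost/syncfiles, and the resulting file attached to the issue.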
Usually, we invoke syncfiles on compute nodes from the MN side. Some setup by MN is required.
And /inet/tcp seems to be a very old function and is not found in recent distros.
For me, this was mostly important for reinstalls, since the nodes were not able to initiate that sync.
So if you reinstalled a node, or a batch of nodes, instead of immediately becoming available, you would have to manually sync the files from the MN so that they would get slurm.conf, etc before they could be back in service.
In CentOS 7.x, that all worked well. Nodes would automatically go back in service after install. In CentOS 8.x, that doesn't happen anymore.
@aaronhcarr Could you please provide more details about your installation process?
For discussion purposes, let's assume that you have two nodes: Node1 (the management node) and Node 2 (a compute node) and you want to install CentOS 8.3 on both.
Is it correct that you have no problem installing CentOS 8.3 on Node1, but have a problem of getting files specified on synclists on Node 2?
Once you saw the problem on Node2, you were able to fix it by manually running xdcp compute -F /install/post/otherpkgs/slurm.synclist on Node1. Is it right?
Is the problem about a failure of getting files on synclists on Node2 through rinstall Node1 osimage=<CentOS 8.3 image>? If so, getting more log information as I suggested above can help.
If the process I described above is NOT what you did, please describe the detailed steps of reproducing it.
@buzh mentioned an openssl problem. Do you think your problem is also openssl related?
No issues installing CentOS 8.3 and xCAT 2.16.1 on node1 (the one that you named node1).
And yes, that is basically it. Install node2 via nodeset node2 osimage
And yes, if I run xdcp compute -F from node1, it will push the files to node2. However, during the osimage process, there's an xCAT built-in function that calls syncfiles (I'm assuming this because I've never manually configured it to do so). It worked on CentOS 7 (as I mentioned, I would know a node had completed reimaging because it would be up in slurm), but in CentOS 8 I had to wait for the osimage process to complete, then manually run xdcp compute -F followed by pdsh -w node[001-00X] systemctl restart slurmd to achieve the same thing.
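The manual workaround can be wrapped in a small script run on the management node. This is a sketch under assumptions: the node group "compute" and the synclist path are the ones quoted in this thread, while resync_and_restart, run, and the DRY_RUN switch are hypothetical names added for illustration.

```shell
#!/bin/sh
# Assumptions: node group and synclist path as quoted in this thread.
SYNCLIST=${SYNCLIST:-/install/post/otherpkgs/slurm.synclist}
NODEGROUP=${NODEGROUP:-compute}

# run CMD [ARGS...]: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# resync_and_restart HOSTLIST
# Push the synclist files from the management node, then restart slurmd
# on the hosts given as a pdsh host list.
resync_and_restart() {
    run xdcp "$NODEGROUP" -F "$SYNCLIST" &&
    run pdsh -w "$1" systemctl restart slurmd
}
```

Setting DRY_RUN=1 before calling resync_and_restart with the pdsh host list prints the two commands without executing them, which is a quick way to check the arguments first.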
> And /inet/tcp seems to be a very old function and is not found in recent distros.
/inet/tcp is a virtual file present in GNU awk to enable networking. You will not see this file in your shell. See https://www.gnu.org/software/gawk/manual/html_node/TCP_002fIP-Networking.html
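Because /inet/tcp exists only inside GNU awk, a more useful check than looking for the file is to ask which awk implementation a node is running. A minimal sketch, where awk_flavor is a hypothetical helper name:

```shell
#!/bin/sh
# Print "gawk" if the default awk identifies itself as GNU Awk,
# otherwise print "other". (Hypothetical helper for illustration.)
awk_flavor() {
    if awk --version 2>/dev/null </dev/null | head -n 1 | grep -q '^GNU Awk'; then
        echo gawk
    else
        echo other
    fi
}

awk_flavor
```

If this prints "other", the /inet/tcp special files are not available to awk scripts on that node at all.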
Similar issue observed when deploying a stateful RHEL 8.4 OS Image with xCAT 2.16.3:
- File synchronization fails during deployment
- File synchronization succeeds when called through the updatenode --sync command afterwards
The issue was fixed by explicitly adding the openssl package to the Package List of the OS Image.
Maybe the openssl package should be added to xCAT default Package List for compute nodes?
Credit: @elc3leb
@nicolas-tallet @aaronhcarr @elc3leb
I am having trouble reproducing this issue. When performing a stateful install of RHEL 8.5 on a compute node in a flat environment with the default pkglist, openssl gets installed on the node through dependency resolution without being explicitly specified in the pkglist, and the files get synced to the compute node.
[root@f6u13k19 ~]# lsdef -t osimage rhels8.5.0-ppc64le-install-compute
Object name: rhels8.5.0-ppc64le-install-compute
imagetype=linux
osarch=ppc64le
osdistroname=rhels8.5.0-ppc64le
osname=Linux
osvers=rhels8.5.0
otherpkgdir=/install/post/otherpkgs/rhels8.5.0/ppc64le
pkgdir=/install/rhels8.5.0/ppc64le
pkglist=/opt/xcat/share/xcat/install/rh/compute.rhels8.pkglist
profile=compute
provmethod=install
synclists=/install/post/otherpkgs/slurm.synclist
template=/opt/xcat/share/xcat/install/rh/compute.rhels8.tmpl
[root@f6u13k19 ~]# cat /opt/xcat/share/xcat/install/rh/compute.rhels8.pkglist
@^minimal-environment
chrony
net-tools
nfs-utils
openssh-server
rsync
util-linux
wget
python3
tar
bzip2
perl-interpreter
[root@f6u13k19 ~]# cat /install/post/otherpkgs/slurm.synclist
/etc/slurm/slurm.conf -> /etc/slurm/slurm.conf
/etc/munge/munge.key -> /etc/munge/munge.key
[root@f6u13k19 ~]# rinstall f6u13k20 osimage=rhels8.5.0-ppc64le-install-compute
[root@f6u13k19 ~]# xdsh f6u13k20 "rpm -q openssl"
f6u13k20: openssl-1.1.1k-4.el8.ppc64le
[root@f6u13k19 ~]# xdsh f6u13k20 "rpm -qa | wc -l"
f6u13k20: 453
[root@f6u13k19 ~]# xdsh f6u13k20 "ls /etc/slurm/slurm.conf;ls /etc/munge/munge.key"
f6u13k20: /etc/slurm/slurm.conf
f6u13k20: /etc/munge/munge.key
[root@f6u13k19 ~]# xdsh f6u13k20 "grep syncfiles /var/log/xcat/xcat.log"
f6u13k20: Wed May 25 13:06:34 EDT 2022 [info]: xcat.deployment.postscript: Running postscript: syncfiles
f6u13k20: Wed May 25 13:06:35 EDT 2022 [info]: xcat.deployment.postscript: postscript syncfiles return with 0
Can you share your pkglist?
When using your pkglist without explicitly adding openssl to it, do you observe that openssl does not get installed on the compute node?
Are you using a flat (management node + compute nodes) or hierarchical (management node + service nodes + computes nodes) configuration?
Hello @besawn
Tomorrow is a bank holiday in France. I will check the pkglist on Friday and reply to you.
But from memory, I had to add this package to the pkglist explicitly (for us it was both stateless and stateful images).
I'll give you all the details on Friday.
@besawn: Actually, we noticed afterwards that our OS Image Package List was missing the include of the template Package List.
Adding:
#INCLUDE:/opt/xcat/share/xcat/netboot/rh/compute.rhels8.ppc64le.pkglist#
to our OS Image Package List made it unnecessary to explicitly add the openssl package, and the issue was not present anymore.
So our workaround was only required because we forgot to include the template Package List. It can be disregarded in order to focus on @aaronhcarr's primary issue.
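To spot a missing include like this one, a custom pkglist can be inspected for its #INCLUDE lines and for an explicit package entry. This is a sketch only: list_includes and pkg_listed are hypothetical helper names, and the check looks at the file text without resolving package dependencies.

```shell
#!/bin/sh
# list_includes PKGLIST
# Print the path of every "#INCLUDE:<path>#" line in the pkglist.
# (Hypothetical helper for illustration.)
list_includes() {
    sed -n 's/^#INCLUDE:\(.*\)#$/\1/p' "$1"
}

# pkg_listed PKGLIST NAME
# Return 0 if NAME appears as a line of its own in the pkglist.
# (Hypothetical helper; does not follow includes or dependencies.)
pkg_listed() {
    grep -qxF -- "$2" "$1"
}
```

An empty result from list_includes on a custom pkglist suggests the template Package List is not being pulled in.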
I spent some time trying to get to the bottom of this, and believe I found the root cause, at least for Rocky 8 and 9 and RHEL 8 and 9. See https://github.com/xcat2/xcat-core/pull/7329
As an aside, I spent way too much time trying to debug the original awk code and decided to write a drop-in replacement for syncfiles that should work on any modern Linux distribution. For anyone interested, it can be found here