xcat-core
Unable to get kernel image in netboot
Hi,
I've been trying to set up a stateless cluster following the OpenHPC recipes on CentOS 8, but have run into an issue. When booting over PXE, the node is able to fetch its boot script, but the request for the kernel image returns an HTTP error, as shown in the image below:
Running xcatprobe on the master node reports that everything is OK. The only warning arises because the nameserver 150.xxx.x.x is on our public network, which doesn't have access to the cluster's internal interface.
[root@mn nodes]# xcatprobe xcatmn -i enp8s0
[mn]: Checking all xCAT daemons are running... [ OK ]
[mn]: Checking xcatd can receive command request... [ OK ]
[mn]: Checking 'site' table is configured... [ OK ]
[mn]: Checking provision network is configured... [ OK ]
[mn]: Checking 'passwd' table is configured... [ OK ]
[mn]: Checking important directories(installdir,tftpdir) are configured... [ OK ]
[mn]: Checking SELinux is disabled... [ OK ]
[mn]: Checking HTTP service is configured... [ OK ]
[mn]: Checking TFTP service is configured... [ OK ]
[mn]: Checking DNS service is configured... [WARN]
[mn]: DNS nameserver 150.xxx.x.x can not resolve 192.168.1.101
[mn]: Checking DHCP service is configured... [ OK ]
[mn]: Checking NTP service is configured... [ OK ]
[mn]: Checking rsyslog service is configured... [ OK ]
[mn]: Checking firewall is disabled... [ OK ]
[mn]: Checking minimum disk space for xCAT ['/tmp' needs 1GB;'/install' needs 10GB;'/var' needs 1GB]... [ OK ]
[mn]: Checking Linux ulimits configuration... [ OK ]
[mn]: Checking network kernel parameter configuration... [ OK ]
[mn]: Checking xCAT daemon attributes configuration... [ OK ]
[mn]: Checking xCAT log is stored in /var/log/xcat/cluster.log... [ OK ]
[mn]: Checking xCAT management node IP: <192.168.1.101> is configured to static... [ OK ]
[mn]: Checking dhcpd.leases file is less than 100M... [ OK ]
[mn]: Checking DB packages installation... [ OK ]
=================================== SUMMARY ====================================
[MN]: Checking on MN... [ OK ]
Checking DNS service is configured... [WARN]
DNS nameserver 150.162.1.1 can not resolve 192.168.1.101
The DHCP server also seems to be working fine:
[root@mn nodes]# xcatprobe detect_dhcpd -i enp8s0 -m 18:66:da:1d:63:0e
Start to detect DHCP, please wait 10 seconds [INFO]
++++++++++++++++++++++++++++++++++ [INFO]
There are 1 servers replied to dhcp discover. [INFO]
Server:192.168.1.101 assign IP [192.168.1.102]. The next server is [192.168.1.101]! [INFO]
++++++++++++++++++++++++++++++++++ [INFO]
I'm also able to access all the files required by the PXE boot script:
#!gpxe
#netboot centos-stream8-x86_64-compute
imgfetch -n kernel http://${next-server}:80/tftpboot/xcat/osimage/centos-stream8-x86_64-netboot-compute/kernel
imgload kernel
imgargs kernel imgurl=http://${next-server}:80//install/netboot/centos-stream8/x86_64/compute/rootimg.cpio.gz XCAT=${next-server}:3001 NODE=c1 FC=0 XCATHTTPPORT=80 console=tty0 console=ttyS0,115200 BOOTIF=01-${netX/machyp}
imgfetch -n initrd http://${next-server}:80/tftpboot/xcat/osimage/centos-stream8-x86_64-netboot-compute/initrd-stateless.gz
imgexec kernel
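One way to repeat this check systematically is to pull every HTTP URL out of the boot script and request each one. The following is a minimal Python sketch, assuming `${next-server}` resolves to 192.168.1.101 (the next server reported by `xcatprobe detect_dhcpd` above). `extract_urls` and `http_status` are hypothetical helper names, not part of xCAT.

```python
import re
import urllib.request

# Boot script lines copied from the post (kernel arguments after the image URL trimmed).
BOOT_SCRIPT = """\
imgfetch -n kernel http://${next-server}:80/tftpboot/xcat/osimage/centos-stream8-x86_64-netboot-compute/kernel
imgargs kernel imgurl=http://${next-server}:80//install/netboot/centos-stream8/x86_64/compute/rootimg.cpio.gz XCAT=${next-server}:3001
imgfetch -n initrd http://${next-server}:80/tftpboot/xcat/osimage/centos-stream8-x86_64-netboot-compute/initrd-stateless.gz
"""

def extract_urls(script, next_server):
    """Return every http:// URL in the script with ${next-server} substituted."""
    expanded = script.replace("${next-server}", next_server)
    return re.findall(r"http://\S+", expanded)

def http_status(url, timeout=5):
    """Send a HEAD request and return the HTTP status code (raises on 4xx/5xx)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status

# Assumption: the next server is the management node at 192.168.1.101.
urls = extract_urls(BOOT_SCRIPT, "192.168.1.101")
for url in urls:
    print(url)
```

Calling `http_status(url)` for each entry from a machine on the provisioning network should return 200 if the files are reachable. Keep in mind that a request from a regular client does not exercise xNBA's own HTTP code, so a success here only rules out the server side.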
Wireshark shows a TFTP error that aborts the transaction:
Capturing on 'enp8s0'
1 0.000000000 0.0.0.0 → 255.255.255.255 DHCP 590 DHCP Discover - Transaction ID 0xda1d630e
2 0.000200027 192.168.1.101 → 255.255.255.255 DHCP 342 DHCP Offer - Transaction ID 0xda1d630e
3 4.008654276 0.0.0.0 → 255.255.255.255 DHCP 590 DHCP Request - Transaction ID 0xda1d630e
4 4.008908360 192.168.1.101 → 255.255.255.255 DHCP 342 DHCP ACK - Transaction ID 0xda1d630e
5 4.016202763 Dell_1d:63:0e → Broadcast ARP 60 Who has 192.168.1.101? Tell 192.168.1.102
6 4.016221132 9c:53:22:48:50:bb → Dell_1d:63:0e ARP 42 192.168.1.101 is at 9c:53:22:48:50:bb
7 4.016258287 192.168.1.102 → 192.168.1.101 TFTP 73 Read Request, File: xcat/xnba.kpxe, Transfer type: octet, tsize=0
8 4.019157973 192.168.1.101 → 192.168.1.102 TFTP 56 Option Acknowledgement, tsize=67650
9 4.019200647 192.168.1.102 → 192.168.1.101 TFTP 60 Error Code, Code: Not defined, Message: TFTP Aborted
10 4.021014854 192.168.1.102 → 192.168.1.101 TFTP 78 Read Request, File: xcat/xnba.kpxe, Transfer type: octet, blksize=1456
11 4.023521541 192.168.1.101 → 192.168.1.102 TFTP 57 Option Acknowledgement, blksize=1456
If you guys have any clue of what it might be, I'd be very glad to hear it. Thx
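A note on frames 7 to 11 of the capture: PXE firmware commonly sends a first read request with `tsize=0` only to learn the file size, aborts it, and then re-requests the file with a `blksize` option, so the `TFTP Aborted` in frame 9 is probably part of normal negotiation rather than the failure itself. The sketch below builds both read requests following the RFC 1350 / RFC 2347 packet layout. `build_rrq` is a hypothetical helper written for illustration, and the 42-byte header overhead assumes plain Ethernet + IPv4 + UDP with no VLAN tag.

```python
import struct

def build_rrq(filename, mode="octet", options=None):
    """Build a TFTP read request: opcode 1, then NUL-terminated strings."""
    packet = struct.pack("!H", 1)                 # opcode 1 = RRQ
    packet += filename.encode("ascii") + b"\x00"
    packet += mode.encode("ascii") + b"\x00"
    for name, value in (options or {}).items():   # RFC 2347 option name/value pairs
        packet += name.encode("ascii") + b"\x00"
        packet += str(value).encode("ascii") + b"\x00"
    return packet

HEADERS = 14 + 20 + 8  # Ethernet + IPv4 + UDP (assumed, no VLAN tag)

size_probe = build_rrq("xcat/xnba.kpxe", options={"tsize": 0})
real_fetch = build_rrq("xcat/xnba.kpxe", options={"blksize": 1456})

print(len(size_probe) + HEADERS)  # 73, the frame length of frame 7
print(len(real_fetch) + HEADERS)  # 78, the frame length of frame 10
```

The capture stops at the second option acknowledgement, so it does not show the data blocks. Since the node goes on to fetch its boot script, the second transfer most likely completed.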
From any other machine on the same network segment as the failing node, would you try running wget or curl for http://192.168.1.101:80/tftpboot/xcat/osimage/centos-stream8-x86_64-netboot-compute/kernel ? (and the rootimg and initrd too)
Thanks for the reply @samveen.
From any other machine on the same network segment as the failing node, would you try running wget or curl for http://192.168.1.101:80/tftpboot/xcat/osimage/centos-stream8-x86_64-netboot-compute/kernel ? (and the rootimg and initrd too)
Yes, I'm able to get the files from a client on the same network via both wget and tftp. It seems that only xNBA is unable to get the files, which seems very strange to me.
[root@client ~]# wget http://192.168.1.101:80/tftpboot/xcat/osimage/centos-stream8-x86_64-netboot-compute/kernel
--2023-06-26 11:37:07-- http://192.168.1.101/tftpboot/xcat/osimage/centos-stream8-x86_64-netboot-compute/kernel
Connecting to 192.168.1.101:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10876640 (10M)
Saving to: ‘kernel’
kernel 100%[==============================================================================>] 10.37M --.-KB/s in 0.09s
2023-06-26 11:37:07 (112 MB/s) - ‘kernel’ saved [10876640/10876640]
[root@client ~]# tftp 192.168.1.101 -v
Connected to 192.168.1.101 (192.168.1.101), port 69
tftp> status
Connected to 192.168.1.101.
Mode: netascii Verbose: on Tracing: off Literal: off
Rexmt-interval: 5 seconds, Max-timeout: 25 seconds
tftp> get xcat/osimage/centos-stream8-x86_64-netboot-compute/kernel
getting from 192.168.1.101:xcat/osimage/centos-stream8-x86_64-netboot-compute/kernel to kernel [netascii]
Received 10953854 bytes in 3.2 seconds [27783387 bit/s]
The same occurs for rootimg and initrd.
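One detail in the transcript above: the `tftp` client is in `netascii` mode, and it reports 10953854 bytes received against 10876640 bytes over HTTP. That difference is consistent with netascii line-ending translation inflating a binary file on the wire, so the two byte counts are not directly comparable, and switching the client to `binary` mode is advisable before comparing downloads. Below is a minimal sketch of the netascii translation (LF becomes CR LF, a bare CR becomes CR NUL). `netascii_encode` is a hypothetical helper for illustration, not the code the tftp tools use.

```python
def netascii_encode(data: bytes) -> bytes:
    """Apply netascii translation: LF -> CR LF, CR -> CR NUL."""
    out = bytearray()
    for byte in data:
        if byte == 0x0A:
            out += b"\r\n"
        elif byte == 0x0D:
            out += b"\r\x00"
        else:
            out.append(byte)
    return bytes(out)

# Five arbitrary bytes, two of which (LF and CR) get expanded.
sample = bytes([0x00, 0x0A, 0x41, 0x0D, 0xFF])
encoded = netascii_encode(sample)
print(len(sample), len(encoded))  # 5 7

http_size = 10876640   # "Length" from the wget output
tftp_size = 10953854   # "Received" from the tftp output
print(tftp_size - http_size)  # 77214 extra bytes
```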
Looking at the error code as listed by xnba (iPXE), there seems to be something going on with httpd on the master when xnba requests the URL for the kernel, which causes the failure in the xnba HTTP core (in net/tcp/httpcore.c). Would you try and check webserver logs on the master to check what might be causing the requests to fail?
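A minimal sketch of that log check, assuming Apache's standard common/combined log format and the usual CentOS log location `/var/log/httpd/access_log` (both are assumptions, so adjust them to the actual httpd configuration). `requests_from` is a hypothetical helper, and the node address is the one handed out in the DHCP probe earlier in the thread.

```python
import os
import re

# Common/combined log prefix: client, identity, user, [time], "request", status, size
LOG_LINE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def requests_from(lines, client_ip):
    """Yield (time, request, status) for every log line sent by client_ip."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("client") == client_ip:
            yield match.group("time"), match.group("request"), int(match.group("status"))

ACCESS_LOG = "/var/log/httpd/access_log"  # assumed default path on CentOS
NODE_IP = "192.168.1.102"                 # address the DHCP server assigned to the node

if os.path.exists(ACCESS_LOG):
    with open(ACCESS_LOG, errors="replace") as log:
        for time, request, status in requests_from(log, NODE_IP):
            print(status, time, request)
```

If no lines from the node show up at all, the request may not be reaching httpd in the first place, and the httpd error log (typically `/var/log/httpd/error_log`) is the other place worth checking.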