rocky-tools
migration failure with MLNX_OFED_LINUX 4.9 (LTS) installed
System is installed with the following which are related to this issue:
- CentOS 8.5 (fully updated)
- Bright Cluster Manager 9.0-17 (fully updated)
- MLNX_OFED_LINUX 4.9 (LTS) (Bright installer)
This is the second migration attempt on the secondary head node of a development system. The node was re-imaged after an initial migration failed and had to be resolved manually.
The previous migration was eventually completed, and the system ran Rocky 8.5 with BCM 9.0-17 and MLNX_OFED_LINUX 4.9 (LTS) without issues. To confirm that all issues were resolved, the system was restored to its pre-migration state, the previously identified issues (extra installed kernels, leftover rhel8u0 kmods with no matching kernel, and missing dependencies) were fixed, and the migration was re-attempted.
It seems possible that adding --setopt=<reponame>.excludepkgs= options would resolve this (I may be able to investigate), and that this could be supported in a future version of migrate2rocky.sh.
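A minimal sketch of what such per-repository options could look like, assuming the conflicting packages are excluded from the baseos and appstream repositories. The helper name build_setopt_args is hypothetical and is not part of migrate2rocky.sh; the repository ids and package names are taken from the output in this issue.

```shell
# Print one --setopt option per repository id, excluding the same
# packages from each. Hypothetical helper, not part of migrate2rocky.sh.
# $1 = comma-separated package list; remaining arguments = repository ids.
build_setopt_args() (
    pkgs=$1
    shift
    for repo in "$@"; do
        printf '%s=%s.excludepkgs=%s\n' '--setopt' "$repo" "$pkgs"
    done
)

# Example (not run here): each printed line becomes one dnf argument.
# dnf $(build_setopt_args 'rdma-core,libibverbs' baseos appstream) distro-sync
```

The command substitution is left unquoted on purpose so that each line is passed as a separate argument, which only works as long as the package list contains no spaces or glob characters.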
Configuring dnf via /etc/dnf/dnf.conf to ignore the MLNX_OFED_LINUX packages may also resolve this (for example)...
# dnf check -v
Loaded plugins: builddep, changelog, config-manager, copr, debug, debuginfo-install, download, generate_completion_cache, groups-manager, needs-restarting, playground, repoclosure, repodiff, repograph, repomanage, reposync
DNF version: 4.7.0
cachedir: /var/cache/dnf
User-Agent: constructed: 'libdnf (CentOS Linux 8; generic; Linux.x86_64)'
Excludes in dnf.conf: tog-pegasus-devel mpi-selector.x86_64 mlnx-ofa_kernel kmod-mlnx-ofa_kernel mlnx-ofa_kernel-devel kmod-kernel-mft-mlnx knem kmod-knem ofed-scripts rdma-core libibverbs librdmacm libibumad infiniband-diags rdma-core-devel libibverbs-utils ibsim ibacm librdmacm-utils opensm-libs opensm opensm-devel opensm-static dapl dapl-devel dapl-devel-static dapl-utils perftest mstflint mft srp_daemon ibutils2 dump_pr ar_mgr qperf ucx ucx-devel sharp ucx-cma ucx-ib ucx-rdmacm ucx-knem hcoll openmpi mlnx-ethtool mlnx-iproute2 mlnxofed-docs libmthca-static compat-dapl-static compat-dapl-static-1.2.5 dapl-static libibverbs-rocee libibverbs-rocee-devel libibverbs-rocee-devel-static
...and this will be attempted again via manual resolution with existing Rocky repository configuration in place...
(DEV - 3HZVN23 - PASSIVE) [root@devmgr2 : migrate2rocky]# dnf repolist
repo id repo name
appstream Rocky Linux 8 - AppStream
baseos Rocky Linux 8 - BaseOS
devel Rocky Linux 8 - Devel WARNING! FOR BUILDROOT AND KOJI USE
epel Extra Packages for Enterprise Linux 8 - x86_64
epel-modular Extra Packages for Enterprise Linux Modular 8 - x86_64
extras Rocky Linux 8 - Extras
powertools Rocky Linux 8 - PowerTools
Any additional recommendations for resolving manually would be appreciated.
@pajamian With packages excluded...
# dnf -v check
Loaded plugins: builddep, changelog, config-manager, copr, debug, debuginfo-install, download, generate_completion_cache, groups-manager, needs-restarting, playground, repoclosure, repodiff, repograph, repomanage, reposync
DNF version: 4.7.0
cachedir: /var/cache/dnf
User-Agent: constructed: 'libdnf (Rocky Linux 8.5; generic; Linux.x86_64)'
Excludes in dnf.conf: ar_mgr, cm-docker, cm-etcd, cm-kubernetes118, compat-dapl-static, compat-dapl-static-1.2.5, dapl, dapl-devel, dapl-devel-static, dapl-static, dapl-utils, dump_pr, hcoll, ibacm, ibsim, ibutils2, infiniband-diags, kmod-kernel-mft-mlnx, kmod-knem, kmod-mlnx-ofa_kernel, knem, libibumad, libibverbs, libibverbs-rocee, libibverbs-rocee-devel, libibverbs-rocee-devel-static, libibverbs-utils, libmthca-static, librdmacm, librdmacm-utils, mft, mlnx-ethtool, mlnx-iproute2, mlnx-ofa_kernel, mlnx-ofa_kernel-devel, mlnxofed-docs, mpi-selector.x86_64, mstflint, ofed-scripts, openmpi, opensm, opensm-devel, opensm-libs, opensm-static, perftest, qperf, rdma-core, rdma-core-devel, sdsc_gsi-openssh, sdsc_gsi-openssh-clients, sdsc_gsi-openssh-server, sharp, slurm20*, srp_daemon, tog-pegasus, tog-pegasus-devel, ucx, ucx-cma, ucx-devel, ucx-ib, ucx-knem, ucx-rdmacm
...it appears a manual dnf distro-sync may be possible...
# dnf -y distro-sync 2>&1 | tee manual-dnf-distro-sync.log
Last metadata expiration check: 0:48:31 ago on Thu 10 Feb 2022 10:59:40 AM PST.
Dependencies resolved.
===============================================================================================================================
Package Arch Version Repository Size
===============================================================================================================================
Installing:
kernel x86_64 4.18.0-348.12.2.el8_5 baseos 7.0 M
kernel-core x86_64 4.18.0-348.12.2.el8_5 baseos 38 M
kernel-devel x86_64 4.18.0-348.12.2.el8_5 baseos 20 M
kernel-modules x86_64 4.18.0-348.12.2.el8_5 baseos 30 M
Upgrading:
acl x86_64 2.2.53-1.el8.1 baseos 80 k
apr-util x86_64 1.6.1-6.el8.1 appstream 104 k
...<snip>...
zlib-devel-1.2.11-17.el8.x86_64
zsh-5.5.1-6.el8_1.2.x86_64
zstd-1.4.4-1.el8.x86_64
Complete!
With this knowledge, it's likely the next migration will go a fair bit more smoothly, if not automatically.
Thanks for all the work you've put into migrate2rocky.sh.
@pajamian Final update for this issue report.
Migration went more smoothly on the primary head node of this system after excluding many/most non-{CentOS|Fedora} packages from the migration via an exclude=... entry in /etc/dnf/dnf.conf.
The excluded packages all come from local installs or from alternate repositories that migrate2rocky.sh cannot map or manage.
After migration and reboot, the repositories were re-enabled, the exclude=... entry was restored to the less restrictive system default, and any other packages needing an update were updated (there were none in the final instance).
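The save-and-restore part of that sequence could be sketched roughly as follows. The function names and the .pre-migration suffix are made up for illustration, and the append mirrors the >> used elsewhere in this issue rather than any behaviour of migrate2rocky.sh.

```shell
# Append a temporary exclude= line to a dnf config file, keeping a backup.
# Hypothetical helpers; the backup suffix is an arbitrary choice.
# $1 = config path (for example /etc/dnf/dnf.conf)
# $2 = space-separated package names
add_temp_exclude() {
    cp -p "$1" "$1.pre-migration" || return 1
    printf 'exclude=%s\n' "$2" >> "$1"
}

# Put the saved config back once the migration and reboot are done.
restore_dnf_conf() {
    mv "$1.pre-migration" "$1"
}
```

Working on a path argument rather than a hard-coded /etc/dnf/dnf.conf makes it possible to try the helpers on a copy of the file first, or on the config inside a chroot.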
The same general sequence was used to migrate chroots running via systemd-nspawn, although those were even easier because migrate2rocky.sh now handles the vault changes for CentOS, and clones of the chroot environments could be safely migrated and, if a failure occurred, thrown away.
The sequence of commands used to build this list of excludable packages was similar for the physical host and the chroot, and was...
chroot example
# ./migrate2rocky.sh -V
# ls -l /root/convert
total 1532
-rw-r--r-- 1 root root 101541 Feb 11 14:05 node-installer-rpm-list-begin.log
-rw-r--r-- 1 root root 1464834 Feb 11 14:05 node-installer-rpm-list-verified-begin.log
# echo "exclude=$(grep -Ev "centos|fedora" /root/convert/node-installer-rpm-list-begin.log | grep -v gpg-pubkey | column -s\| -t | awk '{print $1}' | tr '\n' ' ')" >> /etc/dnf/dnf.conf
# dnf check -v
Loaded plugins: builddep, changelog, config-manager, copr, debug, debuginfo-install, download, generate_completion_cache, groups-manager, needs-restarting, playground, repoclosure, repodiff, repograph, repomanage, reposync
DNF version: 4.7.0
cachedir: /var/cache/dnf
User-Agent: constructed: 'libdnf (CentOS Linux 8; generic; Linux.x86_64)'
Excludes in dnf.conf: MegaCli ar_mgr dapl-devel-static dapl-devel dapl-utils dapl dump_pr elrepo-release hcoll ibacm ibsim ibutils2 infiniband-diags kmod-bnxt_en kmod-elx-lpfc kmod-isert kmod-iser kmod-kernel-mft-mlnx kmod-knem kmod-megaraid_sas kmod-mlnx-ofa_kernel kmod-rshim kmod-srp knem libibumad libibverbs-utils libibverbs librdmacm-utils librdmacm lustre-client-dkms lustre-client mft mlnx-ethtool mlnx-fw-updater mlnx-iproute2 mlnx-ofa_kernel-devel mlnx-ofa_kernel mlnxofed-docs mpi-selector mstflint ofed-scripts openmpi opensm-devel opensm-libs opensm-static opensm perftest qperf rdma-core-devel rdma-core sharp srp_daemon srvadmin-argtable2 srvadmin-hapi srvadmin-idracadm7 telegraf ucx-cma ucx-devel ucx-ib ucx-knem ucx-rdmacm ucx
# ./migrate2rocky.sh -V -r
migrate2rocky - Begin logging at Fri Feb 11 14:15:00 2022.
Creating a list of RPMs installed: begin
Verifying RPMs installed against RPM database: begin
Removing dnf cache
Preparing to migrate CentOS Linux 8 to Rocky Linux 8.
Error: Failed to download metadata for repo 'appstream': Cannot prepare internal mirrorlist: No URLs in mirrorlist
Baseurl for appstream is invalid, setting to https://dl.rockylinux.org/vault/centos/8.5.2111/AppStream/x86_64/os/.
Error: Failed to download metadata for repo 'baseos': Cannot prepare internal mirrorlist: No URLs in mirrorlist
Baseurl for baseos is invalid, setting to https://dl.rockylinux.org/vault/centos/8.5.2111/BaseOS/x86_64/os/.
Determining repository names for CentOS Linux 8......
Found the following repositories which map from CentOS Linux 8 to Rocky Linux 8:
CentOS Linux 8 Rocky Linux 8
appstream appstream
baseos baseos
extras extras
...<snip>...
xkeyboard-config-2.28-1.el8.noarch
zip-3.0-23.el8.x86_64
zlib-1.2.11-17.el8.x86_64
Removed:
kernel-4.18.0-147.el8.x86_64 kernel-core-4.18.0-147.el8.x86_64
kernel-modules-4.18.0-147.el8.x86_64
Complete!
Creating a list of RPMs installed: finish
Verifying RPMs installed against RPM database: finish
You may review the following files:
/root/convert/node-installer-rpm-list-begin.log
/root/convert/node-installer-rpm-list-finish.log
/root/convert/node-installer-rpm-list-verified-begin.log
/root/convert/node-installer-rpm-list-verified-finish.log
Done, please reboot your system.
A log of this installation can be found at /var/log/migrate2rocky.log
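For reuse across hosts and chroots, the exclude-building one-liner in the transcript above could be wrapped in a function along these lines. This is only a sketch: the name build_exclude_line is hypothetical, and it assumes the rpm list log is pipe-separated with the package name in the first field, as the column -s\| step in the one-liner suggests.

```shell
# Print an exclude= line naming every package in a pipe-separated rpm list
# whose record does not mention centos or fedora.
# Hypothetical helper; assumes the package name is the first field.
# $1 = path to the rpm list log
build_exclude_line() {
    awk -F'|' '
        BEGIN { printf "exclude=" }
        /centos|fedora/ { next }    # same filter as grep -Ev "centos|fedora"
        /gpg-pubkey/    { next }    # same filter as grep -v gpg-pubkey
        $1 != "" && !seen[$1]++ { printf "%s%s", sep, $1; sep = " " }
        END { printf "\n" }
    ' "$1"
}

# Example (not run here):
# build_exclude_line /root/convert/node-installer-rpm-list-begin.log >> /etc/dnf/dnf.conf
```

Unlike the one-liner, this version drops duplicate names (for example several installed versions of one package) and leaves no trailing space on the line.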
In my experience with multiple migration attempts on these systems, some packages could be safely removed and no longer needed to be excluded explicitly, but others could not. Still others would only trigger a failure during the transaction processing of dnf distro-sync ... and, currently, migrate2rocky.sh cannot be re-run if it fails at this stage.
I'm not convinced this sequence should be generalized and added explicitly to migrate2rocky.sh, but perhaps something like it could be done if/when only the -V option is specified.
This might be used to alert the user to the list of packages that could break dnf distro-sync ... and to suggest they do more research before attempting migrate2rocky.sh -r when it's possible (likely?) that it will fail anyway.
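One way such a warning could be produced is by looking at the RPM vendor tag instead of grepping the log. The sketch below is an assumption-laden illustration, not existing migrate2rocky.sh behaviour: the function name is hypothetical, and it assumes distribution and EPEL packages carry the vendor strings CentOS and Fedora Project.

```shell
# Read "name|vendor" lines on stdin and print the names of packages whose
# vendor is neither CentOS nor Fedora Project (assumed vendor strings).
# Hypothetical helper, not part of migrate2rocky.sh.
list_foreign_packages() {
    awk -F'|' '
        $1 == "gpg-pubkey" { next }
        $2 ~ /^(CentOS|Fedora Project)$/ { next }
        { print $1 }
    ' | sort -u
}

# Example (not run here): feed it the installed package set.
# rpm -qa --qf '%{NAME}|%{VENDOR}\n' | list_foreign_packages
```

Anything this prints is a candidate for closer inspection before running the migration, not necessarily a package that will break it.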
In short, if you feel there is anything useful in this issue that can be added to migrate2rocky.sh, then by all means let's do it. I am happy to provide more details if you need them.
Otherwise, it'll be fine to close this issue and perhaps keep it in mind if others show up with similar problems. Clever folks will search the closed issues for hints and maybe stumble on this potential solution without any additional help.
Thanks again for all the work on migrate2rocky.sh.
Well, I think running dnf check ahead of time and checking the result will help. It also makes me think that package exclusions should be copied over from the source repos to their Rocky Linux equivalents. Currently, if there are exclude= lines in, say, appstream, then appstream gets replaced by the Rocky Linux appstream and the exclusions are lost. That could make the difference between a failing and a passing migration at the distro-sync stage.
Copying existing per-repository exclusions does sound like a good addition to migrate2rocky.sh and has the potential to prevent dnf distro-sync failures that could otherwise be avoided.
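As a rough illustration of carrying exclusions over, the sketch below reads exclude= or excludepkgs= lines from existing .repo files and prints matching --setopt options. It is not how migrate2rocky.sh works today; the function name is hypothetical, it assumes the Rocky repository ids are the same as the source ids (as in the appstream, baseos and extras mapping shown earlier), and it ignores values continued over several lines.

```shell
# Print a --setopt=<repo>.excludepkgs=... option for every repository
# section in the given .repo files that has an exclude= or excludepkgs= line.
# Hypothetical helper; assumes repository ids are unchanged after migration.
repo_excludes_to_setopt() {
    awk '
        /^\[.*\][ \t]*$/ {
            repo = $0
            sub(/^\[/, "", repo)
            sub(/\][ \t]*$/, "", repo)
            next
        }
        /^(exclude|excludepkgs)[ \t]*=/ {
            val = $0
            sub(/^[^=]*=[ \t]*/, "", val)   # drop the key and the equals sign
            sub(/[ \t]+$/, "", val)         # drop trailing whitespace
            gsub(/[ \t,]+/, ",", val)       # normalise separators to commas
            if (repo != "" && repo != "main" && val != "")
                printf "--setopt=%s.excludepkgs=%s\n", repo, val
        }
    ' "$@"
}

# Example (not run here):
# repo_excludes_to_setopt /etc/yum.repos.d/*.repo
```

Capturing the options before the repo files are replaced would let them be reapplied on the command line or written back into the new repo files afterwards.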