easybuild-easyconfigs
XZ: patch for symbols still needed?
In https://github.com/easybuilders/easybuild-easyconfigs/issues/4036 an issue was identified with CentOS 7.3 due to certain missing version symbols in the XZ library. A patch was then included in the XZ easyconfigs that removes the XZ_5.2 symbol, tweaks the XZ_5.0 symbol, and adds XZ_5.1.2alpha and XZ_5.2.2 symbols. This patch is still applied in the most recent easyconfig, for XZ 5.2.5. The patch fixed the issue for CentOS, but on RHEL 8.4 it now causes an issue with the system-wide libarchive.so (and we had it on our earlier system as well):
snellius paulm@int3 12:22 ~$ cmake --version
cmake version 3.18.2
CMake suite maintained and supported by Kitware (kitware.com/cmake).
snellius paulm@int3 12:22 ~$ module load 2021 XZ/5.2.5-GCCcore-10.3.0
snellius paulm@int3 12:23 ~$ cmake --version
cmake: /sw/arch/Centos8/EB_production/2021/software/XZ/5.2.5-GCCcore-10.3.0/lib/liblzma.so.5: version `XZ_5.2' not found (required by /lib64/libarchive.so.13)
This is caused by the system liblzma.so.5 including the symbols for 5.0 and 5.2:
snellius paulm@int3 12:25 ~$ nm -D /lib64/liblzma.so.5 | grep XZ
0000000000000000 A XZ_5.0
0000000000000000 A XZ_5.2
While the (patched) EB library is missing the XZ_5.2 one:
snellius paulm@int3 12:25 ~$ nm -D /sw/arch/Centos8/EB_production/2021/software/XZ/5.2.5-GCCcore-10.3.0/lib/liblzma.so.5 | grep XZ
0000000000000000 A XZ_5.0
0000000000000000 A XZ_5.1.2alpha
0000000000000000 A XZ_5.2.2
Having system-wide binaries break because an EB module was loaded isn't very nice. In this case I wonder whether the XZ symbol patch is still needed. Also, removing the XZ_5.2 symbol as it does, while leaving XZ_5.0 in place, feels very wrong.
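The mismatch between the two listings can be made explicit by diffing the version nodes each library defines. A minimal Python sketch, assuming the two `nm -D ... | grep XZ` outputs above are pasted in as strings; the helper name `version_nodes` is made up for this illustration:

```python
def version_nodes(nm_output):
    """Return the set of version node names found in `nm -D` output.

    Version definitions appear as absolute symbols: <address> A <name>.
    """
    nodes = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[1] == "A":
            nodes.add(fields[2])
    return nodes


# Output of: nm -D /lib64/liblzma.so.5 | grep XZ   (system library)
SYSTEM_NM = """\
0000000000000000 A XZ_5.0
0000000000000000 A XZ_5.2
"""

# Output of the same command on the patched EB build of XZ 5.2.5
EB_NM = """\
0000000000000000 A XZ_5.0
0000000000000000 A XZ_5.1.2alpha
0000000000000000 A XZ_5.2.2
"""

system = version_nodes(SYSTEM_NM)
patched = version_nodes(EB_NM)

missing = sorted(system - patched)  # nodes system binaries may require
extra = sorted(patched - system)    # nodes only the patched build defines

print("defined by system lib but not by EB build:", missing)
print("defined by EB build but not by system lib:", extra)
```

With the listings above, this reports XZ_5.2 as missing from the EB build, which is the node /lib64/libarchive.so.13 asks for in the error message.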
This bug (#4036) was also found in Maneage [1], a template for reproducible science research papers aiming at a rigorous set of reproducibility criteria [2]. The bug report is at [3]. Maneage is intended to be reproducible on any Unix-like OS, so we don't want a hack that's specific to Red Hat alone, or even worse, just a few particular versions of Red Hat.
As a possible answer to the question of whether a hack is "still needed", it seems from Maneage 27ff6f7 that cmake-3.21.4 does still need either the known Red Hat-specific(?) workaround, or an alternative workaround.
[1] https://maneage.org [2] https://doi.org/10.1109/MCSE.2021.3072860 [3] https://savannah.nongnu.org/bugs/index.php?62700
I think the patch is still necessary (only) for CentOS 7, but I also think the patch should be modified to create symbols for XZ_5.2 rather than XZ_5.2.2, as it is causing problems on other OSes and OS versions.
The original patch is at https://git.centos.org/rpms/xz/blob/c7/f/SOURCES/xz-5.2.2-compat-libs.patch
@broukema Does the updated patch in https://github.com/easybuilders/easybuild-easyconfigs/pull/15856 solve your issue?
I'm going to test it on a CentOS 7 system and on a different system, but I've also asked about this upstream, to the 'xz' developers at #tukaani on Libera Chat. There's no point having 'easy' builds without getting things fixed properly upstream, where the community for a package is most likely to understand what the best solution is: https://tukaani.org/contact.html
@broukema To be fair, I don't think the problem lies with the developers: CentOS 7 created the issue (and fixed it with a patch). The problem for EasyBuild is that we want the fix for CentOS 7, but we also need the same patch to work on all OSes (including all the ones that don't need the patch at all).
Even though this looks like a CentOS7 specific problem, I assume that #tukaani/xz developers have an interest in keeping xz portable, and in keeping widely used programs that depend on xz portable in their dependence on xz, even if, as seems to be the case here, those programs made some non-portable or non-sustainable hacks. CentOS7 systems will gradually be superseded, but they won't disappear suddenly.
As I understand it, both cmake and openssl (through openssl-devel) created dependences on versions or functions in ABIs that they were not supposed to create. If we knew the list of all programs that incorrectly depend on non-maintainable parts of the xz/liblzma ABI in the CentOS7 context, and if we could persuade all their developers to fix this dependence, then that would, I presume, be an alternative to xz making the hack proposed here. I guess CentOS7 developers are the ones who should make the effort of getting cmake, openssl, and any other affected packages fixed.
In either case, having a list of known hacks in 'easybuild-easyconfigs' is a useful workaround (and it's where I found info that seems to handle this bug), but it's not a sustainable way to develop the FOSS ecosystem. The fixes need to be handled in the parts of the system where people most expect them to be handled, and where expertise and understanding on a package is coordinated - which is upstream in the particular affected packages.
The patch proposed in #15856, in the form [1], solves the bug on a CentOS7 system and appears not to cause a bug on a Debian GNU/Linux 11.3 system [2]. Within Maneage [3], we cannot apply a patch directly, because xz is part of the "basic" build system, which is built before higher-level tools such as patch have been compiled. So a patched file is used directly.
Thanks! :)
We'll hopefully see soon if xz developers have any comments or recommendations.
[1] https://codeberg.org/boud/maneage_dev/commit/25539cd69cdbd2826482894dc64804f0830dbbec [2] https://savannah.nongnu.org/bugs/index.php?62700 [3] https://maneage.org
As I understand it, both cmake and openssl (through openssl-devel) created dependences on versions or functions in ABIs that they were not supposed to create.
Would you have more details on this?
- cmake:
  - https://savannah.nongnu.org/bugs/index.php?62700
  - your own bug report above here on https://github.com/easybuilders/easybuild-easyconfigs/issues/14991
- openssl:
  - https://github.com/easybuilders/easybuild-easyconfigs/issues/4036
Issue #4036 is from 2017, which is what makes me think that upstream (either xz or (cmake + openssl + others)) wasn't/weren't informed. Five years is usually plenty for handling a bug with an apparently simple solution for reasonably well-maintained packages like these.
I'm still not convinced there is any blame on cmake or openssl developers, as you claim. The problem with these symbols seems to stem almost exclusively from using RHEL packages on a CentOS system, for specific OS version combinations only (see also this extensive presentation on this issue, starting slide 9). I don't see how this issue can crop up when compiling cmake/openssl/
Also note that the XZ-related symbols in liblzma.so of current distros seem to align (this is on systems I have access to, but I don't expect surprises here, as these are the symbols that should logically be included for XZ 5.2.x):
Arch
xz 5.2.5-3
melis@juggle 14:08:~$ nm -D /usr/lib/liblzma.so | grep " XZ"
0000000000000000 A XZ_5.0
0000000000000000 A XZ_5.2
Debian GNU/Linux 10
paulm@login4 14:07 ~$ nm -D /lib/x86_64-linux-gnu/liblzma.so.5 | grep XZ
0000000000000000 A XZ_5.0
0000000000000000 A XZ_5.2
RHEL 8.2 EUS
snellius paulm@int3 14:08 ~$ nm -D /lib64/liblzma.so.5 | grep " XZ"
0000000000000000 A XZ_5.0
0000000000000000 A XZ_5.2
Ubuntu 18.04.6 LTS
pmelis@rsc-instance:~$ nm -D /lib/x86_64-linux-gnu/liblzma.so.5 | grep " XZ"
0000000000000000 A XZ_5.0
0000000000000000 A XZ_5.2
So this makes it even weirder that EB would start to screw up semantically correct versioning symbols in order to introduce a fix for only a single OS version (CentOS 7), making everybody else suffer. Hence the original reason for this issue report.
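The comparison across distros can be written down as a small consistency check. This is only a sketch in Python: the node sets are transcribed by hand from the listings in this thread, and the names `EXPECTED`, `OBSERVED` and `deviations` are made up for the illustration.

```python
# Version nodes one would expect from an unpatched XZ 5.2.x liblzma,
# going by the four distro listings above.
EXPECTED = frozenset({"XZ_5.0", "XZ_5.2"})

# Transcribed by hand from the `nm -D ... | grep XZ` outputs in this thread.
OBSERVED = {
    "Arch (xz 5.2.5-3)": {"XZ_5.0", "XZ_5.2"},
    "Debian GNU/Linux 10": {"XZ_5.0", "XZ_5.2"},
    "RHEL 8.2 EUS": {"XZ_5.0", "XZ_5.2"},
    "Ubuntu 18.04.6 LTS": {"XZ_5.0", "XZ_5.2"},
    "EasyBuild XZ/5.2.5-GCCcore-10.3.0 (patched)": {
        "XZ_5.0", "XZ_5.1.2alpha", "XZ_5.2.2",
    },
}


def deviations(observed, expected=EXPECTED):
    """Return, per system, the nodes that are missing or unexpected."""
    report = {}
    for system, nodes in observed.items():
        nodes = set(nodes)
        if nodes != expected:
            report[system] = {
                "missing": sorted(expected - nodes),
                "unexpected": sorted(nodes - expected),
            }
    return report


for system, diff in deviations(OBSERVED).items():
    print(system, diff)
```

Only the patched EasyBuild library is reported; the four distro libraries agree with each other.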
The Fermilab pdf [1] is very useful - thanks! I can't test the openssl-devel case, but the cmake case is the one that I found and could, in principle, test further. I think that page 27 picks out something that has been missing from this discussion so far: "The liblzma library didn't specifically choose the new or old symbol. It just used the pthread headers on the system."
Trying to trace this, libpthread on RH doesn't make it easy to check the glibc version:
# CentOS 7.9.2009
$ nm -D /lib64/libpthread.so.0|grep pthread_sigmask
00000000xxxxxxxx T pthread_sigmask
# Debian 11.3
$ nm -D /lib/x86_64-linux-gnu/libpthread.so.0 |grep pthread_sigmask
00000000xxxxxxxx T pthread_sigmask@@GLIBC_2.2.5
Debian makes dependency tracing easier, e.g. around pthread_getaffinity:
# CentOS 7.9.2009
$ nm -D /lib64/libpthread.so.0|grep -C2 pthread_getaffinity
00000000xxxxxxxx T pthread_equal
00000000xxxxxxxx T pthread_exit
00000000xxxxxxxx T pthread_getaffinity_np
00000000xxxxxxxx T pthread_getaffinity_np
00000000xxxxxxxx T pthread_getattr_np
00000000xxxxxxxx T pthread_getconcurrency
# Debian 11.3
$ nm -D /lib/x86_64-linux-gnu/libpthread.so.0 |grep -C2 pthread_getaffinity
00000000xxxxxxxx W pthread_detach@@GLIBC_2.2.5
00000000xxxxxxxx W pthread_exit@@GLIBC_2.2.5
00000000xxxxxxxx T pthread_getaffinity_np@@GLIBC_2.3.4
00000000xxxxxxxx T pthread_getaffinity_np@GLIBC_2.3.3
00000000xxxxxxxx T pthread_getattr_default_np@@GLIBC_2.18
00000000xxxxxxxx T pthread_getattr_np@@GLIBC_2.2.5
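For reading listings like the Debian one: in `nm` output a `@@` suffix marks the default version of a symbol (the one newly linked programs bind to), while a single `@` marks an older, non-default version kept for compatibility. A small Python sketch that splits such lines; the function name `parse_versioned_symbol` is made up for this illustration, and lines without a suffix (as in the CentOS listing) simply yield no version:

```python
def parse_versioned_symbol(nm_line):
    """Split one `nm -D` line into (name, version, is_default).

    version and is_default are None when the line carries no version suffix.
    """
    symbol = nm_line.split()[-1]
    if "@@" in symbol:
        name, version = symbol.split("@@", 1)
        return name, version, True   # default version
    if "@" in symbol:
        name, version = symbol.split("@", 1)
        return name, version, False  # older compatibility version
    return symbol, None, None


# Lines copied from the listings above.
lines = [
    "00000000xxxxxxxx T pthread_getaffinity_np@@GLIBC_2.3.4",  # Debian 11.3
    "00000000xxxxxxxx T pthread_getaffinity_np@GLIBC_2.3.3",   # Debian 11.3
    "00000000xxxxxxxx T pthread_getaffinity_np",               # CentOS 7.9.2009
]

for line in lines:
    print(parse_versioned_symbol(line))
```

This would explain why pthread_getaffinity_np shows up twice in both listings: two versions of the same symbol are exported, and only the Debian output labels them.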
In any case, you do seem to be right: the fault is not that of cmake or openssl, it appears rather to be a pthread/glibc issue. Maneage does currently depend on the system's glibc, so it's less independent from the host system than we would like. We have a task set for installing glibc itself [2].
My guess is that hacking glibc/pthread would be even less sustainable than the current xz hack, though I'm again just guessing.
[1] https://lss.fnal.gov/archive/2021/slides/fermilab-slides-21-020-scd.pdf [2] https://savannah.nongnu.org/task/?15390
@paulmelis @ocaisa Can this be closed too now that #15856 is merged?
This is currently being discussed upstream for xz on #tukaani on Libera Chat: https://tukaani.org/contact.html .
A few quick points:
- RHEL (Red Hat Enterprise Linux) should not have used XZ_5.1.2alpha, a symbol that was purely developmental (experimental), for production level versions of CentOS;
- RHEL should not have invented the symbol string XZ_5.2.2 for an apparently existing library symbol in xz.
The risk in the two cases is that people write software that depends on one or both of these.
(edit: fixed my own confusion about library symbols versus overall code versions)
Upstream xz is still working on this - it's not trivial to clean up in a robust, modular, tidy way.
Draft fixes can be seen on the xz experimental git branch [1]. See commit 913ddc55 'liblzma: Vaccinate against an ill patch from RHEL/CentOS 7' and commit 17485e88. Commit 913ddc55 has a detailed explanation of the problem and the fix. As I understand it, the patch [2] + [3] is likely to be applied in 5.2.7 in the stable branch - whenever that happens to be released.
For Easybuild-Easyconfigs, I would suggest reverting the guess at https://github.com/easybuilders/easybuild-easyconfigs/pull/15856 and implementing the recommended upstream xz fix instead.
As far as Maneage [4] is concerned, we've separated out the CentOS 7 error in handling xz [5] from our own bug in handling the build of cmake [6]. The fix for cmake is to use --no-system-libs, which seems to correctly compile cmake's own choice of several libraries, including liblzma, statically into the cmake binary. This way the Maneage reproducibility system is better isolated from the host system and we bypass the CentOS 7 bug; we're not in a hurry to update to a newer xz that solves the CentOS 7 bug, since the bug no longer affects us on CentOS 7, and is unlikely to affect us on other OSes.
[1] https://git.tukaani.org/?p=xz.git;a=summary [2] https://git.tukaani.org/?p=xz.git;a=commit;h=913ddc5572b9455fa0cf299be2e35c708840e922 [3] https://git.tukaani.org/?p=xz.git;a=commit;h=17485e884ce5c74315f29a8a1507bc706cd5cd1d [4] https://maneage.org [5] https://savannah.nongnu.org/bugs/index.php?62700 [6] https://savannah.nongnu.org/bugs/index.php?63043
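One way to check whether a given cmake binary still pulls in a shared liblzma at run time is to look at its `ldd` output. A minimal Python sketch follows. The two sample outputs are invented for illustration (paths and addresses are not from a real system), and the helper names `dynamic_deps` and `links_against` are hypothetical.

```python
def dynamic_deps(ldd_output):
    """Map each shared library name in `ldd` output to its resolved path."""
    deps = {}
    for line in ldd_output.splitlines():
        if "=>" not in line:
            continue  # e.g. the vdso or the dynamic loader line
        soname, _, rest = line.partition("=>")
        # Drop the trailing load address "(0x...)" if present.
        deps[soname.strip()] = rest.split("(")[0].strip()
    return deps


def links_against(ldd_output, prefix):
    """True if any dynamic dependency name starts with `prefix`."""
    return any(name.startswith(prefix) for name in dynamic_deps(ldd_output))


# Invented example: a cmake built against system libraries.
WITH_SYSTEM_LIBS = """\
    linux-vdso.so.1 (0x00007ffc00000000)
    libarchive.so.13 => /lib64/libarchive.so.13 (0x00007f0000200000)
    liblzma.so.5 => /lib64/liblzma.so.5 (0x00007f0000100000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f0000000000)
"""

# Invented example: a cmake built with its bundled libraries.
WITH_BUNDLED_LIBS = """\
    linux-vdso.so.1 (0x00007ffc00000000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f0000000000)
"""

print(links_against(WITH_SYSTEM_LIBS, "liblzma"))
print(links_against(WITH_BUNDLED_LIBS, "liblzma"))
```

In practice the string would come from running `ldd` on the cmake binary in question. If liblzma does not appear among the dynamic dependencies, the binary cannot trip over a mismatched liblzma.so.5 on the library path.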