patchelf
ELF load command address/offset not properly aligned
Describe the bug
Since the 0.18 release came out, our build process has started to fail with "ELF load command address/offset not properly aligned". We use patchelf inside the build to add an RPATH on Linux systems.
On CentOS, loading any of our libraries now fails with this error; on Ubuntu 18.04 the issue reproduces only in some cases (we don't have more details). On Ubuntu 20.04 we have not seen any regressions.
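For illustration, a rough sketch of the kind of patchelf calls our build makes (the library path and RPATH value here are placeholders, not our real layout):
# Set the RPATH on a freshly built library, then verify it
# (path and RPATH value are placeholders):
patchelf --set-rpath '$ORIGIN/../lib' build/lib/libexample.so
patchelf --print-rpath build/lib/libexample.so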
Expected behavior
Works as before
patchelf --version output
0.18
Same here with ManyLinux containers that are based on AlmaLinux. 0.17.2.1 works, but 0.18 causes "ELF load command address/offset not properly aligned"
We changed some alignment handling in https://github.com/NixOS/patchelf/pull/475 to fix alignment on ARM. It looks like this can cause regressions with older glibc versions? Can you be more precise about when this happens and how to reproduce it, e.g. using Docker? cc @brenoguim
VTK's CI has been affected by this. It's not trivial, but everything is in CI here: https://gitlab.kitware.com/vtk/vtk/-/jobs/8115134. The only difference that is meaningful to the error (to a first approximation) is a patchelf bump (see the issue I filed above).
IIUC, patchelf is used to stuff non-blessed libraries into Python wheels so that they work "everywhere" given the limited set of libraries/ABIs PyPI can expect to exist on arbitrary Linux machines. DT_SONAME, DT_RUNPATH, and DT_NEEDED entries are all affected (the last to stay in sync with changes to the first) before copying into the wheel. This may change section sizes.
I suspect just getting any old project that compiles C or C++ code, uses some "weird" external library, and puts that into a wheel using auditwheel will show this problem when trying to use said wheel.
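As a rough sketch of that workflow (the package name, Python tag, and platform tag below are made up), the wheel is rebuilt roughly like this:
# Build a wheel that links against a non-manylinux library, then let
# auditwheel copy the library into the wheel and rewrite it with patchelf:
pip wheel . -w dist/
auditwheel repair dist/mypkg-1.0-cp310-cp310-linux_x86_64.whl \
    --plat manylinux2014_x86_64 -w wheelhouse/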
We ran into this with drake. I restored our old functionality of building patchelf from source in https://github.com/RobotLocomotion/drake/pull/19265 so that we can help test if desired. There are instructions on the PR for how to do the build, but I do not think drake will be a convenient codebase for you all to use to identify what needs to be fixed, since iterating on development will be very slow. That said, if you think you have something working and a commit is pushed somewhere, I can fairly easily run a canary build to see whether the new change works as desired. Hope that helps some!
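For anyone else who wants to test a candidate fix, a minimal sketch of building patchelf from source (the standard autotools flow; the install prefix is just an example):
git clone https://github.com/NixOS/patchelf.git
cd patchelf
./bootstrap.sh                       # generate the configure script
./configure --prefix="$HOME/.local"  # prefix is an example, adjust as needed
make -j"$(nproc)"
make check                           # optionally run the test suite
make install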
It looks like in https://github.com/NixOS/patchelf/pull/494 it's only breaking on arm64/s390x for centos. What cpu arch are you on?
@Mic92, I saw the issue on x86_64. Regarding #494, it just means there are currently no tests that exhibit this specific issue (or, for some reason, they only show up on arm64/s390x in the tests). #494 is meant to keep regressions from being reintroduced once this issue is fixed (and a test added), since it only happens with some distros.
Ubuntu 18.04 x86_64 fails with the same message in multiple tests: https://github.com/NixOS/patchelf/actions/runs/4845550763/jobs/8634531910
This issue has broken conda-build for me, which I guess is calling patchelf? I'm on a centos x86_64 machine.
Yep, we're also seeing failures here in conda-build and mamba-build. We have pinned to a lower version of patchelf for now.
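For reference, a sketch of the kind of pin we applied (the conda-forge channel and exact spec are assumptions about your setup):
# Pin patchelf below 0.18 in the build environment:
conda install -c conda-forge "patchelf<0.18"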
It's hard to reproduce this issue, but I have seen "ELF load command address/offset not properly aligned" randomly in our builds. Here's a way to reproduce a broken library, though not necessarily the same issue:
mkdir tmp && cd tmp
wget https://anaconda.org/conda-forge/cuda-cudart_linux-64/12.0.107/download/noarch/cuda-cudart_linux-64-12.0.107-h59595ed_4.conda
unzip cuda-cudart_linux-64-12.0.107-h59595ed_4.conda
rm -rf targets
tar -xvf pkg-cuda-cudart_linux-64-12.0.107-h59595ed_4.tar.zst
for i in 1 2 3 4 5 6 7 8 9 10; do
  patchelf --add-rpath '$ORIGIN../'"$i" ./targets/x86_64-linux/lib/libcudart.so.12
  patchelf --print-rpath ./targets/x86_64-linux/lib/libcudart.so.12
  python -c "import ctypes; ctypes.CDLL('./targets/x86_64-linux/lib/libcudart.so.12.0.107')"
done
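To check whether the patched library actually ends up with a misaligned LOAD segment (this only inspects the symptom, not the root cause), the program headers can be dumped with readelf:
# For every LOAD entry, the file offset and virtual address must be
# congruent modulo the segment alignment:
readelf -lW ./targets/x86_64-linux/lib/libcudart.so.12 | grep LOAD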
To add some complexity, I think Apple's Rosetta 2 differs from Linux/glibc in how it interprets ELF files. So if you're executing patchelf'd amd64 Linux binaries under Docker on macOS on Apple Silicon (where amd64 ELF binaries are translated by Rosetta 2 via binfmt), you may see different behaviour than on a native amd64 Linux OS.
In particular, we've seen some weird issues, such as segfaults, or dynamic library loading going wrong (trying to load lib instead of libwhatever), with patchelf involved. I don't have a simple reproducer yet, and I don't think it's necessarily related to this issue; it's more a heads-up that anyone trying to reproduce amd64 ELF issues under Docker on macOS on Apple Silicon may get very different results.
@Mic92
The issue is 100% reproducible in the unit tests when running under Rocky Linux 8 in Docker:
docker run -it --rm -w $(pwd) -v $(pwd):$(pwd) rockylinux:8.8.20230518 bash -c 'dnf install -y gcc gcc-c++ make autoconf automake libacl-devel libattr-devel diffutils chrpath && ./bootstrap.sh && cd build && make check || (cat tests/*.log; exit 1)'
Example output (partial):
# Run the patched tool and libraries
./many-syms-main: error while loading shared libraries: libmany-syms.so: ELF load command address/offset not properly aligned
FAIL rename-dynamic-symbols.sh (exit status: 127)
I'm also seeing mkfs.ext4 segfault after calling patchelf --set-interpreter multiple times with version 0.18.0 (version 0.17.2 worked fine). I've uploaded a simple reproducer test here: https://github.com/shr-project/patchelf/commits/jansa/mkfs.ext4.segfaults
Reverting 65cdee904431d16668f95d816a495bc35a05a192 fixes this test.
I'll be able to look into these next week. With a reproducer it should be quick to debug!
As a workaround I'm using --print-interpreter to check the current interpreter before trying to change it, to avoid at least the unnecessary --set-interpreter calls when the interpreter is already set to the requested value; maybe this optimization could be implemented in patchelf directly as well?
https://lists.openembedded.org/g/openembedded-core/message/183314
It won't fix the reproducer, as that sets different values in a loop, but it might help avoid some unnecessary binary modifications. A sketch of the guard I'm using is below.
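Roughly the guard in question ($BIN and $INTERP are placeholders for the binary and the desired interpreter path):
# Only rewrite PT_INTERP when it actually differs from the requested value:
current=$(patchelf --print-interpreter "$BIN")
if [ "$current" != "$INTERP" ]; then
  patchelf --set-interpreter "$INTERP" "$BIN"
fi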
I found that this is probably due to a bug in glibc versions earlier than 2.35. To fix it on the patchelf side, #510 is available.
Thanks @yuta-hayama for looking into this.
With both your PRs applied, here is what I see with the repeated-set-interpreter mkfs test:
Segments before: 2 and after: 103
patchelf/tests $ ldd scratch/repeated-set-interpreter/mkfs.ext4
linux-vdso.so.1 (0x00007ffdf1c09000)
libext2fs.so.2 => /usr/lib64/libext2fs.so.2 (0x00007f86de17a000)
libcom_err.so.2 => /usr/lib64/libcom_err.so.2 (0x00007f86f747a000)
libblkid.so.1 => /usr/lib64/libblkid.so.1 (0x00007f86f7408000)
libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00007f86de16f000)
libe2p.so.2 => /usr/lib64/libe2p.so.2 (0x00007f86de162000)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f86f7401000)
libc.so.6 => /usr/lib64/libc.so.6 (0x00007f86ddf8f000)
/short => /lib64/ld-linux-x86-64.so.2 (0x00007f86f74ab000)
patchelf/tests $ scratch/repeated-set-interpreter/mkfs.ext4
bash: scratch/repeated-set-interpreter/mkfs.ext4: cannot execute binary file: Exec format error
Even after fixing the interpreter:
patchelf/tests $ ../src/patchelf --set-interpreter /lib64/ld-linux-x86-64.so.2 scratch/repeated-set-interpreter/mkfs.ext4
patchelf/tests $ scratch/repeated-set-interpreter/mkfs.ext4
bash: scratch/repeated-set-interpreter/mkfs.ext4: cannot execute binary file: Exec format error
patchelf/tests $ ldd scratch/repeated-set-interpreter/mkfs.ext4
linux-vdso.so.1 (0x00007ffc2a189000)
libext2fs.so.2 => /usr/lib64/libext2fs.so.2 (0x00007ff66240e000)
libcom_err.so.2 => /usr/lib64/libcom_err.so.2 (0x00007ff662407000)
libblkid.so.1 => /usr/lib64/libblkid.so.1 (0x00007ff648f8e000)
libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00007ff648f83000)
libe2p.so.2 => /usr/lib64/libe2p.so.2 (0x00007ff648f76000)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007ff648f71000)
libc.so.6 => /usr/lib64/libc.so.6 (0x00007ff648d9e000)
/lib64/ld-linux-x86-64.so.2 (0x00007ff6624be000)
This is with glibc-2.37-r4 from Gentoo. I haven't tried Ubuntu 18.04 yet, but the original issue with OpenEmbedded uninative builds should be resolved by your #508 (as I was also using this workaround: https://lists.openembedded.org/g/openembedded-core/message/183314). So thank you again for implementing this.
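For anyone else reproducing this, a quick way to confirm the corruption (nothing patchelf-specific, just readelf on the patched binary):
# Count PT_LOAD program headers and show the recorded interpreter:
readelf -lW scratch/repeated-set-interpreter/mkfs.ext4 | grep -c LOAD
readelf -lW scratch/repeated-set-interpreter/mkfs.ext4 | grep interpreter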
Hello all, I see various suggested patches for this issue. Any chance of some combo of them getting merged and a new release being cut, so downstream packagers don't have to worry about this?