
ELF load command address/offset not properly aligned

Open ilya-lavrenov opened this issue 1 year ago • 18 comments

Describe the bug

Once the 0.18 release came out, our build process started to fail with ELF load command address/offset not properly aligned. We use patchelf internally to add an RPATH on Linux systems.

On CentOS, loading of all our libraries started to fail with this error; on Ubuntu 18.04 the issue reproduces only in some cases (we don't have more details); on Ubuntu 20.04 we have not seen any regressions.
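
Our usage is roughly the following (the library name and RPATH value here are illustrative, not the exact ones from our build):

# add an RPATH entry so the library resolves its dependencies relative to itself;
# after upgrading to 0.18 the patched library may fail to load with the error above
patchelf --add-rpath '$ORIGIN/../lib' libexample.so
ldd libexample.so   # or dlopen it to see the loader error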

Expected behavior

Works as before

patchelf --version output

0.18

ilya-lavrenov avatar Apr 24 '23 16:04 ilya-lavrenov

Same here with ManyLinux containers that are based on AlmaLinux. 0.17.2.1 works, but 0.18 causes "ELF load command address/offset not properly aligned"

vsuorant avatar Apr 24 '23 17:04 vsuorant

We changed some alignment handling in https://github.com/NixOS/patchelf/pull/475 to fix alignment on ARM. It looks like this can cause regressions with older glibc versions. Can you be more precise about when this happens and how to reproduce it, e.g. using Docker? cc @brenoguim

Mic92 avatar Apr 24 '23 17:04 Mic92

It's not trivial to reproduce, but VTK's CI has been affected by this; everything is in CI here: https://gitlab.kitware.com/vtk/vtk/-/jobs/8115134. The only difference that is meaningful to the error (to a first approximation) is a patchelf bump (see the issue I filed above).

IIUC, patchelf is used to bundle non-blessed libraries into Python wheels so that they work "everywhere", given the limited set of libraries/ABIs PyPI can expect to exist on arbitrary Linux machines. DT_SONAME, DT_RUNPATH, and DT_NEEDED entries are all modified (the last to stay in sync with the first's changes) before the libraries are copied into the wheel. This may change section sizes.
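
The individual patchelf operations involved are roughly of this shape (the library and extension names are illustrative; auditwheel drives the real calls programmatically):

# give the vendored copy a unique SONAME so it cannot collide with a system library
patchelf --set-soname libfoo-abc123.so.1 package.libs/libfoo-abc123.so.1
# make the extension module look in the bundled libs directory first
patchelf --set-rpath '$ORIGIN/../package.libs' _ext.cpython-311-x86_64-linux-gnu.so
# point the existing DT_NEEDED entry at the renamed vendored copy
patchelf --replace-needed libfoo.so.1 libfoo-abc123.so.1 _ext.cpython-311-x86_64-linux-gnu.so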

I suspect just getting any old project that compiles C or C++ code, uses some "weird" external library, and puts that into a wheel using auditwheel will show this problem when trying to use said wheel.

mathstuf avatar Apr 24 '23 18:04 mathstuf

> I suspect just getting any old project that compiles C or C++ code, uses some "weird" external library, and puts that into a wheel using auditwheel will show this problem when trying to use said wheel.

We ran into this with drake. I restored our old functionality of building patchelf from source in https://github.com/RobotLocomotion/drake/pull/19265 so we can help test if desired. The PR has instructions for how to do the build, but I don't think drake will be a convenient codebase for you to use to identify what needs to be fixed, since iterating on development will be very slow. That said, if you think you have something working and push a commit somewhere, I can fairly easily run a canary build to see whether the change works as desired. Hope that helps some!
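
For anyone who wants to try a from-source build of patchelf outside of drake, the usual autotools sequence is roughly (the prefix here is just an example):

git clone https://github.com/NixOS/patchelf.git && cd patchelf
./bootstrap.sh                      # generates the configure script
./configure --prefix=/opt/patchelf  # pick any prefix you like
make
make check                          # optional: run the test suite
make install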

svenevs avatar Apr 24 '23 20:04 svenevs

It looks like in https://github.com/NixOS/patchelf/pull/494 it only breaks on arm64/s390x for CentOS. What CPU architecture are you on?

Mic92 avatar Apr 29 '23 15:04 Mic92

@Mic92, I saw the issue on x86_64. #494 just means that no existing test exposes this specific issue right now (or that in the tests it only shows up on arm64/s390x for some reason). #494 is meant to prevent regressions from being introduced once this issue is fixed (and a test added), since it only happens on some distros.

mayeut avatar Apr 29 '23 16:04 mayeut

Ubuntu 18.04 x86_64 fails with the same message in multiple tests: https://github.com/NixOS/patchelf/actions/runs/4845550763/jobs/8634531910

mayeut avatar Apr 30 '23 17:04 mayeut

This issue has broken conda-build for me, which I guess is calling patchelf? I'm on a CentOS x86_64 machine.

jacobwilliams avatar May 12 '23 00:05 jacobwilliams

Yep, we're also seeing failures here in conda-build and mamba-build. Have pinned to a lower version of patchelf for now.

mzjp2 avatar May 13 '23 12:05 mzjp2

It's hard to reproduce this issue, but I have seen the ELF load command address/offset not properly aligned error randomly in our builds. Here's a way to produce a broken library, though not necessarily via the same underlying issue.

mkdir tmp && cd tmp
wget https://anaconda.org/conda-forge/cuda-cudart_linux-64/12.0.107/download/noarch/cuda-cudart_linux-64-12.0.107-h59595ed_4.conda
unzip cuda-cudart_linux-64-12.0.107-h59595ed_4.conda   # .conda packages are zip archives
rm -rf targets
tar -xvf pkg-cuda-cudart_linux-64-12.0.107-h59595ed_4.tar.zst   # needs tar with zstd support

# repeatedly add an RPATH entry and try to dlopen the library after each modification
for i in 1 2 3 4 5 6 7 8 9 10; do
  patchelf --add-rpath '$ORIGIN../'"$i" ./targets/x86_64-linux/lib/libcudart.so.12
  patchelf --print-rpath ./targets/x86_64-linux/lib/libcudart.so.12
  python -c "import ctypes; ctypes.CDLL('./targets/x86_64-linux/lib/libcudart.so.12.0.107')"
done
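
To see whether a patched file actually trips the loader's check, it helps to dump the program headers after each iteration; glibc rejects a PT_LOAD segment whose virtual address and file offset are not congruent modulo its alignment, which is the condition behind the "not properly aligned" message:

# for every LOAD row, (VirtAddr - Offset) must be a multiple of Align,
# otherwise the dynamic loader refuses to map the segment
readelf -lW ./targets/x86_64-linux/lib/libcudart.so.12 | grep -E 'Type|LOAD'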

isuruf avatar May 24 '23 14:05 isuruf

To add some complexity, I think Apple's Rosetta 2 differs from Linux/glibc in how it interprets ELF binaries. If you're executing patchelf'd amd64 Linux binaries under Docker on macOS on Apple Silicon (where amd64 ELF binaries are translated by Rosetta 2 via binfmt), you may see different behaviour than on a native amd64 Linux OS.

In particular, we've seen some weird issues (segfaults, or mangled dynamic library names, e.g. trying to load lib instead of libwhatever) where patchelf is involved. I don't have a simple reproducer yet, and I don't think it's necessarily related to this issue, but consider it a heads up: if you're trying to reproduce amd64 ELF issues under Docker on macOS on Apple Silicon, you may get very different results.

rcoup avatar May 24 '23 20:05 rcoup

@Mic92

The issue is 100% reproducible in the unit tests when running under a Rocky Linux 8 Docker container:

docker run -it --rm -w $(pwd) -v $(pwd):$(pwd) rockylinux:8.8.20230518 bash -c 'dnf install -y gcc gcc-c++ make autoconf automake libacl-devel libattr-devel diffutils chrpath && ./bootstrap.sh && cd build && make check || (cat tests/*.log; exit 1)'

Example output (partial):

# Run the patched tool and libraries
./many-syms-main: error while loading shared libraries: libmany-syms.so: ELF load command address/offset not properly aligned
FAIL rename-dynamic-symbols.sh (exit status: 127)
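
When iterating on a fix, a single failing test can usually be re-run on its own; assuming the automake test harness that produces these per-test .log files (and the build/ layout used in the command above), something like:

cd build
make -C tests check TESTS=rename-dynamic-symbols.sh   # re-run just this test
cat tests/rename-dynamic-symbols*.log                 # inspect its log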

da-x avatar Jun 18 '23 03:06 da-x

I'm also seeing mkfs.ext4 segfault after calling patchelf --set-interpreter multiple times with version 0.18.0 (version 0.17.2 worked fine). I've uploaded a simple reproducer test here: https://github.com/shr-project/patchelf/commits/jansa/mkfs.ext4.segfaults

Reverting 65cdee904431d16668f95d816a495bc35a05a192 fixes this test.
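
The shape of the reproducer is roughly like this (the binary and the interpreter strings here are illustrative; the actual test is in the branch linked above):

cp /usr/sbin/mkfs.ext4 .
# alternate between interpreter strings of different lengths;
# with 0.18.0 the resulting binary segfaults or stops loading, with 0.17.2 it keeps working
for i in $(seq 1 20); do
  patchelf --set-interpreter /short ./mkfs.ext4
  patchelf --set-interpreter /lib64/ld-linux-x86-64.so.2 ./mkfs.ext4
done
./mkfs.ext4 -V   # should just print the mke2fs version if the binary is still intact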

shr-project avatar Jun 22 '23 15:06 shr-project

I'll be able to look into these next week. With a reproducer it should be quick to debug!

brenoguim avatar Jun 23 '23 00:06 brenoguim

As a workaround I'm using --print-interpreter to check the current interpreter before trying to change it, to avoid at least the unnecessary --set-interpreter calls when the interpreter is already set to the requested value. Maybe this "optimization" could be implemented in patchelf directly as well?

https://lists.openembedded.org/g/openembedded-core/message/183314

It won't fix the reproducer, as that sets different values in the loop, but it might help avoid some unnecessary binary modifications.
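
In shell the guard looks roughly like this ($binary and the interpreter value stand in for the real ones):

wanted=/lib64/ld-linux-x86-64.so.2   # illustrative interpreter path
current=$(patchelf --print-interpreter "$binary")
# only rewrite the interpreter when it actually differs, leaving already-correct binaries untouched
if [ "$current" != "$wanted" ]; then
  patchelf --set-interpreter "$wanted" "$binary"
fi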

shr-project avatar Jun 28 '23 06:06 shr-project

I found that this is probably due to a bug in glibc versions earlier than 2.35. If it is to be fixed on the patchelf side, #510 is available.

yuta-hayama avatar Jul 31 '23 05:07 yuta-hayama

Thanks @yuta-hayama for looking into this.

With both of your PRs applied, I still see the repeated-set-interpreter mkfs test fail:

Segments before: 2 and after: 103
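
The same blow-up can be seen directly with readelf on the patched binary:

readelf -h scratch/repeated-set-interpreter/mkfs.ext4 | grep 'Number of program headers'
readelf -lW scratch/repeated-set-interpreter/mkfs.ext4 | grep -c LOAD   # count of PT_LOAD segments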

patchelf/tests $ ldd scratch/repeated-set-interpreter/mkfs.ext4
        linux-vdso.so.1 (0x00007ffdf1c09000)
        libext2fs.so.2 => /usr/lib64/libext2fs.so.2 (0x00007f86de17a000)
        libcom_err.so.2 => /usr/lib64/libcom_err.so.2 (0x00007f86f747a000)
        libblkid.so.1 => /usr/lib64/libblkid.so.1 (0x00007f86f7408000)
        libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00007f86de16f000)
        libe2p.so.2 => /usr/lib64/libe2p.so.2 (0x00007f86de162000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f86f7401000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00007f86ddf8f000)
        /short => /lib64/ld-linux-x86-64.so.2 (0x00007f86f74ab000)

patchelf/tests $ scratch/repeated-set-interpreter/mkfs.ext4
bash: scratch/repeated-set-interpreter/mkfs.ext4: cannot execute binary file: Exec format error

Even after fixing the interpreter:

patchelf/tests $ ../src/patchelf --set-interpreter /lib64/ld-linux-x86-64.so.2 scratch/repeated-set-interpreter/mkfs.ext4
patchelf/tests $ scratch/repeated-set-interpreter/mkfs.ext4
bash: scratch/repeated-set-interpreter/mkfs.ext4: cannot execute binary file: Exec format error
patchelf/tests $ ldd scratch/repeated-set-interpreter/mkfs.ext4
        linux-vdso.so.1 (0x00007ffc2a189000)
        libext2fs.so.2 => /usr/lib64/libext2fs.so.2 (0x00007ff66240e000)
        libcom_err.so.2 => /usr/lib64/libcom_err.so.2 (0x00007ff662407000)
        libblkid.so.1 => /usr/lib64/libblkid.so.1 (0x00007ff648f8e000)
        libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00007ff648f83000)
        libe2p.so.2 => /usr/lib64/libe2p.so.2 (0x00007ff648f76000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007ff648f71000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00007ff648d9e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ff6624be000)

This is with glibc-2.37-r4 from Gentoo. I haven't tried on Ubuntu 18.04 yet, but the original issue with OpenEmbedded uninative builds should be resolved by your #508 (as I was using this workaround https://lists.openembedded.org/g/openembedded-core/message/183314 as well). So thank you again for implementing this.

shr-project avatar Jul 31 '23 11:07 shr-project

Hello all, I see various suggested patches for this issue. Any chance of some combo of them getting merged and a new release being cut, so downstream packagers don't have to worry about this?

satmandu avatar Sep 09 '24 11:09 satmandu