
static binary crashing with NULL symbol

Open clnperez opened this issue 4 years ago • 21 comments

target: ppc64le-unknown-linux-gnu & stable-x86_64-unknown-linux-gnu

A statically compiled kata agent binary crashes in what looks like a dynamic-library-related function. The binaries were built using the latest libc so that we could pick up the fix for statically compiling on non-x86 architectures (https://github.com/rust-lang/libc/pull/2046).

However, we are seeing this: dl-call-libc-early-init.c:37: _dl_call_libc_early_init: Assertion `sym != NULL' failed.

The place @amulyam24 has narrowed it down to is possibly in this openpty call: https://github.com/kata-containers/kata-containers/blob/main/src/agent/rustjail/src/container.rs#L881

I was once stuck on an issue with Go that involved ptys, and in that case the TCSETS and TCGETS values were wrong for ppc64le. See https://github.com/golang/go/issues/19560. So I can't help but be suspicious, but the dl_ part of this makes me think it's probably not related.

Unfortunately, there's not a lot of info in the logs, and we haven't been able to find a small recreator. I'm opening this here in hopes that someone has some tips for getting more information.

kata-issue-1387-trace.out.txt

clnperez avatar Feb 02 '21 23:02 clnperez

fyi @fidencio

clnperez avatar Feb 04 '21 20:02 clnperez

/cc @Jakob-Naucke

fidencio avatar Feb 04 '21 20:02 fidencio

Hmm @clnperez. Just dumping a few thoughts: Interestingly enough, openpty is also the call that I tracked #2033 (now fixed) down to. That function _dl_call_libc_early_init uses some static addresses (mirror here), but I don't see why that would be an issue on its own. However, it is relatively new (glibc 2.32 I believe), is this limited to more recent glibc? It's a bit surprising to me since the osbuilder distros are all older than that (unless you upgraded?) and the container libc shouldn't affect this (or should it? From your logs, it appears to be before the pivot_root/execvp into the container.)

Jakob-Naucke avatar Feb 05 '21 14:02 Jakob-Naucke

@tuliom FYI

Thanks @Jakob-Naucke. I am still using the older glibc in the default osbuilder distros. I think @amulyam24 may have tried a newer distro. IIUC, the container libc wouldn't matter unless this was dynamically compiled either way -- but I could be wrong!

However, it is relatively new (glibc 2.32 I believe), is this limited to more recent glibc?

Can you clarify what you're referring to here when you say "it is relatively new?"

clnperez avatar Feb 08 '21 15:02 clnperez

@clnperez I meant: The file elf/dl-call-libc-early-init.c is in glibc 2.32, but not in glibc 2.31. Your error message refers to this file, but Kata's default osbuilder distros are older than that (CentOS 7 is on 2.17, Debian 9 is on 2.24, Fedora 30 is on 2.29, openSUSE 15 is on 2.26, Ubuntu 18.04 is on 2.30). I was wondering if this issue only occurred with osbuilder distros newer than that. Since you're saying you haven't upgraded: If you haven't already, would you mind trying a container image that uses glibc <2.32, such as ubuntu:latest (20.04)? Maybe it does have something to do with the container glibc. If the error message persists although both guest and container use glibc older than 2.32, I'd be really confused.
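
Since the discussion hinges on which glibc version is involved, here is a minimal sketch of reading it from inside a program. It assumes a glibc target and declares `gnu_get_libc_version` (from `<gnu/libc-version.h>`) by hand; the helper names `parse_glibc_version` and `runtime_glibc_version` are made up for illustration. Note that in a dynamically linked program this reports the glibc loaded at run time, while in a static build it reports the copy linked into the binary, i.e. the build host's version.

```rust
// glibc-only symbol; gated so the sketch still compiles on musl or non-Linux targets.
#[cfg(all(target_os = "linux", target_env = "gnu"))]
extern "C" {
    fn gnu_get_libc_version() -> *const std::os::raw::c_char;
}

/// Parse a string such as "2.31" (or "2.31.9000") into (major, minor).
/// Hypothetical helper, not part of any library.
fn parse_glibc_version(s: &str) -> Option<(u32, u32)> {
    let mut parts = s.trim().split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    Some((major, minor))
}

#[cfg(all(target_os = "linux", target_env = "gnu"))]
fn runtime_glibc_version() -> Option<(u32, u32)> {
    // SAFETY: glibc returns a pointer to a static NUL-terminated string.
    let raw = unsafe { std::ffi::CStr::from_ptr(gnu_get_libc_version()) };
    parse_glibc_version(raw.to_str().ok()?)
}

#[cfg(not(all(target_os = "linux", target_env = "gnu")))]
fn runtime_glibc_version() -> Option<(u32, u32)> {
    // Not a glibc target: nothing to query.
    None
}

fn main() {
    match runtime_glibc_version() {
        Some((major, minor)) => println!("glibc {}.{}", major, minor),
        None => println!("not built for glibc, or version unreadable"),
    }
}
```

Tuples compare lexicographically, so a check such as `version < (2, 32)` works directly on the parsed value.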

Jakob-Naucke avatar Feb 08 '21 16:02 Jakob-Naucke

Since you're saying you haven't upgraded: If you haven't already, would you mind trying a container image that uses glibc <2.32, such as ubuntu:latest (20.04)? Maybe it does have something to do with the container glibc.

@Jakob-Naucke, to confirm, the error persists irrespective of the glibc version used by the guest or the container(Tried with glibc<2.32 in combination of guest OS - Fedora 30 - glibc 2.29 + container - Ubuntu 18.04 - glibc 2.27).

Amulyam24 avatar Feb 09 '21 07:02 Amulyam24

Ah, it probably stems from the build host glibc (specifically due to the static linkage). I don't have an idea about the underlying issue though.

Jakob-Naucke avatar Feb 09 '21 10:02 Jakob-Naucke

Ok, so, it turns out we did have a small recreator for this (thanks @Amulyam24). I just hadn't run it on a different host with a downlevel gcc. I can recreate this on my laptop (fedora 33) and a ppc64le system, so it at least isn't a power-only problem.

So I compiled it locally (fc33 on x86), and copied it into a fc32 container to run.

> rustup show
Default host: x86_64-unknown-linux-gnu
rustup home:  /home/christy/.rustup

stable-x86_64-unknown-linux-gnu (default)
rustc 1.50.0 (cb75ad5db 2021-02-10)

use libc;
use std::{mem,ptr};
fn main() {
    let mut slave = mem::MaybeUninit::<libc::c_int>::uninit();
    let mut master = mem::MaybeUninit::<libc::c_int>::uninit();
    let p;
    unsafe {
       p = libc::openpty(
            master.as_mut_ptr(),
            slave.as_mut_ptr(),
            ptr::null_mut(),
            ptr::null_mut(),
            ptr::null_mut()
        );
    }
    println!("p:{}",p);
}

RUSTFLAGS="-C target-feature=+crt-static" cargo build

FROM fedora:32

RUN dnf install -y glibc
RUN curl https://sh.rustup.rs -sSf | sh -s -- -y
#RUN source $HOME/.cargo/env
ENV PATH="/root/.cargo/bin:$PATH"
RUN echo $PATH
# compiled with RUSTFLAGS="-C target-feature=+crt-static"
COPY openpty/target/debug/openpty .
RUN  ./openpty

> docker build -t openpty:fc32 .
Sending build context to Docker daemon  23.53MB
Step 1/7 : FROM fedora:32
 ---> eb7f88a194d8
Step 2/7 : RUN dnf install -y glibc
 ---> Using cache
 ---> 5ea930089860
Step 3/7 : RUN curl https://sh.rustup.rs -sSf | sh -s -- -y
 ---> Using cache
 ---> 151e71ed5a86
Step 4/7 : ENV PATH="/root/.cargo/bin:$PATH"
 ---> Using cache
 ---> 1a2d67e93c69
Step 5/7 : RUN echo $PATH
 ---> Using cache
 ---> bdd9afd9e135
Step 6/7 : COPY openpty/target/debug/openpty .
 ---> 91e228a9c6be
Step 7/7 : RUN  ./openpty
 ---> Running in 5506ca1edb92
openpty: dl-call-libc-early-init.c:37: _dl_call_libc_early_init: Assertion `sym != NULL' failed.

Just to make sure:

> ldd openpty/target/debug/openpty 
        not a dynamic executable

After talking with someone on our toolchain team, she suggested I look for anything calling _dl_open.

> nm -an openpty/target/debug/openpty  | grep dl_open$
0000000000493280 T _dl_open

I'm not sure how to track that down. Hopefully some of this is helpful for some more digging at least.
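
As a rough programmatic counterpart to the `ldd` check above, the sketch below looks for a `PT_INTERP` program header, which is how an ELF file requests a dynamic loader. It is an illustration under assumptions: it only handles 64-bit little-endian ELF files (as on x86_64 and ppc64le), the function name `has_interp_segment` is made up, and the absence of `PT_INTERP` only confirms static linkage; it says nothing about whether the binary may still call into the loader code at run time.

```rust
use std::{env, fs};

const PT_INTERP: u32 = 3; // program header type for the interpreter path

/// Read `len` (<= 8) little-endian bytes at `off` as an integer.
fn read_le(b: &[u8], off: usize, len: usize) -> Option<u64> {
    let s = b.get(off..off.checked_add(len)?)?;
    Some(s.iter().rev().fold(0u64, |acc, &x| (acc << 8) | x as u64))
}

/// Some(true) if the file has a PT_INTERP segment, Some(false) if not,
/// None if it is not an ELF64 little-endian file or is truncated.
fn has_interp_segment(elf: &[u8]) -> Option<bool> {
    if !elf.starts_with(b"\x7fELF") || elf.get(4) != Some(&2) || elf.get(5) != Some(&1) {
        return None;
    }
    let phoff = read_le(elf, 32, 8)? as usize; // e_phoff
    let phentsize = read_le(elf, 54, 2)? as usize; // e_phentsize
    let phnum = read_le(elf, 56, 2)? as usize; // e_phnum
    for i in 0..phnum {
        let off = phoff.checked_add(i.checked_mul(phentsize)?)?;
        // p_type is the first 4 bytes of each program header.
        if read_le(elf, off, 4)? as u32 == PT_INTERP {
            return Some(true);
        }
    }
    Some(false)
}

fn main() {
    let path = match env::args().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("usage: pass the path of the binary to inspect");
            return;
        }
    };
    match fs::read(&path).ok().and_then(|b| has_interp_segment(&b)) {
        Some(true) => println!("{}: requests a dynamic loader (PT_INTERP present)", path),
        Some(false) => println!("{}: no PT_INTERP, looks statically linked", path),
        None => println!("{}: unreadable or not an ELF64 little-endian file", path),
    }
}
```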

clnperez avatar Mar 19 '21 19:03 clnperez

It really seems this only happens with more recent glibc. While I didn't have the time for an exact bisection, I can e.g. reproduce this on Ubuntu 20.10 (2.32) and an Ubuntu 18.04 container (2.27), but I cannot reproduce it on RHEL 8.3 (2.28) and even the most archaic containers like CentOS 6 (2.12). (I used up-to-date 1.51 from Rustup and libc 0.2.92 in both cases.)

Jakob-Naucke avatar Mar 31 '21 15:03 Jakob-Naucke

@Jakob-Naucke -- I may be misunderstanding your last comment. I thought that you'd mostly root-caused this in this comment that it would only happen with glibc 2.32 and later if built using an earlier glibc. But I'm also surprised that you could reproduce it with 2.27 in the Ubuntu 18.04 container.

clnperez avatar Apr 23 '21 16:04 clnperez

if built using an earlier glibc

@clnperez No, what I meant all along was building it on a more recent glibc than the one it runs on. But my comment was very confusing :slightly_smiling_face:

Poking around a little more, I found that this happens when building with glibc >= 2.32 and running on glibc <= 2.31. I think it's also noteworthy that when you build dynamically instead, you will get the error

/lib64/libc.so.6: version `GLIBC_2.32' not found

The error _dl_call_libc_early_init: Assertion sym != NULL failed is only observed when building statically. However, you can e.g. build on 2.31 and run on 2.28, so not every version mismatch is fatal.

Jakob-Naucke avatar Apr 29 '21 10:04 Jakob-Naucke

This is reproducible without Rust:

#include <stdio.h>
#include <stdlib.h>
#include <pty.h>

int main() {
	int* child = malloc(sizeof(int));
	int* parent = malloc(sizeof(int));
	int p = openpty(parent, child, 0, 0, 0);
	printf("p:%d\n", p);
	free(child);
	free(parent);
}

built with gcc openpty.c -lutil -static on a 2.32 system and run on 2.31:

a.out: dl-call-libc-early-init.c:37: _dl_call_libc_early_init: Assertion `sym != NULL' failed.
Aborted (core dumped)

Jakob-Naucke avatar Apr 29 '21 10:04 Jakob-Naucke

Thank you @Jakob-Naucke ! I reported this issue to glibc here: https://sourceware.org/bugzilla/show_bug.cgi?id=27790

I think it's also noteworthy that when you build dynamically instead, you will get the error /lib64/libc.so.6: version `GLIBC_2.32' not found

Running a program dynamically linked against a newer glibc on a system with an older glibc is unsupported because the older glibc can't support all the features of the new one. IMHO, the real question is about the static case, which does make use of dynamic linking via dl_open(). I prefer to collect more details before giving you an answer.

tuliom avatar Apr 29 '21 12:04 tuliom

From @tuliom in Bugzilla:

Interestingly, I can't reproduce this issue when the binary is statically linked against glibc 2.33 and executed on 2.31.

@clnperez I can reproduce this with Rust too. Building the Rust example against 2.33 and linking statically, the code runs just fine. So I think this is an issue with glibc 2.32 and no version before or after.
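
To summarize the combinations reported so far, here is a small sketch that encodes them as a predicate. It only restates the observations in this thread (static build against 2.32, run on an older glibc); it is not a statement about glibc's actual compatibility rules, and the name `static_openpty_at_risk` is made up.

```rust
/// (major, minor) glibc version.
type Version = (u32, u32);

/// True for the combinations where the assertion was observed in this thread:
/// statically linked against glibc 2.32 and run on an older glibc.
/// Hypothetical helper summarizing reported results only.
fn static_openpty_at_risk(build: Version, run: Version) -> bool {
    build == (2, 32) && run < (2, 32)
}

fn main() {
    // (build host glibc, runtime glibc) pairs mentioned in the comments above.
    let reported: [(Version, Version); 5] = [
        ((2, 32), (2, 31)),
        ((2, 32), (2, 27)),
        ((2, 31), (2, 28)),
        ((2, 28), (2, 12)),
        ((2, 33), (2, 31)),
    ];
    for (build, run) in reported.iter() {
        println!(
            "built on {}.{}, run on {}.{}: at risk = {}",
            build.0,
            build.1,
            run.0,
            run.1,
            static_openpty_at_risk(*build, *run)
        );
    }
}
```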

Jakob-Naucke avatar Apr 29 '21 15:04 Jakob-Naucke

Thanks guys. It's good to know it's not just rust.

And @Jakob-Naucke, I flipped the versions around in my comment, so that only added to the confusion here. :D It does seem to be related only to that one addition in glibc 2.32.

clnperez avatar Apr 29 '21 16:04 clnperez

Statically linked glibc binaries aren't very portable. In general, you need to run them on a system with exactly the same glibc version. (Dynamically linked binaries can run on newer glibc versions, too.) In glibc 2.33, the behavior of openpty changed due to the commit, Linux: Require properly configured /dev/pts for PTYs, so openpty is safer to use in statically linked programs.

fweimer-rh avatar Apr 29 '21 19:04 fweimer-rh

@fweimer-rh -- That seems to indicate that there shouldn't be static agent binaries for kata at all then? /cc @fidencio

clnperez avatar Apr 29 '21 21:04 clnperez

@clnperez I don't know your exact requirements. If you need static binaries for isolation from the host environment, you need to stick to a certain subset if using glibc. It so happens that openpty is in that subset only starting with glibc 2.33. There is no good way to check whether an application sticks to the subset, as every static link currently pulls in the dynamic loader (so its presence in the static binary does not tell you anything about compatibility). We're making slow progress towards improved static linking, but other things have higher priority for upstream work.

Using one of the smaller libcs instead might be a better alternative for you. The other alternative would be to inject the application along with its own dynamically-linked glibc.
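
If you go the smaller-libc route, it can help to have the binary report what it was actually built against. The sketch below uses the compile-time `target_env` and `target_feature = "crt-static"` cfg values; the function names are made up for illustration. As an assumption about your setup, building with something like `cargo build --target x86_64-unknown-linux-musl` would be one way to get a musl build.

```rust
/// Which libc family this binary was compiled for, based on `target_env`.
fn libc_flavor() -> &'static str {
    if cfg!(target_env = "musl") {
        "musl"
    } else if cfg!(target_env = "gnu") {
        "glibc"
    } else {
        "other"
    }
}

/// Whether the C runtime was linked statically (e.g. via
/// RUSTFLAGS="-C target-feature=+crt-static").
fn crt_static() -> bool {
    cfg!(target_feature = "crt-static")
}

fn main() {
    println!("libc flavor: {}, crt-static: {}", libc_flavor(), crt_static());
    if libc_flavor() == "glibc" && crt_static() {
        // Mirrors the caveat in this thread: static glibc binaries are not
        // very portable across glibc versions.
        println!("note: static glibc build; portability across glibc versions is limited");
    }
}
```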

fweimer-rh avatar Apr 30 '21 05:04 fweimer-rh

Thanks @fweimer-rh Sounds like we should close this as a known limitation then? Also happy to keep it open and test as progress towards improved static linking is made.

clnperez avatar May 03 '21 18:05 clnperez

Building it against musl libc should solve the issue. Ref: https://www.graalvm.org/22.0/reference-manual/native-image/StaticImages/#preparation

belloyang avatar Jan 26 '24 17:01 belloyang

Can this issue be closed?

This was a helpful reference for me to better understand when static linking to glibc breaks beyond the more commonly cited examples.

polarathene avatar Mar 11 '24 08:03 polarathene

Thanks for the updates, I'll close based on the above. If there is more to figure out here, feel free to create a discussion, or reopen/create a new issue if our libc needs to do something.

tgross35 avatar Aug 29 '24 05:08 tgross35