static-ffmpeg

librav1e produces segfault

Open tatref opened this issue 1 year ago • 26 comments

Trying to encode a file to AV1 with librav1e results in a segfault.

podman run -it --rm -v "$PWD:$PWD" -w "$PWD" docker.io/mwader/static-ffmpeg -v debug -i PXL_20240630_122849440.mp4 -c:v librav1e 'PXL_20240630_122849440.av1.mp4'; echo $?

There is no output, but dmesg shows:

librav1e[2506]: segfault at 0 ip 00007f7e713a5b48 sp 00007f7e6b3ac0f0 error 6 in ffmpeg[7f7e6d895000+5b11000] likely on CPU 3 (core 3, socket 0)

Copying the ffmpeg bin from container to host results in the same error.
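
For reference, the dmesg line above already carries a few hints. A minimal sketch, assuming the standard x86 page-fault error-code bits (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode). The helper names decode_fault and fault_offset are made up for illustration:

```shell
#!/bin/sh
# Decode the x86 page-fault error code the kernel prints as "error N".
decode_fault() {
    code=$1
    if [ $(( code & 1 )) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    if [ $(( code & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $(( code & 4 )) -ne 0 ]; then mode="user-mode"; else mode="kernel-mode"; fi
    echo "$mode $access, $cause"
}

# Offset of the faulting instruction inside the mapping: ip - mapping start.
# Both arguments are hex strings without the 0x prefix, as printed by dmesg.
fault_offset() {
    printf '0x%x\n' $(( 0x$1 - 0x$2 ))
}

decode_fault 6                               # → user-mode write, page not present
fault_offset 00007f7e713a5b48 7f7e6d895000   # → 0x3b10b48
```

Read this way, "segfault at 0 ... error 6" is a user-mode write to address 0, most likely a write through a null pointer. The offset is relative to the start of the mapping the kernel printed in brackets, which is not necessarily the ELF load base, so it is only a starting point for something like addr2line on an unstripped build.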

tatref avatar Jul 05 '24 17:07 tatref

Hey, interesting. The first step, I think, would be to see if one can reproduce the problem with some ffmpeg linked with glibc. If so, I would guess it's an ffmpeg or librav1e bug somehow; if not, we have to figure out what difference musl etc. makes.

Are you able to share PXL_20240630_122849440.mp4, some small cuts of it or some other video that reproduces the problem?

Btw, does -v trace give any hints about what is going on before it crashes?

wader avatar Jul 05 '24 21:07 wader

I have tested multiple input files (hevc, x264, and xvid), and all of them produce a crash. Encoding to x264 is OK. So I think the issue is with librav1e.

-v trace doesn't produce anything useful.

Running with gdb gives:

Thread 9 "enc0:0:librav1e" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 204808]
0x00007ffff2ba5b48 in ?? ()
(gdb) bt
#0  0x00007ffff2ba5b48 in ?? ()
#1  0x00007ffff2b44cf9 in ?? ()
#2  0x00007ffff2b43f83 in ?? ()
#3  0x00007ffff2b45c20 in ?? ()
#4  0x00007ffff2a9f612 in ?? ()
#5  0x00007ffff2a9baa0 in ?? ()
#6  0x00007ffff2a89c5b in rav1e_send_frame ()
#7  0x00007fffef9a9f4d in ?? ()
#8  0x00007fffef84c7b4 in ?? ()
#9  0x00007fffef84ca34 in avcodec_send_frame ()
#10 0x00007fffef1b60cc in ?? ()
#11 0x00007fffef1b6b81 in encoder_thread ()
#12 0x00007fffef1d1ff9 in ?? ()
#13 0x00007ffff0e9bd75 in ?? ()
#14 0x0000000000000000 in ?? ()

Do you know if it is possible to compile with debug symbols? (not sure if it can be useful)

tatref avatar Jul 06 '24 01:07 tatref

OK, could you try with Alpine's own ffmpeg, which also has librav1e?

docker run --rm -ti -v "$PWD:$PWD" -w "$PWD" alpine:edge sh -c 'apk add ffmpeg && ffmpeg -i PXL_20240630_122849440.mp4 -c:v librav1e -t 0.1s PXL_20240630_122849440.av1.mp4'

And also some glibc-based distro like Debian?

docker run --rm -ti -v "$PWD:$PWD" -w "$PWD" debian:sid sh -c 'apt-get update && apt-get install ffmpeg && ffmpeg -i PXL_20240630_122849440.mp4 -c:v librav1e -t 0.1s PXL_20240630_122849440.av1.mp4'

About debug symbols: yes, it is possible. Remove --disable-debug and maybe prefix the ffmpeg configure invocation with debug CFLAGS, as in CFLAGS="-O0 -ggdb" ./configure .... etc

wader avatar Jul 06 '24 05:07 wader

The Alpine and Debian containers work fine.

I tried to recompile with the debug flags, but I don't get any more information.

tatref avatar Jul 06 '24 15:07 tatref

OK then. Next I would try without librsvg, which is also Rust; there was some issue with duplicate symbols.

I did not manage to reproduce it locally with some files of my own. Are you able to share a file that triggers this? It would make it a lot easier to help.

Weird about the debug symbols, there must be something more to it then 🤔

wader avatar Jul 06 '24 16:07 wader

Maybe also try with just librav1e. Could also compare how alpine does things https://git.alpinelinux.org/aports/tree/community/rav1e/APKBUILD?h=3.19-stable

wader avatar Jul 06 '24 17:07 wader

Thanks for the help!

I'll do some more testing tomorrow

Also I found an old post of yours about a similar issue https://github.com/lu-zero/cargo-c/issues/98

tatref avatar Jul 06 '24 22:07 tatref

Thanks for the help!

No problem!

I'll do some more testing tomorrow

👍 A tip is to try to minimize the Dockerfile as much as possible first and then start digging more into the details. That way there will be fewer unrelated moving parts, and it will be much faster to iterate and try things. But again, if you have a test file I can use, that would be great.

Also I found an old post of yours about a similar issue lu-zero/cargo-c#98

I think that was about Rust itself crashing at build time?

wader avatar Jul 06 '24 22:07 wader

You can download the example file I used here: https://photos.app.goo.gl/WVZ7D6giYhYFmbs36

However I face the issue with multiple files, so I suppose it makes no difference.

Also, the crash happens right at the beginning, so it's pretty easy to reproduce.

I think that was about Rust itself crashing at build time?

My bad, I did a quick search on "cargo cbuild" and "segfault". I didn't notice at first that you were the author of the issue, that's funny!

tatref avatar Jul 06 '24 22:07 tatref

You can download the example file I used here: https://photos.app.goo.gl/WVZ7D6giYhYFmbs36

However I face the issue with multiple files, so I suppose it makes no difference.

Also, the crash happens right at the beginning, so it's pretty easy to reproduce.

Thanks. Weirdly, it seems to work fine for me on a MacBook M3 (arm64). What CPU are you using? Could it be that librav1e or ffmpeg ends up using some instruction that is not available (feature detection done at build time on the build host, etc.)? But then it would usually crash with SIGILL, hmm.

$ docker run -it --rm -v "$PWD:$PWD" -w "$PWD" docker.io/mwader/static-ffmpeg:latest -v debug -i PXL_20240630_122849440.mp4 -c:v librav1e 'PXL_20240630_122849440.av1.mp4'; echo $?
...
0
$ ffprobe -hide_banner -i PXL_20240630_122849440.av1.mp4
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'PXL_20240630_122849440.av1.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomav01iso2mp41
    encoder         : Lavf61.1.100
  Duration: 00:00:09.78, start: 0.000000, bitrate: 11581 kb/s
  Stream #0:0[0x1](und): Video: av1 (libdav1d) (Main) (av01 / 0x31307661), yuv420p(tv, smpte170m/bt470bg/bt709, progressive), 1920x1080 [SAR 1:1 DAR 16:9], 11495 kb/s, 30 fps, 30 tbr, 15360 tbn (default)
      Metadata:
        handler_name    : ISO Media file produced by Google Inc.
        vendor_id       : [0][0][0][0]
        encoder         : Lavc61.3.100 librav1e
  Stream #0:1[0x2](eng): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 127 kb/s (default)
      Metadata:
        handler_name    : ISO Media file produced by Google Inc.
        vendor_id       : [0][0][0][0]

I think that was about Rust itself crashing at build time?

My bad, I did a quick search on "cargo cbuild" and "segfault". I didn't notice at first that you were the author of the issue, that's funny!

😄

wader avatar Jul 06 '24 23:07 wader

A little more details on my setup: CPU is x86_64, I'm running Debian 12 and Oracle Linux 9 (Redhat 9) under Virtualbox

tatref avatar Jul 07 '24 20:07 tatref

Hmm, interesting. Could it be that the VM lacks support for SSE instructions etc.? But looking at the code https://github.com/xiph/rav1e/blob/e34e772e47b01169b6f75a4589c056624ea886a4/src/cpu_features/x86.rs#L20 it seems like it does runtime detection, hmm. Maybe you can check the VM settings? If that does not help, I think we need to get a proper debug build and inspect things with gdb.
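
One quick way to compare what the guest and the host CPU expose is to diff their feature flags. A rough sketch, assuming a Linux guest where the flags come from /proc/cpuinfo; the function name missing_features and the list of features to check are only examples, not something rav1e is known to require:

```shell
#!/bin/sh
# Print every wanted feature that is absent from a space-separated flags list.
# $1 = flags string, remaining arguments = features to look for.
missing_features() {
    flags=" $1 "
    shift
    for feat in "$@"; do
        case "$flags" in
            *" $feat "*) ;;          # feature is present
            *) echo "$feat" ;;       # feature is missing
        esac
    done
}

# On Linux the real flags could be read with something like:
#   flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
# A made-up flags string is used here so the sketch runs anywhere.
missing_features "fpu sse sse2 ssse3 sse4_1 avx" sse2 ssse3 sse4_1 avx2   # → avx2
```

Running the same check inside the VM and on the host would show whether the hypervisor hides any instruction set extension from the guest.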

wader avatar Jul 07 '24 22:07 wader

Hey, did you get anywhere with this?

wader avatar Jul 12 '24 13:07 wader

Sorry for the delay

I managed to reproduce the issue on bare metal. The CPU is AMD Ryzen 7 5800X

It would be great if someone else could reproduce it on different hardware.

tatref avatar Jul 16 '24 22:07 tatref

OK, that is strange. If you have time, it would be great to try to minimize the Dockerfile. Maybe something like: remove everything except building rav1e and ffmpeg, verify it still crashes, and after that maybe try changing the build to be more like Alpine's: https://git.alpinelinux.org/aports/tree/community/rav1e/APKBUILD?h=3.19-stable#n34 ... I see that they use some newer cargo-c stuff. I'm not sure if RUSTFLAGS="-C target-feature=+crt-static" is more or less the same as --library-type staticlib. As for the "fixes static linking flags" thing, I recognise the -lgcc_eh issue but I'm not sure about -lssp_nonshared and -lc. I would probably try doing it the Alpine way and patch the pkgconfig file.

wader avatar Jul 26 '24 15:07 wader

Btw, it might be worth looking through the rav1e issues to see if something looks similar. Things like: https://github.com/xiph/rav1e/issues?q=Ryzen https://github.com/xiph/rav1e/issues?q=illegal https://github.com/xiph/rav1e/issues?q=segfault

wader avatar Jul 29 '24 14:07 wader

One suggestion in the issues is to try with --no-default-features, so maybe try changing cargo cinstall --release to cargo cinstall --release --no-default-features? Not a fix, but it might give some clue.

wader avatar Jul 29 '24 14:07 wader

I did some testing today

rav1e works fine if I only enable x264 and rav1e. I kept all the compilation flags as they are.

I'm still not sure why it crashes with the full Dockerfile.

tatref avatar Jul 29 '24 23:07 tatref

That is very interesting! Could you try re-adding librsvg and see if it starts to crash again? That is my main suspect: that statically linking two Rust-based libraries causes some symbol conflict/mixing that is bad... But if so, why it would only affect a certain type of CPU is a bit of a mystery. But I've seen weirder things :)
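
A first check for this kind of conflict could be to list the global symbols each static archive defines and look at the overlap. A rough sketch, assuming one symbol name per line in each input file; the helper common_symbols, the archive paths in the comments, and the symbol names in the demo are all made up for illustration:

```shell
#!/bin/sh
# Print the symbol names present in both files (one name per line in each).
common_symbols() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    LC_ALL=C sort -u "$1" > "$tmp_a"
    LC_ALL=C sort -u "$2" > "$tmp_b"
    LC_ALL=C comm -12 "$tmp_a" "$tmp_b"   # keep only lines common to both
    rm -f "$tmp_a" "$tmp_b"
}

# With GNU binutils the lists could come from the archives themselves, e.g.
# (hypothetical paths):
#   nm -g --defined-only /usr/local/lib/librav1e.a | awk 'NF == 3 { print $3 }' > rav1e.syms
#   nm -g --defined-only /usr/local/lib/librsvg-2.a | awk 'NF == 3 { print $3 }' > rsvg.syms

# Made-up symbol lists so the sketch runs without either library:
dir=$(mktemp -d)
printf '%s\n' rav1e_send_frame rust_eh_personality __rust_alloc > "$dir/rav1e.syms"
printf '%s\n' rsvg_handle_new __rust_alloc rust_eh_personality > "$dir/rsvg.syms"
common_symbols "$dir/rav1e.syms" "$dir/rsvg.syms"
rm -rf "$dir"
```

An overlap would not prove anything by itself, but it would show which names the linker has to pick a single definition for when both libraries end up in the same static binary.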

wader avatar Jul 30 '24 09:07 wader

I tried to add a rav1e sanity test and the CI job segfaulted in a similar way https://github.com/wader/static-ffmpeg/pull/490 🤔

wader avatar Jul 30 '24 11:07 wader

I stripped the Dockerfile of everything except glib, harfbuzz, cairo, pango, librsvg, fdk_aac, x264, and rav1e. This reproduces the issue!

tatref avatar Jul 30 '24 18:07 tatref

Hey, yep! I also managed to reproduce it myself now on my old Intel MacBook, and it only seems to happen when linking with both rav1e and librsvg. The stack trace suggests it crashes inside the Rust rayon crate, somewhere here https://github.com/rayon-rs/rayon/blob/main/rayon-core/src/registry.rs#L329-L338 ... My guess is that the crash is somehow related to two Rust runtimes being linked together (librsvg and rav1e both use rayon but in different versions, so I think it should be fine, but I'm not sure). But it's still weird that it works on arm64; maybe for some reason symbols resolve differently and it happens to work?

Some progress at least! Will do more digging tomorrow or so.

wader avatar Jul 30 '24 19:07 wader

Update: I tried to recreate the setup with two staticlib Rust crates that use rayon with the same dependencies, plus a C program that static-pie links them, but there was no crash on either arm64 or amd64. I'll keep digging from time to time.

BTW for your use case would using libsvtav1 be an option?

wader avatar Aug 03 '24 07:08 wader

In the end, I used a different image with a non-static ffmpeg that works for me.

I tried to recompile rav1e with rayon 1.0, it works, but ffmpeg still crashes.

Could it be related to symbol mangling? I suppose different versions should have different names for proper linking?

tatref avatar Aug 06 '24 14:08 tatref

In the end, I used a different image with a non-static ffmpeg that works for me.

👍

I tried to recompile rav1e with rayon 1.0, it works, but ffmpeg still crashes.

The rav1e CLI tool works but not ffmpeg?

Could it be related to symbol mangling? I suppose different versions should have different names for proper linking?

Yep, I'm not sure what is going on, but I suspect there is some issue with mismatched Rust runtime symbols etc., e.g. that the runtime is compiled a little bit differently between the libs and then gets mixed up. But it's a bit of a mystery why arm64 seems to work but not amd64... maybe just by chance.

wader avatar Aug 06 '24 14:08 wader

Yes rav1e works. The workflow is a bit different, because input files have to be in y4m format, but it works fine.

tatref avatar Aug 06 '24 15:08 tatref