[BUG] ld.lld: error: Invalid record (Producer: 'LLVM12.0.5git' Reader: 'LLVM 12.0.5git')

Open DoDoENT opened this issue 4 years ago • 32 comments

Description

Happens when LTO is enabled for armeabi-v7a ABI with NDK r23 - the same code works correctly with NDK r22b.

I still haven't been able to create a minimal reproducible sample (I'll update this issue if I manage to do so), but @lmglmg and I observed the same issue with the armv7 slice when building the same code for iOS using Xcode 13 (it also uses an LLVM 12-based backend).

I'm posting this here in case someone has already seen this issue and found a workaround.

Environment Details

  • NDK Version: r23
  • Build system: cmake
  • Host OS: MacOS
  • ABI: armeabi-v7a
  • NDK API level: 16
  • Device API level: 16

DoDoENT avatar Sep 27 '21 12:09 DoDoENT

It appears that this has occasionally happened even with earlier versions of LLVM, but not on our codebase...

DoDoENT avatar Sep 27 '21 12:09 DoDoENT

On iOS we get Invalid record (Producer: 'APPLE_1_1300.0.29.3_0' Reader: 'LLVM APPLE_1_1300.0.29.3_0') for architecture armv7.

My current hunch is that it's somehow related to combining -fenable-matrix and -flto, as all our other projects that don't use clang's built-in matrices work correctly. Disabling -flto makes the build work; however, disabling -fenable-matrix is not so simple in our codebase (we are looking into it while also working on a minimal reproducible sample).

Note that arm64 works correctly.

DoDoENT avatar Sep 27 '21 14:09 DoDoENT

I haven't heard of that error before. Maybe @pirama-arumuga-nainar has?

Given that it's a toolchain crash that was introduced in r23, it'd also be worth trying your code with the latest canary once we actually move it to a new toolchain. Leaving this open as a reminder to notify once we've done that (otherwise we'll close until you can share a repro case, but that's just for tracking reasons; it doesn't mean we won't fix it).

DanAlbert avatar Sep 27 '21 21:09 DanAlbert

This can happen when LLVM IR (produced during LTO) from a newer clang is passed to an older linker. If this may be the case, I'd suggest bisecting/reducing the set of files that have LTO enabled to find the problematic library.

Beyond this suggestion, we would need a repro to investigate.
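The bisection suggested above could be sketched roughly as follows. This is a minimal Python sketch under strong assumptions: exactly one file triggers the failure, and the failure reproduces whenever that file is in the LTO-enabled set. The names find_single_culprit and link_fails are hypothetical; link_fails stands in for whatever rebuilds the project with LTO enabled only for the given files and reports whether the link step failed.

```python
from typing import Callable, List, Optional, Sequence


def find_single_culprit(
    files: Sequence[str],
    link_fails: Callable[[List[str]], bool],
) -> Optional[str]:
    """Halve the LTO-enabled set until one file remains.

    link_fails(subset) is a hypothetical callback: it should rebuild with
    LTO enabled only for `subset` and return True if the link step fails.
    Returns None when the failure does not reproduce with all files.
    """
    candidates = list(files)
    if not candidates or not link_fails(candidates):
        return None
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Keep whichever half still reproduces the failure.
        candidates = half if link_fails(half) else candidates[len(half):]
    return candidates[0]
```

If the failure depends on a combination of files rather than a single one, this simple halving will not isolate it, and a more general reduction strategy would be needed.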

pirama-arumuga-nainar avatar Sep 27 '21 21:09 pirama-arumuga-nainar

This can happen when LLVM IR (produced during LTO) from a newer clang is passed to an older linker. If this may be the case, I'd suggest bisecting/reducing the set of files that have LTO enabled to find the problematic library.

Wouldn't that, in that case, be a different error, i.e. one showing different versions for the producer and the reader? Anyway, in our case, we build all our code from source using the same compiler, so this is definitely not the cause for us.

Yesterday I narrowed it down to two source files that use matrix multiplications assisted by clang's native matrix support (-fenable-matrix): after I remove those two files from the project, the linker succeeds. However, if I simply set -fno-lto on those two files but keep compiling them, I get the same linker error 🤷‍♂️. The linker succeeds only when LTO is completely disabled on this project (but not on the libraries that it gets linked to).

Unfortunately, as I said earlier, we have the same problem with Xcode 13, but disabling LTO doesn't help there: it just turns the linker error into a compiler error on a single source file. Getting Invalid record (Producer: 'APPLE_1_1300.0.29.3_0' Reader: 'LLVM APPLE_1_1300.0.29.3_0') for architecture armv7 as a compile error has never happened to us before. However, by analyzing this source file we may find out what triggers the bug and, hopefully, make a minimal repro case. Fortunately, this time both NDK r23 and Apple used the same or a similar commit from upstream LLVM, with slightly different behaviour, which may or may not help us track down this bug. Stay tuned!

DoDoENT avatar Sep 28 '21 07:09 DoDoENT

The problem is with __builtin_matrix_column_major_load and __builtin_matrix_column_major_store when used in template contexts - the errors vary from ICEs (in debug builds with NDK r22) to the above error in release build with r23. We're unable to produce a minimal repro as of yet...

psiha avatar Sep 28 '21 15:09 psiha

It also happens for x86. So, it appears that it may be a 32-bit-specific issue, not armv7-specific.

DoDoENT avatar Sep 28 '21 15:09 DoDoENT

I don't think the presubmit builds are downloadable by non googlers, but https://android-review.googlesource.com/c/platform/ndk/+/1839724 updates to a newer LLVM. I've only lightly tested it but all our tests build fine and run correctly on KitKat (just what I have on hand atm; it's going to be a week or so before I have all my usual devices available and I didn't want to delay getting you a build to test).

Shortly after that's submitted a build will start on https://ci.android.com/builds/branches/aosp-master-ndk/grid that you can try. If that fixes the problem then you can save yourself the effort of reducing a repro case. If it doesn't, we're probably picking up one more compiler update before we ship r24, so we'll want to try that too.

(bonus points: it sounds like if we do confirm that the new build fixes the problem you'll probably get the iOS fix in the next version of xcode too)

DanAlbert avatar Sep 29 '21 07:09 DanAlbert

I don't think the presubmit builds are downloadable by non googlers, but https://android-review.googlesource.com/c/platform/ndk/+/1839724 updates to a newer LLVM.

If you can smuggle a macOS (or even Linux) build to me, I could test it with our codebase and give you feedback. Also, emscripten 2.0.27 no longer has an ICE with that code (it's based on LLVM 14), so this may be a good indication that the newer LLVM no longer has the problem. I tested that yesterday while applying @psiha's workaround in our codebase. I had to check all combinations: NDK r23 debug/release for all ABIs (this is how I discovered that the problem is also present for x86) and Emscripten 2.0.27 debug/release (we still haven't updated to a later version of emscripten but, according to their changelog, they haven't touched the compiler since 2.0.27).

DoDoENT avatar Sep 29 '21 07:09 DoDoENT

If you can smuggle a macOS (or even Linux) build to me, I could test it with our codebase and give you feedback

That'd be paragraph 2 :) It'll have builds for macos/linux/windows after it's submitted: https://ci.android.com/builds/branches/aosp-master-ndk/grid (all our canary builds, if you haven't seen that before: https://android.googlesource.com/platform/ndk/+/master/docs/ContinuousBuilds.md)

emscripten 2.0.27 no longer has ICE with that code (it's based on LLVM 14), so this may be a good indication that the newer LLVM no longer has the problem

The build I just picked up is LLVM 13, but the build we'll actually ship for r24 will be LLVM 14 I believe. That does sound promising. I think we're only a few weeks away from having the LLVM 14 based toolchain in the canary build (but @stephenhines or @pirama-arumuga-nainar will know better).

DanAlbert avatar Sep 29 '21 08:09 DanAlbert

https://ci.android.com/builds/branches/aosp-master-ndk/grid?head=7778403&tail=7778403 is the build that has the newer LLVM. lmk if it fixes the problem. If not, like I said, we have another update coming in the near future you can also try.

DanAlbert avatar Sep 29 '21 18:09 DanAlbert

Unfortunately, it also happens with the canary r24 build. Except that now the error is ld.lld: error: Invalid record (Producer: 'LLVM13.0.2git' Reader: 'LLVM 13.0.2git')

Let me know when you have the build based on LLVM 14 so I can try again.

DoDoENT avatar Sep 30 '21 13:09 DoDoENT

Will do. Thanks for checking.

DanAlbert avatar Sep 30 '21 18:09 DanAlbert

One other thing to try, which can potentially isolate this problem: For the two sources which cause problems with -fenable-matrix, can you compile them with -save-temps? That'd force the write + read back of the bitcode in compilation itself instead of during LTO.

pirama-arumuga-nainar avatar Sep 30 '21 18:09 pirama-arumuga-nainar

Sorry for the late answer - it was a busy week.

When I add the -save-temps flag, compilation of those source files fails with very peculiar errors (redefinitions of various things). For example:

In file included from /Users/dodo/.conan/data/AndroidNdk/r23/microblink/stable/package/743cf0321be3152777da4d05247a66d1552e70a2/toolchains/llvm/prebuilt/darwin-x86_64/sysroot/usr/include/c++/v1/cstdint:143:
In file included from /Users/dodo/.conan/data/AndroidNdk/r23/microblink/stable/package/743cf0321be3152777da4d05247a66d1552e70a2/toolchains/llvm/prebuilt/darwin-x86_64/sysroot/usr/include/c++/v1/__config:218:
In file included from /Users/dodo/.conan/data/AndroidNdk/r23/microblink/stable/package/743cf0321be3152777da4d05247a66d1552e70a2/toolchains/llvm/prebuilt/darwin-x86_64/sysroot/usr/include/features.h:36:
In file included from /Users/dodo/.conan/data/AndroidNdk/r23/microblink/stable/package/743cf0321be3152777da4d05247a66d1552e70a2/toolchains/llvm/prebuilt/darwin-x86_64/sysroot/usr/include/sys/cdefs.h:371:
In file included from /Users/dodo/.conan/data/AndroidNdk/r23/microblink/stable/package/743cf0321be3152777da4d05247a66d1552e70a2/toolchains/llvm/prebuilt/darwin-x86_64/sysroot/usr/include/android/api-level.h:179:
/Users/dodo/.conan/data/AndroidNdk/r23/microblink/stable/package/743cf0321be3152777da4d05247a66d1552e70a2/toolchains/llvm/prebuilt/darwin-x86_64/sysroot/usr/include/bits/get_device_api_level_inlines.h:41:19: error: redefinition of 'android_get_device_api_level'
static inline int android_get_device_api_level() {
                  ^
/Users/dodo/.conan/data/AndroidNdk/r23/microblink/stable/package/743cf0321be3152777da4d05247a66d1552e70a2/toolchains/llvm/prebuilt/darwin-x86_64/sysroot/usr/include/bits/get_device_api_level_inlines.h:41:42: note: previous definition is here
__BIONIC_GET_DEVICE_API_LEVEL_INLINE int android_get_device_api_level() {

I get similar errors for all different definitions that come both from PCH and direct includes.

After I disable building with PCH, I get the same linker error as if not using that flag:

ld.lld: error: Invalid record (Producer: 'LLVM12.0.5git' Reader: 'LLVM 12.0.5git')

I've even tried adding the -save-temps flag to all my source files - still no difference.

DoDoENT avatar Oct 08 '21 16:10 DoDoENT

https://ci.android.com/builds/branches/aosp-master-ndk/grid?head=7924883&tail=7924883 (still building) has LLVM 14ish in it. Give that a shot?

DanAlbert avatar Nov 17 '21 20:11 DanAlbert

Hi @DanAlbert, sorry for the late answer. I've been trying to reproduce this again with NDK r23 (just to make sure that no workarounds are in place before testing with the latest r24 beta 2), but it no longer crashes since we moved our build system to ThinLTO (actually a mix of ThinLTO and LTO objects, because of the workaround needed for #1601).

I'll close that for now as even I can no longer reproduce it 🤷 .

The weirdest thing is that the very file that uses __builtin_matrix_column_* builtins is built with -flto as a workaround for #1601, but it links with objects built with -flto=thin and the issue is now gone. What a weird world we live in...

DoDoENT avatar Dec 17 '21 11:12 DoDoENT

No worries, thanks for the update.

DanAlbert avatar Jan 04 '22 21:01 DanAlbert

This is STILL an issue.

/usr/lib/clc/gfx1100-amdgcn-mesa-mesa3d.bc': Invalid attribute group entry (Producer: 'LLVM19.0.0git' Reader: 'LLVM 17.0.6')

Can't build OpenCL as a result for use on RX7900XTX.

Radeon RX 7900 XTX (radeonsi, navi31, LLVM 17.0.6, DRM 3.54, 6.6.15-amd64)
OpenCL 1.1 Mesa 23.3.5-1

That's part of the clinfo output, and actually everything looks fine except for this. I've tried compiling this twice, making sure there's no trace of 19.0.0. It does, however, build the gfx1100 file when I compile libclc, but for reasons I don't understand everything is symlinked to tahiti, which may work in theory, but in reality OpenCL wants that file specifically.

AnonymousRonin avatar May 17 '24 06:05 AnonymousRonin

Can you upload a repro case? Nothing about that has changed. We can't fix bugs that we can't see.

DanAlbert avatar May 17 '24 17:05 DanAlbert

What exactly are you looking for? The issue seems to be a conflict between how the file is generated and how it's being read, i.e. the file is created while compiling LLVM 17 but is then read by clang 19, if I understand that correctly. If so, I'm curious why that would be an issue. I'm attempting to compile version 18.1.1 of both from source, making sure that both versions match, to see what happens.

AnonymousRonin avatar May 18 '24 18:05 AnonymousRonin

/usr/lib/clc/gfx1100-amdgcn-mesa-mesa3d.bc': Invalid attribute group entry (Producer: 'LLVM19.0.0git' Reader: 'LLVM 17.0.6')

It looks like gfx1100-amdgcn-mesa-mesa3d.bc is generated by llvm 19.

Can't build openCL as a result for use on RX7900XTX. Radeon RX 7900 XTX (radeonsi, navi31, LLVM 17.0.6, DRM 3.54, 6.6.15-amd64)

The OpenCL driver seems to be using an old LLVM version.

The error message doesn't indicate any connection to Android. You may want to look for a resolution in other forums.
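The producer/reader comparison made in this thread can be automated when triaging build logs. The sketch below is illustrative only: the helper names are made up, and the regular expression is based solely on the message formats quoted in this issue (where the producer is printed as 'LLVM12.0.5git' and the reader as 'LLVM 12.0.5git'); other toolchains may format the message differently.

```python
import re
from typing import Optional, Tuple

# Matches the "(Producer: '...' Reader: '...')" suffix seen in this thread.
_PRODUCER_READER = re.compile(
    r"\(Producer: '(?P<producer>[^']*)' Reader: '(?P<reader>[^']*)'\)"
)


def _normalize(tag: str) -> str:
    # Drop a leading "LLVM" plus optional whitespace so that
    # 'LLVM12.0.5git' and 'LLVM 12.0.5git' compare equal.
    return re.sub(r"^LLVM\s*", "", tag.strip())


def producer_reader(message: str) -> Optional[Tuple[str, str]]:
    """Return (producer, reader) version strings, or None if absent."""
    match = _PRODUCER_READER.search(message)
    if match is None:
        return None
    return _normalize(match["producer"]), _normalize(match["reader"])


def versions_differ(message: str) -> Optional[bool]:
    """True if producer and reader differ, None if the message has neither."""
    pair = producer_reader(message)
    if pair is None:
        return None
    return pair[0] != pair[1]
```

Applied to the messages above, the original NDK r23 and Xcode 13 errors report matching versions, while the Mesa error reports a genuine mismatch (19.0.0git produced, 17.0.6 reading).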

pirama-arumuga-nainar avatar May 20 '24 17:05 pirama-arumuga-nainar

I'm currently testing r28 ~beta2~ RC2 aka beta3 in vcpkg. Port hdf5 gave this error (x64, static library linkage, static CRT linkage):

: && /vcpkg/android-ndk-r28-beta3/toolchains/llvm/prebuilt/linux-x86_64/bin/clang --target=x86_64-none-linux-android21 --sysroot=/vcpkg/android-ndk-r28-beta3/toolchains/llvm/prebuilt/linux-x86_64/sysroot -std=c99 -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables -fstack-protector-strong -no-canonical-prefixes -D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security  -fPIC   -static -fno-limit-debug-info -static-libstdc++ -Wl,--build-id=sha1 -Wl,--no-rosegment -Wl,--no-undefined-version -Wl,--fatal-warnings -Wl,--no-undefined -Qunused-arguments tools/src/h5stat/CMakeFiles/h5stat.dir/h5stat.c.o -o bin/h5stat  bin/libhdf5_tools_debug.a  bin/libhdf5_debug.a  -lm  -ldl  /mnt/vcpkg-ci/installed/x64-android/debug/lib/libz.a  -ldl  /mnt/vcpkg-ci/installed/x64-android/debug/lib/libz.a  -latomic -lm && :
ld.lld: error: /vcpkg/android-ndk-r28-beta3/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/x86_64-linux-android/libdl.a(libdl_static.o): Invalid record
clang: error: linker command failed with exit code 1 (use -v to see invocation)
ninja: build stopped: subcommand failed.

Same error for arm64 but no error for arm-neon.

https://dev.azure.com/vcpkg/public/_build/results?buildId=111618&view=results

dg0yt avatar Jan 19 '25 09:01 dg0yt

Unsure if this is the same bug and/or related, but I'm seeing the same build error after upgrading from r25 to r28. This is occurring when building "Debug" for "arm64-v8a".

cmd.exe /C "cd . && C:\Users\<user>\AppData\Local\Android\Sdk\ndk\28.0.13004108\toolchains\llvm\prebuilt\windows-x86_64\bin\clang++.exe --target=aarch64-none-linux-android21 --sysroot=C:/Users/<user>/AppData/Local/Android/Sdk/ndk/28.0.13004108/toolchains/llvm/prebuilt/windows-x86_64/sysroot -fPIC -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables -fstack-protector-strong -no-canonical-prefixes -D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security  -std=c++17 -fexceptions -Wall -Wextra -Werror -Wno-error=unused-command-line-argument  -Wformat -Wformat-security -Werror=format-security -fstack-protector-all -D_FORTIFY_SOURCE=2 -std=c++17 -fexceptions -Wall -Wextra -Werror -Wno-error=unused-command-line-argument -Wformat -Wformat-security -Werror=format-security -fstack-protector-all -D_FORTIFY_SOURCE=2 -O2 -mspeculative-load-hardening -fvisibility=hidden -fno-limit-debug-info  -fcommon -Wl,-z,relro -Wl,-z,now -Wl,-z,noexecstack -shared -o  && cd ."
  ld.lld: error: C:/Users/<user>/AppData/Local/Android/Sdk/ndk/28.0.13004108/toolchains/llvm/prebuilt/windows-x86_64/sysroot/usr/lib/aarch64-linux-android/libz.a(adler32.o): Invalid record
  clang++: error: linker command failed with exit code 1 (use -v to see invocation)
  ninja: build stopped: subcommand failed.

SahilAshar avatar Feb 07 '25 20:02 SahilAshar

Hmm, libz.a and libdl.a in the NDK seem to be built with LTO and have LLVM IR archives. That's definitely a mistake, but likely broken for a few releases - ever since the soong build system defaulted to ThinLTO. We should disable LTO for static libraries that ship in the NDK.

That in itself shouldn't be a problem - as long as the linker and compiler are built at the same version of LLVM, the linker should be able to read the LTO-enabled libraries. I think the problem here is that the NDK sysroot was generated by a newer clang in AOSP than that shipped in r28.
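One way to check whether a static library such as libz.a really contains LLVM bitcode members is to look at the first bytes of each archive member. The following is a minimal Python sketch, assuming a regular GNU-style ar archive (thin archives and BSD-style long names are not handled); the function names are made up for illustration and are not part of the NDK or LLVM.

```python
AR_MAGIC = b"!<arch>\n"              # global header of a regular ar archive
BITCODE_MAGIC = b"BC\xc0\xde"        # raw LLVM bitcode
WRAPPER_MAGIC = b"\xde\xc0\x17\x0b"  # bitcode wrapper header (0x0B17C0DE)
HEADER_SIZE = 60                     # fixed size of each member header


def is_llvm_bitcode(data: bytes) -> bool:
    """True if the member starts with an LLVM bitcode magic number."""
    return data[:4] in (BITCODE_MAGIC, WRAPPER_MAGIC)


def bitcode_members(archive: bytes) -> list:
    """Return the names of archive members that look like LLVM bitcode."""
    if not archive.startswith(AR_MAGIC):
        raise ValueError("not a regular ar archive")
    found = []
    long_names = b""
    pos = len(AR_MAGIC)
    while pos + HEADER_SIZE <= len(archive):
        header = archive[pos:pos + HEADER_SIZE]
        if header[58:60] != b"`\n":
            raise ValueError("corrupt member header at offset %d" % pos)
        raw_name = header[0:16].decode("ascii", "replace").rstrip()
        size = int(header[48:58].decode("ascii").strip())
        start = pos + HEADER_SIZE
        data = archive[start:start + size]
        pos = start + size + (size & 1)  # members are 2-byte aligned
        if raw_name == "//":             # GNU long-name table
            long_names = data
            continue
        if raw_name in ("/", "/SYM64/"):  # symbol tables
            continue
        if raw_name.startswith("/") and raw_name[1:].isdigit():
            # "/<offset>" refers to an entry in the long-name table.
            offset = int(raw_name[1:])
            end = long_names.find(b"/\n", offset)
            if end == -1:
                end = len(long_names)
            name = long_names[offset:end].decode("ascii", "replace")
        else:
            name = raw_name.rstrip("/")
        if is_llvm_bitcode(data):
            found.append(name)
    return found
```

Reading an archive's bytes from disk and passing them to bitcode_members would list the members that a linker has to treat as LTO input; an empty list suggests the archive holds only native object files.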

pirama-arumuga-nainar avatar Feb 07 '25 21:02 pirama-arumuga-nainar

Looking at the NDK sysroots, only libdl.a, libz.a and libm.a are from outside the Toolchain. libm already has LTO off. We should do so for the other two libraries as well.

pirama-arumuga-nainar avatar Feb 07 '25 21:02 pirama-arumuga-nainar

I hadn't spotted that @dg0yt's error was from libdl.a. @dg0yt: whatever that build is doing, it's probably wrong. Android's libdl.a does basically nothing. It exists pretty much only to appease autoconf (there are some "does this function/library exist?" checks that for some reason check only the static library).

Yes though, both ought to be fixed. It looks like libm explicitly disables LTO, as does libc (which is why this doesn't show up in our tests, we don't check that every static library works, only libc). The blame says it's because LTO doesn't work with ifuncs rather than being anything to do with the NDK. It'd probably be best to find a way to globally disable LTO for the NDK config in soong.

This should be safe for r28b as long as I'm quick about it. Sysroot updates are pretty risky if there are big changes to the sysroot, but r28 had one quite recently before release so it's probably safe. idk how long the change will take, but I'll get started on it next week. If the diff in the sysroot is too risky for r28b, the fix will unfortunately have to wait until r29.

DanAlbert avatar Feb 07 '25 21:02 DanAlbert

find a way to globally disable LTO for the NDK config in soong.

DISABLE_LTO=TRUE should do it: https://cs.android.com/android/platform/superproject/main/+/main:build/soong/cc/lto.go;l=73;drc=6f01658ba8876f6e6e27f77f1c37eb0406f9a3e5. The whole platform build probably no longer builds without LTO but my guess is the subset that's built for the NDK config should be fine.

pirama-arumuga-nainar avatar Feb 07 '25 21:02 pirama-arumuga-nainar

Yeah, the only real code from that build that will ever run is in these static libraries. The rest is just headers and stubs.

DanAlbert avatar Feb 07 '25 22:02 DanAlbert

Android's libdl.a does basically nothing. It exists pretty much only to appease autoconf (there are some "does this function/library exist?" checks that for some reason check only the static library).

Is -ldl evil, too? I don't have many vcpkg ports installed right now, but

$ grep '[-l]dl\|libdl[.]a' installed/arm64-android/lib/pkgconfig/*
installed/arm64-android/lib/pkgconfig/gdal.pc:CONFIG_INST_LIBS="-L${prefix}/lib" -lgdal "-L${prefix}/lib" -lgeotiff -lspatialite -lgeos_c -lgeos -L/home/kpa/SPECIAL/Programme/android-ndk-r27c/toolchains/llvm/prebuilt/linux-x86_64/lib -lfreexl -lexpat -lminizip -latomic -lxml2 -liconv -lproj -lsqlite3 -ltiff -lwebp -lsharpyuv -lcpufeatures-webp -llzma -ljpeg -lzstd -lLerc -ldeflate -lcurl -lcares -lssl -lssh2 -lcrypto -ldl -lz -ljson-c -pthread -lc++ -lm
installed/arm64-android/lib/pkgconfig/hdf5.pc:Libs: "-L${libdir}" -lhdf5 -lm -ldl
installed/arm64-android/lib/pkgconfig/libcrypto.pc:Libs: "-L${libdir}" -lcrypto -ldl -pthread
installed/arm64-android/lib/pkgconfig/proj.pc:Libs: "-L${libdir}" -lproj -pthread -lc++ -lm -ldl
installed/arm64-android/lib/pkgconfig/sdl3.pc:Libs: "-L${libdir}" -lSDL3 -lm -lOpenSLES -ldl -llog -landroid -lGLESv1_CM -lGLESv2
installed/arm64-android/lib/pkgconfig/spatialite.pc:Libs: "-L${libdir}" -lspatialite "-L${prefix}/lib/pkgconfig/../../lib" -lgeos_c -lgeos -lxml2 -lproj -lc++ -ltiff -ldeflate -ljpeg -lLerc -lc++ -llzma -lzstd -pthread -pthread -lwebp -lsharpyuv -lcurl -L/home/kpa/SPECIAL/Programme/android-ndk-r27c/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android/21 -lcares -lssl -lssh2 -lcrypto -lsqlite3 -pthread -ldl -lfreexl -liconv -lexpat -lm -lminizip -lz -latomic -lm "-L${prefix}/lib/pkgconfig/../../lib" -lxml2 -pthread -liconv -lm -llzma -pthread -lz "-L${prefix}/lib/pkgconfig/../../lib" -lsqlite3 -pthread -ldl -lm
installed/arm64-android/lib/pkgconfig/sqlite3.pc:Libs: "-L${libdir}" -lsqlite3 -pthread -ldl

This includes libcrypto from OpenSSL. (This includes a few other bad things, please ignore that.)

CMake's CMAKE_DL_LIBS also says dl, and a find_library on dl (e.g. in pkg_check_modules) may return <full_path>/libdl.a.

dg0yt avatar Feb 08 '25 05:02 dg0yt