ndk
ndk copied to clipboard
[BUG] Exceptions in arm64-v8a are slower than in armeabi-v7a
Description
Exceptions on arm64-v8a are much slower than on armeabi-v7a. Profiling shows that most of the work is done in dl_iterate_phdr callback. Severity of this effect is proportional to the number of objects (and, possibly, their size) loaded by dynamic linker. Attached archive contains a very synthetic example that creates a hundred libraries and measures time of exception handling in an executable that depends on all of them. I received following results when running this example on the phone:
sizeof(void*) = 4
Caught 10000 exceptions in 879 ms
sizeof(void*) = 8
Caught 10000 exceptions in 7201 ms
Please set NDK and HOST_TAG environment variables before running the test.
Exceptions in a native code run from an actual Java application for Android is about ten times slower in arm64 version of the library than in armv7.
Environment Details
- NDK Version: 20.0.5594570
- Build system: bash script
- Host OS: Mac
- ABI: arm64-v8a
- NDK API level:21
- Device API level:28
@cferris1000
I haven't benchmarked it, but it makes sense that non-arm32 unwinding would scale poorly with the number of libraries, relative to arm32. I think the unwinder needs to find the unwind info for a PC (PT_GNU_EH_FRAME or PT_ARM_EXIDX segment?), and it needs to query the dynamic linker for that info.
-
On arm32, there is a special-purpose function (
__gnu_Unwind_Find_exidxin libc that delegates todl_unwind_find_exidxin libdl). This function takes a PC value and returns the location of the matching PT_ARM_EXIDX segment. The linker still just does a linear scan of all libraries, but it's in one function,find_containing_library:for (soinfo* si = solist_get_head(); si != nullptr; si = si->next) { if (address < si->base || address - si->base >= si->size) { continue; } ... snipped -- verify that the address matches a PT_LOAD segment ... }Once the DSO is found, the location of the EXIDX is returned from the soinfo struct:
*pcount = si->ARM_exidx_count; return reinterpret_cast<_Unwind_Ptr>(si->ARM_exidx); -
There is no similar function for arm64, so I think the unwinder (in libgcc) instead uses
dl_iterate_phdr.dl_iterate_phdrreports each library usingdl_phdr_info, which doesn't report a start/end address. For every loaded library, until we find the matching library:- the loader calls the callback function pointer
- the callback scans the PHDRs looking for a matching PT_LOAD
- libgcc does a bit of work that could be skipped (e.g. recording the locations of other segments, verifying that the dl_phdr_info's size is at least sizeof(dl_phdr_info), checking the unused cache)
libgcc uses an 8-element LRU cache for this PC->eh_frame lookup. (It moves matches to the front of a linked list.) However, the cache is disabled on Android because it relies on extra glibc fields in the dl_phdr_info struct to invalidate the cache after a library is loaded or unloaded. If exceptions are routinely thrown through more than 8 libraries, I would guess the cache is effectively disabled.
Possible fixes (for new versions of Android):
- Add the extra
dl_phdr_infofields,unsigned long long dlpi_addsandunsigned long long dlpi_subs. The fields are documented as incrementing when a library is loaded or unloaded. The fields should enable the libgcc caching. - Add a new linker API like
dl_unwind_find_exidxbut for archs other than arm32.
I wondered if the unwinder could use dladdr instead, but dladdr scans dynsym for a matching symbol, so it's also slow.
Internal issues for the new Android unwinder -- http://b/65682711, http://b/66918884.
If exceptions are routinely thrown through more than 8 libraries, I would guess the cache is effectively disabled.
I suspect this overstates it -- a call stack might go through more than 8 libraries, but if there are multiple call stack entries in a single DSO, I'd guess the cache is still useful.
I uploaded a WIP patch adding the extra dl_iterate_phdr fields, https://android-review.googlesource.com/c/platform/bionic/+/1104597. It improves the arm64 run-time on the extest from ~7000-7400ms to ~940ms.
Am I right in assuming that this patch fixes the issue, but is applied to Android platform itself and not to the NDK, so this behavior will only be fixed in Android 11?
While this itself is great, I think that there is a way to somewhat alleviate this issue for the users today by applying a patch to gcc unwinder.
I tried to rebuild armv8 version of libc++_shared.so with llvm/libunwind (which is used in armeabi-v7a version by default) and discovered that it is about 5 times faster in the exception-heavy places than libgcc unwinder version. Further debugging showed that llvm/libunwind dl_iterate_phdr callback contains an if statement, that rejects most of the loaded objects in the beginning of this callback here.
I do not see such behavior in libgcc _Unwind_IteratePhdrCallback function. Do you think it's possible to port such optimization to libgcc in NDK and if it is where should I submit a patch?
Am I right in assuming that this patch fixes the issue, but is applied to Android platform itself and not to the NDK, so this behavior will only be fixed in Android 11?
Yeah, the patch I submitted would only help with Android 11, so it'd be nice to do something for existing OSs.
I do not see such behavior in libgcc _Unwind_IteratePhdrCallback function. Do you think it's possible to port such optimization to libgcc in NDK and if it is where should I submit a patch?
It looks possible. I think I can upload a patch for it.
I uploaded a libgcc CL to https://android-review.googlesource.com/c/toolchain/gcc/+/1130238.
This CL didn't seem to help by 5x, and it made the EH performance much more variable. e.g. Instead of 6300-6700ms, I see results from 2500-5200ms. I suspect the variability comes from the varying addresses of loaded libraries. The optimization might fire on all the loaded libraries, or none of them.
I haven't tried using llvm/libunwind with arm64.
I tried LLVM's unwinder, using the version in the NDK that's already used for arm32. On extest, I saw runtimes varying between 3700-6800ms. It doesn't have a cache at all, so it doesn't benefit from the new adds/subs fields in new versions of Android. I didn't see a 5x improvement. The upstream version doesn't have a cache either.
I wonder if I can add a cache where, on a hit, we scan dl_iterate_phdr very quickly looking for the hit.
I wonder if I can add a cache where, on a hit, we scan dl_iterate_phdr very quickly looking for the hit.
It seems to work. I see reliable extest run-times of ~1500ms on Q and about ~900ms on master. https://android-review.googlesource.com/c/toolchain/gcc/+/1133815/1
It seems to work. I see reliable extest run-times of ~1500ms on Q and about ~900ms on master. https://android-review.googlesource.com/c/toolchain/gcc/+/1133815/1
We tried to build our application with libc++_shared.so patched with these changes and also got significant performance improvement in exception-heavy places. But I see that code review is more than 9 months old and still open. Is there a chance that these changes will be included in the official NDK build in the future?
The first release that it could be shipped in (r23) is also the release where we plan to stop using that unwinder, so getting that change submitted won't buy us anything. I'll triage this to r23 though, so if we end up not finishing that migration we can revisit it.
Spoke with @rprichard to clarify and the LLVM unwinder will need a similar improvement. LLVM's unwinder should be quite fast on the newer Android releases that include the libdl improvements mentioned above, but it'll need the same caching improvements to improve performance on older devices.