ndk [BUG] Exceptions in arm64-v8a are slower than in armeabi-v7a

trafficstars

Description

Exceptions on arm64-v8a are much slower than on armeabi-v7a. Profiling shows that most of the work is done in dl_iterate_phdr callback. Severity of this effect is proportional to the number of objects (and, possibly, their size) loaded by dynamic linker. Attached archive contains a very synthetic example that creates a hundred libraries and measures time of exception handling in an executable that depends on all of them. I received following results when running this example on the phone:

sizeof(void*) = 4
Caught 10000 exceptions in 879 ms
sizeof(void*) = 8
Caught 10000 exceptions in 7201 ms

Please set NDK and HOST_TAG environment variables before running the test.

Exceptions in a native code run from an actual Java application for Android is about ten times slower in arm64 version of the library than in armv7.

extest.tar.gz

Environment Details

NDK Version: 20.0.5594570
Build system: bash script
Host OS: Mac
ABI: arm64-v8a
NDK API level:21
Device API level:28

Aug 16 '19 15:08 Trivias

@cferris1000

I haven't benchmarked it, but it makes sense that non-arm32 unwinding would scale poorly with the number of libraries, relative to arm32. I think the unwinder needs to find the unwind info for a PC (PT_GNU_EH_FRAME or PT_ARM_EXIDX segment?), and it needs to query the dynamic linker for that info.

On arm32, there is a special-purpose function (__gnu_Unwind_Find_exidx in libc that delegates to dl_unwind_find_exidx in libdl). This function takes a PC value and returns the location of the matching PT_ARM_EXIDX segment. The linker still just does a linear scan of all libraries, but it's in one function, find_containing_library:
```
for (soinfo* si = solist_get_head(); si != nullptr; si = si->next) {
  if (address < si->base || address - si->base >= si->size) {
    continue;
  }
  ... snipped -- verify that the address matches a PT_LOAD segment ...
}
```
Once the DSO is found, the location of the EXIDX is returned from the soinfo struct:
```
 *pcount = si->ARM_exidx_count;
 return reinterpret_cast<_Unwind_Ptr>(si->ARM_exidx);
```
There is no similar function for arm64, so I think the unwinder (in libgcc) instead uses dl_iterate_phdr. dl_iterate_phdr reports each library using dl_phdr_info, which doesn't report a start/end address. For every loaded library, until we find the matching library:
- the loader calls the callback function pointer
- the callback scans the PHDRs looking for a matching PT_LOAD
- libgcc does a bit of work that could be skipped (e.g. recording the locations of other segments, verifying that the dl_phdr_info's size is at least sizeof(dl_phdr_info), checking the unused cache)

libgcc uses an 8-element LRU cache for this PC->eh_frame lookup. (It moves matches to the front of a linked list.) However, the cache is disabled on Android because it relies on extra glibc fields in the dl_phdr_info struct to invalidate the cache after a library is loaded or unloaded. If exceptions are routinely thrown through more than 8 libraries, I would guess the cache is effectively disabled.

Possible fixes (for new versions of Android):

Add the extra dl_phdr_info fields, unsigned long long dlpi_adds and unsigned long long dlpi_subs. The fields are documented as incrementing when a library is loaded or unloaded. The fields should enable the libgcc caching.
Add a new linker API like dl_unwind_find_exidx but for archs other than arm32.

I wondered if the unwinder could use dladdr instead, but dladdr scans dynsym for a matching symbol, so it's also slow.

Aug 16 '19 22:08 rprichard

Internal issues for the new Android unwinder -- http://b/65682711, http://b/66918884.

Aug 16 '19 22:08 rprichard

If exceptions are routinely thrown through more than 8 libraries, I would guess the cache is effectively disabled.

I suspect this overstates it -- a call stack might go through more than 8 libraries, but if there are multiple call stack entries in a single DSO, I'd guess the cache is still useful.

I uploaded a WIP patch adding the extra dl_iterate_phdr fields, https://android-review.googlesource.com/c/platform/bionic/+/1104597. It improves the arm64 run-time on the extest from ~7000-7400ms to ~940ms.

Aug 17 '19 01:08 rprichard

Am I right in assuming that this patch fixes the issue, but is applied to Android platform itself and not to the NDK, so this behavior will only be fixed in Android 11?

While this itself is great, I think that there is a way to somewhat alleviate this issue for the users today by applying a patch to gcc unwinder.

I tried to rebuild armv8 version of libc++_shared.so with llvm/libunwind (which is used in armeabi-v7a version by default) and discovered that it is about 5 times faster in the exception-heavy places than libgcc unwinder version. Further debugging showed that llvm/libunwind dl_iterate_phdr callback contains an if statement, that rejects most of the loaded objects in the beginning of this callback here.

I do not see such behavior in libgcc _Unwind_IteratePhdrCallback function. Do you think it's possible to port such optimization to libgcc in NDK and if it is where should I submit a patch?

Sep 30 '19 02:09 Trivias

Am I right in assuming that this patch fixes the issue, but is applied to Android platform itself and not to the NDK, so this behavior will only be fixed in Android 11?

Yeah, the patch I submitted would only help with Android 11, so it'd be nice to do something for existing OSs.

I do not see such behavior in libgcc _Unwind_IteratePhdrCallback function. Do you think it's possible to port such optimization to libgcc in NDK and if it is where should I submit a patch?

It looks possible. I think I can upload a patch for it.

Oct 01 '19 00:10 rprichard

I uploaded a libgcc CL to https://android-review.googlesource.com/c/toolchain/gcc/+/1130238.

This CL didn't seem to help by 5x, and it made the EH performance much more variable. e.g. Instead of 6300-6700ms, I see results from 2500-5200ms. I suspect the variability comes from the varying addresses of loaded libraries. The optimization might fire on all the loaded libraries, or none of them.

I haven't tried using llvm/libunwind with arm64.

Oct 01 '19 05:10 rprichard

I tried LLVM's unwinder, using the version in the NDK that's already used for arm32. On extest, I saw runtimes varying between 3700-6800ms. It doesn't have a cache at all, so it doesn't benefit from the new adds/subs fields in new versions of Android. I didn't see a 5x improvement. The upstream version doesn't have a cache either.

I wonder if I can add a cache where, on a hit, we scan dl_iterate_phdr very quickly looking for the hit.

Oct 03 '19 01:10 rprichard

I wonder if I can add a cache where, on a hit, we scan dl_iterate_phdr very quickly looking for the hit.

It seems to work. I see reliable extest run-times of ~1500ms on Q and about ~900ms on master. https://android-review.googlesource.com/c/toolchain/gcc/+/1133815/1

Oct 04 '19 06:10 rprichard

It seems to work. I see reliable extest run-times of ~1500ms on Q and about ~900ms on master. https://android-review.googlesource.com/c/toolchain/gcc/+/1133815/1

We tried to build our application with libc++_shared.so patched with these changes and also got significant performance improvement in exception-heavy places. But I see that code review is more than 9 months old and still open. Is there a chance that these changes will be included in the official NDK build in the future?

Jul 08 '20 13:07 Trivias

The first release that it could be shipped in (r23) is also the release where we plan to stop using that unwinder, so getting that change submitted won't buy us anything. I'll triage this to r23 though, so if we end up not finishing that migration we can revisit it.

Jul 08 '20 19:07 DanAlbert

Spoke with @rprichard to clarify and the LLVM unwinder will need a similar improvement. LLVM's unwinder should be quite fast on the newer Android releases that include the libdl improvements mentioned above, but it'll need the same caching improvements to improve performance on older devices.

Jul 09 '20 21:07 DanAlbert

ndk ndk copied to clipboard

[BUG] Exceptions in arm64-v8a are slower than in armeabi-v7a

Description

Environment Details

ndk
ndk copied to clipboard