HELP WANTED: Clang toolchain compile with -fopenmp causes .so size to increase by nearly 400K
As described in the title, the latest NDK r17 makes Clang the default toolchain. With Clang, -fopenmp increases shared library size by as much as nearly 400KB, whereas with GCC in previous versions -fopenmp only added about 30KB. What's the difference between GCC and Clang here?

@pirama-arumuga-nainar Any suggestions?
I am able to reproduce this with a simple case.
int g[1024];

int main() {
    int i;
#pragma omp parallel
    for (i = 0; i < 1024; i++)
        g[i] = i * i;
    return 0;
}
With OpenMP, executable from GCC is 40K, while the one from Clang is 380K. This seemed to have been worse with r16, where the Clang generated binary was ~800K.
There are far more functions in the text section and symbol table with Clang than with GCC. I'll see why this is and whether we can trim it down. If it helps in the meantime, adding '-Wl,--exclude-libs,libomp.a' reduces the binary to 350K. The savings are mostly from a smaller symbol table; the text section is still big.
@pirama-arumuga-nainar As GCC will be removed in r18, this bug is causing us a lot of trouble.
Using -ffunction-sections when building the libomp runtime and passing -Wl,--gc-sections while linking (which, according to @DanAlbert, is passed by ndk-build and CMake by default) helps shave a further 100k from my test. With this change, the clang-build with openmp is ~250K, which is still higher than the 40K from gcc.
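For anyone wanting to try these flags in their own build, a hypothetical Android.mk fragment combining the suggestions so far might look like this (ndk-build reportedly passes -Wl,--gc-sections by default, so it is shown only for explicitness):

```
# Hypothetical Android.mk fragment -- flag names are from this thread.
LOCAL_CFLAGS  += -fopenmp -ffunction-sections -fdata-sections
LOCAL_LDFLAGS += -fopenmp -Wl,--gc-sections -Wl,--exclude-libs,libomp.a
```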
This is all I can think of from a black box perspective. I'll kick off an email to openmp-dev asking for their opinion.
I've attached libomp.a.zip for arm64 built with -ffunction-sections for experimentation. Just drop it into your NDK installation.
https://android-review.googlesource.com/c/toolchain/llvm_android/+/719087 passes -ffunction-sections when building the runtimes. This should be a part of r19, or whenever the NDK Clang gets updated past the one in r18.
@pirama-arumuga-nainar Is there any progress?
@cocodark did this help?
I've attached libomp.a.zip for arm64 built with -ffunction-sections for experimentation. Just drop it into your NDK installation.
I have to ask upstream OpenMP developers about this - and for that I need to recreate my experiment with a newer gcc. I'll do that this week.
@pirama-arumuga-nainar After replacing libomp.a, Clang still adds ~250K, which is still inexplicable and has forced me to revert to r14 with GCC. So I hope it will be resolved before NDK r18 is released.
I hope it will be resolved before NDK r18 is released.
If we find out that we're just building something wrong then that's possible, but if it's something that will require changes upstream then unfortunately that won't happen. It takes quite a bit of time to get changes made, merged to LLVM, pulled back to Android, tested, and then finally released.
Has anyone looked to see if the same size issues are present with prior NDKs? We've had openmp support for Clang for over a year but this bug was only opened about two weeks ago.
As far as I know this has not been resolved upstream, so moving to r20.
@DanAlbert Thank you
Still no upstream progress.
Still no changes upstream afaik.
I don't suppose the original problem was that the GCC build used for comparison was using a shared libomp whereas Clang was using a static one?
@DanAlbert what's the upstream bug link?
@pirama-arumuga-nainar might know. I've just been asking him iirc.
There's no upstream bug. We'd need steps to reproduce before reporting in upstream - so if we can reproduce for any Linux target, that'd be best. It'd also help us understand if this is an issue with how we're building for Android.
Isn't https://github.com/android/ndk/issues/742#issuecomment-405695546 a repro case? I'll check rq to see if this is still a problem on r21. I sort of wonder if the problem was actually just that gcc was using a shared openmp and we didn't have that for Clang until r21...
(no, we never had a shared omp runtime for gcc)
Huh? IIRC, when adding openmp runtimes for Clang, we added static openmp for compatibility with gcc.
Sorry, I meant we never had a shared omp runtime. Caffeine hasn't made it to my blood stream yet.
I do still see a fairly significant increase in size when using openmp:
For the test case above: without -fopenmp: 8.0K; with -fopenmp: 280K.
Our openmp runtime doesn't seem to link properly on my debian machine. Pirama is looking...
The 8K number was slightly misleading because the foo.cpp from earlier turned out to be a no-op for gcc/libgomp.
I looked at it briefly and filed an upstream bug. Here's the content from there for cross reference:
This was originally reported a while ago in the Android NDK bug tracker as https://github.com/android/ndk/issues/742.
Statically-linking libomp.a for a simple OpenMP hello world program, https://computing.llnl.gov/tutorials/openMP/samples/C/omp_hello.c, produces a 468K binary (after strip). gcc + libgomp.a, in comparison, produces a binary of size 128K.
I want a sanity check if this is expected behavior or if the runtime can be better organized for static linking.
Steps to reproduce:
Build OpenMP statically with:
-DLIBOMP_ENABLE_SHARED=OFF -DCMAKE_BUILD_TYPE=Release
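A full configure step for the runtime might look like this (the openmp/ source path and out-of-tree layout are assumptions):

```
# Hypothetical out-of-tree build of the LLVM OpenMP runtime.
cmake ../openmp \
    -DLIBOMP_ENABLE_SHARED=OFF \
    -DCMAKE_BUILD_TYPE=Release
cmake --build .
```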
$ du -sh a.out
548K
$ strip a.out && du -sh a.out
468K
-
gold produces a 464K binary.
-
Adding -fvisibility=hidden -ffunction-sections -fdata-sections to C, CXX flags, and -Wl,--gc-sections, the unstripped binary is 2.2M but the stripped binary is 344K.
-
Bloaty [https://github.com/google/bloaty]:
75.9% 414Ki 71.2% 338Ki [1339 Others]
5.5% 30.3Ki 6.4% 30.3Ki [section .rodata]
2.8% 15.1Ki 3.2% 15.0Ki __kmp_aux_affinity_initialize()
1.9% 10.6Ki 2.2% 10.6Ki __kmp_stg_parse_affinity(char const*, char const*, void*)
0.0% 39 2.0% 9.65Ki __kmp_sighldrs
1.5% 8.35Ki 1.8% 8.35Ki [section .text]
1.5% 8.03Ki 1.7% 7.99Ki __kmp_fork_call
1.4% 7.89Ki 1.6% 7.78Ki __kmp_affinity_create_cpuinfo_map(AddrUnsPair**, int*, kmp_i18n_id*, _IO_FILE*)
1.4% 7.68Ki 1.6% 7.58Ki __kmp_affinity_create_x2apicid_map(AddrUnsPair**, kmp_i18n_id*)
0.9% 5.18Ki 1.1% 5.12Ki __kmp_stg_parse_omp_schedule(char const*, char const*, void*)
0.9% 4.97Ki 1.0% 4.93Ki __kmp_allocate_team
0.8% 4.52Ki 0.9% 4.43Ki __kmp_affinity_create_apicid_map(AddrUnsPair**, kmp_i18n_id*)
0.8% 4.36Ki 0.0% 0 [Unmapped]
0.0% 52 0.8% 4.00Ki __kmp_threadprivate_d_table
0.7% 3.98Ki 0.8% 3.82Ki void __kmp_dispatch_init_algorithm<long long>(ident*, int, dispatch_private_info_template<long long>*, sched_type, long long, long long, traits_t<long long>::signed_t, traits_t<long long>::signed_t, long long, long long)
0.7% 3.96Ki 0.8% 3.92Ki __kmp_balanced_affinity
0.7% 3.77Ki 0.8% 3.71Ki __kmp_stg_parse_schedule(char const*, char const*, void*)
0.7% 3.56Ki 0.0% 0 [section .symtab]
0.6% 3.42Ki 0.7% 3.37Ki __kmp_aux_capture_affinity
0.6% 3.23Ki 0.7% 3.16Ki __kmp_partition_places(kmp_team*, int)
0.6% 3.06Ki 0.6% 2.89Ki int __kmp_dispatch_next_algorithm<unsigned long long>(int, dispatch_private_info_template<unsigned long long>*, dispatch_shared_info_template<unsigned long long> volatile*, int*, unsigned long long*, unsigned long long*, traits_t<unsigned long long>::signed_t*, unsigned long long, unsigned long long)
100.0% 546Ki 100.0% 474Ki TOTAL