[BUG] Awfully slow compilation of a source file with a big global array on 32-bit targets
Description
Clang++ compilation of one specific source file in my project is terribly slow for 32-bit architectures.
On my Core i7-3770K, compiling the aarch64 object file takes 13 seconds, but compiling the armv7a object file takes 32 minutes, roughly 150 times slower.
I am using -Oz optimization, but the results are the same with -O2.
The problem seems to happen because of a big globally-initialized array, which has around 390 entries.
If I edit the file src/table/settings.h and replace the array const SettingDesc _settings[] with an empty array, compilation takes 2 seconds on all targets. If I reduce the array to 100 items, compilation for the armv7a target drops to 24 seconds.
To reproduce: unpack slow-compile.zip and launch the script slow-compile.sh; it will report compilation times using the time command.
I have included the source file and all the headers it uses in the archive.
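For context, the array in question looks roughly like this (a simplified sketch with illustrative field and entry names, not the actual SettingDesc definition):

#include <cstddef>  // size_t
#include <cstdint>  // int32_t

struct SettingDesc {
    const char *name;       // string literal
    const void *def;        // default value, cast through size_t
    int32_t min, max;       // range limits
    bool (*proc)(int32_t);  // pointer to a global callback function
};

extern bool OnChange(int32_t);

static const SettingDesc _settings[] = {
    {"example.setting", (const void *)(size_t)500, 0, 5000, OnChange},
    // ... roughly 390 more entries like this ...
};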
Environment Details
- NDK Version: 22.0.7026061
- Build system: Standalone toolchain
- Host OS: Linux Debian 10
- ABI: armv7a-linux-androideabi16, i686-linux-android16
- NDK API level: 16
- Device API level: 29
It's not surprising that compile times are higher for large static arrays. In my experience there's usually std::vector or std::string involved, and removing the STL dependencies speeds things up. That doesn't seem to be the case here.
Looking at Clang's -ftime-trace and -ftime-report output, most of the time is spent in instruction selection:
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
390.5861 ( 99.6%) 0.0393 ( 11.7%) 390.6254 ( 99.5%) 390.6746 ( 99.5%) ARM Instruction Selection
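(Both reports come from standard Clang flags; the source file name below is an assumption:
clang++ --target=armv7a-linux-androideabi16 -Oz -c settings.cpp -ftime-report
clang++ --target=armv7a-linux-androideabi16 -Oz -c settings.cpp -ftime-trace
-ftime-report prints the pass timing table above to stderr, and -ftime-trace writes a Chrome-trace JSON file next to the object file.)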
On my workstation, I see the following times:
aarch64-linux-android21 0m5.695s
x86_64-linux-android21 0m44.086s
armv7a-linux-androideabi16 6m22.421s
i686-linux-android16 3m40.977s
Without reading too much into the patterns here, I'll file a bug upstream to look at the bottlenecks in the ARM backend.
Thanks.
There are no dynamically-allocated types like std::string in the _settings[] array; it contains only function pointers and pointers to global variables.
I reproduced the behavior with NDK 25 and Android API 21. (The example uses API 16 for arm32, which doesn't compile with more recent versions of the NDK unless the API level is increased.)
Some things I have discovered:
- Only happens when optimizing. Remove -Oz and it compiles quickly.
- If I always set the SettingDescBase::max field to 0, compilation is much faster, and the speedup happens across architectures:
#define NSD_GENERAL(name, def, cmd, guiflags, min, max, interval, many, str, strhelp, strval, proc, load, cat)\
{name, (const void*)(size_t)(def), cmd, guiflags, min, 0, interval, many, str, strhelp, strval, proc, load, cat}
With SettingDescBase::max set to 0:
clang++ --target=aarch64-linux-android21
real 0m10.778s
user 0m10.509s
sys 0m0.268s
clang++ --target=x86_64-linux-android21
real 0m7.659s
user 0m7.399s
sys 0m0.260s
clang++ --target=armv7a-linux-androideabi16 -mthumb
real 0m33.410s
user 0m33.215s
sys 0m0.188s
clang++ --target=i686-linux-android16 -mstackrealign
real 0m11.222s
user 0m10.648s
sys 0m0.360s
Default behavior, preserving the value of max (note that I killed clang++ on the 32-bit builds after ~4 mins):
clang++ --target=aarch64-linux-android21
real 0m17.448s
user 0m17.221s
sys 0m0.224s
clang++ --target=x86_64-linux-android21
real 0m52.613s
user 0m52.368s
sys 0m0.236s
clang++ --target=armv7a-linux-androideabi16 -mthumb
Terminated
real 3m59.151s
user 3m58.961s
sys 0m0.136s
clang++ --target=i686-linux-android16 -mstackrealign
Terminated
real 3m43.530s
user 3m43.386s
sys 0m0.104s
Some more things I've discovered:
Clang turns global variable initialization into a function __cxx_global_var_init. In the IR, that function is marked "please optimize me":
; Function Attrs: minsize nounwind optsize
define internal fastcc void @__cxx_global_var_init.15() unnamed_addr #2 section ".text.startup" !dbg !7902 {
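(A sketch of how to dump this IR yourself; the file name is assumed:
clang++ --target=armv7a-linux-androideabi16 -Oz -g -S -emit-llvm settings.cpp -o settings.ll
The -g is only needed to get the !dbg metadata shown in these snippets.)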
For a typical function you could turn off optimization with __attribute__((optnone)), but that doesn't work here: the front-end doesn't accept optnone in the context of declaring a global variable:
warning: 'optnone' attribute only applies to functions and Objective-C methods [-Wignored-attributes]
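A minimal sketch of that limitation (hypothetical names; the warning is Clang's actual diagnostic):

struct Desc { int min, max; };

static int InitValue() { return 0; }

// Accepted: optnone applies to functions.
__attribute__((optnone)) static void SlowPath() {}

// Ignored: the attribute does not apply to variables, so it cannot be
// used to exempt the generated __cxx_global_var_init from optimization.
// warning: 'optnone' attribute only applies to functions and
// Objective-C methods [-Wignored-attributes]
__attribute__((optnone)) static const Desc _descs[] = {
    {0, InitValue()},  // non-constant initializer -> dynamic init
};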
If I look at the IR, the array is initialized with a mixture of stores, like this:
store i32 0, i32* getelementptr inbounds ([171 x %struct.SettingDesc], [171 x %struct.SettingDesc]* @_ZL9_settings, i32 0, i32 0, i32 0, i32 5), align 4, !dbg !7903, !tbaa !7913
and a few memsets like this:
call void @llvm.memset.p0i8.i64(i8* nonnull align 2 dereferenceable(43) bitcast (i16* getelementptr inbounds ([171 x %struct.SettingDesc], [171 x %struct.SettingDesc]* @_ZL9_settings, i32 0, i32 1, i32 0, i32 3) to i8*), i8 0, i64 43, i1 false), !dbg !7932
In the original code, there are 15 memsets. When forcing max to be 0, that jumps to 111, presumably because zeroing max creates longer runs of zero bytes that can be merged into memset calls, leaving far fewer scalar stores for instruction selection to process.
Our current hypothesis is that something in const SettingDesc _settings[] can't be evaluated as constexpr, and so a special __cxx_global_var_init function is being called to initialize the array.
That being said, the compiler is still doing the wrong thing here.
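As a minimal illustration of the hypothesis (simplified types and hypothetical names): addresses of globals and function pointers alone are constant-initializable, but a single non-constant element forces a dynamic initializer, i.e. a __cxx_global_var_init, for the whole array:

#include <cstddef>

extern int _some_global;
static bool SomeProc(int) { return true; }
static int Runtime() { return 42; }

struct Desc { const void *def; bool (*proc)(int); };

// Constant initialization: no __cxx_global_var_init is emitted,
// the array goes straight into the data section.
static const Desc fast[] = {{&_some_global, SomeProc}};

// Dynamic initialization: the non-constexpr call Runtime() anywhere in
// the initializer list drags the entire array into __cxx_global_var_init.
static const Desc slow[] = {
    {&_some_global, SomeProc},
    {(const void *)(std::size_t)Runtime(), SomeProc},
};

Compiling this with -S -emit-llvm shows fast initialized as plain data while slow gets an internal __cxx_global_var_init function full of stores, like the IR above.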
Upstream bug: https://github.com/llvm/llvm-project/issues/55798