
[BUG] Awfully slow compilation of a source file with a big global array on 32-bit targets

Open pelya opened this issue 4 years ago • 7 comments


Description

Clang++ compilation of one specific source file in my project is terribly slow for 32-bit architectures. On my Core i7-3770K, compiling the aarch64 object file takes 13 seconds, but compiling the armv7a object file takes 32 minutes, roughly 150 times slower. I am using -Oz optimization, but the results are the same with -O2.

The problem seems to be caused by a big globally-initialized array with around 390 entries. If I edit src/table/settings.h and replace the const SettingDesc _settings[] array with an empty one, compilation takes 2 seconds on all targets. If I reduce the array to 100 items, armv7a compilation drops to 24 seconds.
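The shape of the array can be sketched with a reduced, hypothetical example (names, types, and sizes below are illustrative, not the actual OpenTTD definitions):

```cpp
#include <cstddef>

// Hypothetical reduction of the pattern in src/table/settings.h:
// a struct mixing string literals, casted integers, and function
// pointers, initialized in one large global array.
struct Desc {
    const char *name;
    const void *def;    // default value smuggled through a cast
    int min;
    int max;            // zeroing this field later proved significant
    bool (*proc)(int);  // callback, as in the real SettingDesc
};

static bool NoOpCallback(int) { return true; }

// The real array has ~390 entries; each one looks roughly like this.
static const Desc _settings[] = {
    {"setting0", (const void *)(size_t)0, 0, 100, NoOpCallback},
    {"setting1", (const void *)(size_t)1, 0, 200, NoOpCallback},
    // ... ~390 entries in the real source ...
};
```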

To reproduce: unpack slow-compile.zip and run slow-compile.sh; it reports compilation times using the time command. The archive includes the source file and all the headers it uses.

slow-compile.zip

Environment Details

  • NDK Version: 22.0.7026061
  • Build system: Standalone toolchain
  • Host OS: Linux Debian 10
  • ABI: armv7a-linux-androideabi16, i686-linux-android16
  • NDK API level: 16
  • Device API level: 29

pelya avatar Jan 01 '21 02:01 pelya

It's not surprising that compile times are higher for large static arrays. In my experience there's usually std::vector or std::string involved, and removing the STL dependencies speeds things up, but that doesn't seem to be the case here.

Looking at Clang's -ftime-trace and -ftime-report output, most of the time is spent in instruction selection:

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---                                  
  390.5861 ( 99.6%)   0.0393 ( 11.7%)  390.6254 ( 99.5%)  390.6746 ( 99.5%)  ARM Instruction Selection                  

On my workstation, I see the following times:

aarch64-linux-android21     0m5.695s
x86_64-linux-android21      0m44.086s
armv7a-linux-androideabi16  6m22.421s
i686-linux-android16        3m40.977s

Without reading too much into the patterns here, I'll file a bug upstream to look at the bottlenecks in the ARM backend.

pirama-arumuga-nainar avatar Jan 06 '21 18:01 pirama-arumuga-nainar

Thanks. There are no dynamically-allocated types like std::string in the _settings[] array; it contains only function pointers and pointers to global variables.

pelya avatar Jan 06 '21 22:01 pelya

I reproduced the behavior with NDK 25 and Android API 21. (The example uses API 16 for arm32, and doesn't compile with more recent versions of the NDK unless the API level is increased.)

jfgoog avatar May 24 '22 00:05 jfgoog

Some things I have discovered:

  • Only happens when optimizing. Remove -Oz and it compiles quickly.

  • If I always set the SettingDescBase::max field to 0, the compilation speed is much faster, and that speedup happens across architectures:

#define NSD_GENERAL(name, def, cmd, guiflags, min, max, interval, many, str, strhelp, strval, proc, load, cat)\
	{name, (const void*)(size_t)(def), cmd, guiflags, min, 0, interval, many, str, strhelp, strval, proc, load, cat}

With SettingDescBase::max set to 0:

clang++ --target=aarch64-linux-android21

real    0m10.778s
user    0m10.509s
sys     0m0.268s


clang++ --target=x86_64-linux-android21

real    0m7.659s
user    0m7.399s
sys     0m0.260s


clang++ --target=armv7a-linux-androideabi16 -mthumb

real    0m33.410s
user    0m33.215s
sys     0m0.188s


clang++ --target=i686-linux-android16 -mstackrealign

real    0m11.222s
user    0m10.648s
sys     0m0.360s

Default behavior, preserving the value of max (note that I killed clang++ on the 32-bit builds after ~4 mins):

clang++ --target=aarch64-linux-android21

real    0m17.448s
user    0m17.221s
sys     0m0.224s


clang++ --target=x86_64-linux-android21

real    0m52.613s
user    0m52.368s
sys     0m0.236s


clang++ --target=armv7a-linux-androideabi16 -mthumb
Terminated

real    3m59.151s
user    3m58.961s
sys     0m0.136s


clang++ --target=i686-linux-android16 -mstackrealign
Terminated

real    3m43.530s
user    3m43.386s
sys     0m0.104s

jfgoog avatar May 24 '22 03:05 jfgoog

Some more things I've discovered:

Clang turns global variable initialization into a function __cxx_global_var_init. In the IR, that function is marked "please optimize me":

; Function Attrs: minsize nounwind optsize
define internal fastcc void @__cxx_global_var_init.15() unnamed_addr #2 section ".text.startup" !dbg !7902 {

For a typical function, you could turn off optimization with __attribute__((optnone)), but that doesn't work here: the front end doesn't accept optnone in the context of declaring a global variable (warning: 'optnone' attribute only applies to functions and Objective-C methods [-Wignored-attributes]).
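The limitation can be reproduced in isolation; a minimal sketch (the function name is hypothetical):

```cpp
// optnone is accepted on function definitions and suppresses
// optimization of that function's body:
__attribute__((optnone)) int identity(int x) { return x; }

// ...but on a variable declaration the attribute is dropped, and
// -Wignored-attributes warns that 'optnone' only applies to functions
// and Objective-C methods:
//
//   __attribute__((optnone)) const int v = identity(5);  // warning
```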

If I look at the IR, the array is initialized with a mixture of stores, like this:

  store i32 0, i32* getelementptr inbounds ([171 x %struct.SettingDesc], [171 x %struct.SettingDesc]* @_ZL9_settings, i32 0, i32 0, i32 0, i32 5), align 4, !dbg !7903, !tbaa !7913

and a few memsets like this:

  call void @llvm.memset.p0i8.i64(i8* nonnull align 2 dereferenceable(43) bitcast (i16* getelementptr inbounds ([171 x %struct.SettingDesc], [171 x %struct.SettingDesc]* @_ZL9_settings, i32 0, i32 1, i32 0, i32 3) to i8*), i8 0, i64 43, i1 false), !dbg !7932

In the original code, there are 15 memsets. When forcing max to be 0, that jumps to 111.
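The store/memset mixture can be sketched with a hedged reduction (field names and the 43-byte run below are illustrative, chosen only to echo the IR above):

```cpp
// Hypothetical reduction: a non-constexpr call forces the array into a
// dynamic initializer (__cxx_global_var_init). Inside that function,
// clang emits explicit stores for non-zero fields and can coalesce
// adjacent zero-filled bytes into llvm.memset calls.
struct Entry {
    int id;
    char pad[43];   // echoes the dereferenceable(43) memset above
    int max;
};

static int NextId() { return 1; }  // not constexpr, so no static init

const Entry entries[2] = {
    { NextId(), {}, 100 },  // `pad` is a 43-byte zero run: memset candidate
    { NextId(), {}, 0 },    // a zero `max` extends the zero run across fields
};
```

Setting max to 0 everywhere turns many isolated stores into longer zero runs, which would explain the jump from 15 memsets to 111.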

jfgoog avatar May 24 '22 14:05 jfgoog

Our current hypothesis is that something in const SettingDesc _settings[] can't be evaluated as a constant expression, so a special __cxx_global_var_init function is emitted to initialize the array at runtime.

That being said, the compiler is still doing the wrong thing here.
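The hypothesis can be illustrated with a minimal sketch (assumed, simplified types; at -O0 clang typically emits the first object's initializer as a __cxx_global_var_init function, while the constexpr object becomes plain data):

```cpp
#include <cstddef>

struct Desc { const void *def; int max; };

static const void *LookupDefault() { return nullptr; }  // not constexpr

// const alone doesn't guarantee static initialization: this initializer
// calls a non-constexpr function, so it runs at program startup via
// __cxx_global_var_init (unless the optimizer manages to fold it away).
const Desc dynamic_init = { LookupDefault(), 100 };

// constexpr requires a constant expression and guarantees the object is
// emitted as plain data, with no initializer function at all.
constexpr Desc static_init = { nullptr, 0 };
```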

jfgoog avatar May 24 '22 20:05 jfgoog

Upstream bug: https://github.com/llvm/llvm-project/issues/55798

jfgoog avatar May 31 '22 17:05 jfgoog