[Mobile] Custom build for Android often crashes due to insufficient JVM heap size
Describe the issue
When I run the custom build command for Android, the build frequently crashes at random points in the build process.
One example (stack trace truncated as it is too long):
...
[353/832] Building CXX object CMakeFiles/onnxruntime_providers_nnapi.dir/workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc.o
FAILED: CMakeFiles/onnxruntime_providers_nnapi.dir/workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc.o
"/workspace/~/android-sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/linux-x86_64/bin/clang++" --target=i686-none-linux-android21 --sysroot="/workspace/~/android-sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/lin
ux-x86_64/sysroot" -DCPUINFO_SUPPORTED_PLATFORM=1 -DDISABLE_FLOAT8_TYPES -DDISABLE_ML_OPS -DEIGEN_MPL2_ONLY -DEIGEN_USE_THREADS -DJSON_NOEXCEPTION -DMLAS_NO_EXCEPTION -DNSYNC_ATOMIC_CPP11 -DONNX_ML -DONNX_NAMESPA
CE=onnx -DONNX_NO_EXCEPTIONS -DONNX_USE_LITE_PROTO -DORT_EXTENDED_MINIMAL_BUILD -DORT_MINIMAL_BUILD -DORT_NO_EXCEPTIONS -DORT_NO_RTTI -DPLATFORM_POSIX -DREDUCED_OPS_BUILD -DUSE_NNAPI=1 -D__ONNX_NO_DOC_STRINGS -I/
workspace/build/intermediates/x86/Release/_deps/utf8_range-src -I/workspace/onnxruntime/include/onnxruntime -I/workspace/onnxruntime/include/onnxruntime/core/session -I/workspace/build/intermediates/x86/Release/_
deps/pytorch_cpuinfo-src/include -I/workspace/build/intermediates/x86/Release/_deps/google_nsync-src/public -I/workspace/build/intermediates/x86/Release -I/workspace/onnxruntime/onnxruntime -I/workspace/build/int
ermediates/x86/Release/_deps/abseil_cpp-src -I/workspace/build/intermediates/x86/Release/_deps/safeint-src -I/workspace/build/intermediates/x86/Release/_deps/gsl-src/include -I/workspace/build/intermediates/x86/R
elease/_deps/onnx-src -I/workspace/build/intermediates/x86/Release/_deps/protobuf-src/src -I/workspace/build/intermediates/x86/Release/_deps/flatbuffers-src/include -I/workspace/build/intermediates/x86/Release/_d
eps/mp11-src/include -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables -fstack-protector-strong -no-canonical-prefixes -mstackrealign -D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -ffuncti
on-sections -fdata-sections -fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables -DCPUINFO_SUPPORTED -O3 -DNDEBUG -O3 -std=gnu++17 -fPIC -fno-rtti -Wall -Wextra -Wno-unused-parameter -Wno-deprecate
d-copy -Wno-tautological-pointer-compare -Wno-ambiguous-reversed-operator -Wno-deprecated-anon-enum-enum-conversion -Wno-undefined-var-template -Wno-deprecated-builtins -Wshorten-64-to-32 -Wno-unknown-pragmas -We
rror -MD -MT CMakeFiles/onnxruntime_providers_nnapi.dir/workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc.o -MF CMakeFiles/onnxruntime_providers_nnapi.dir/
workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc.o.d -o CMakeFiles/onnxruntime_providers_nnapi.dir/workspace/onnxruntime/onnxruntime/core/providers/nnapi/
nnapi_builtin/builders/impl/leakyrelu_op_builder.cc.o -c /workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc
PLEASE submit a bug report to https://github.com/android-ndk/ndk/issues and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0. Program arguments: /workspace/~/android-sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/linux-x86_64/bin/clang++ --target=i686-none-linux-android21 --sysroot=/workspace/~/android-sdk/ndk/26.1.10909125/tool
chains/llvm/prebuilt/linux-x86_64/sysroot -DCPUINFO_SUPPORTED_PLATFORM=1 -DDISABLE_FLOAT8_TYPES -DDISABLE_ML_OPS -DEIGEN_MPL2_ONLY -DEIGEN_USE_THREADS -DJSON_NOEXCEPTION -DMLAS_NO_EXCEPTION -DNSYNC_ATOMIC_CPP11 -
DONNX_ML -DONNX_NAMESPACE=onnx -DONNX_NO_EXCEPTIONS -DONNX_USE_LITE_PROTO -DORT_EXTENDED_MINIMAL_BUILD -DORT_MINIMAL_BUILD -DORT_NO_EXCEPTIONS -DORT_NO_RTTI -DPLATFORM_POSIX -DREDUCED_OPS_BUILD -DUSE_NNAPI=1 -D__
ONNX_NO_DOC_STRINGS -I/workspace/build/intermediates/x86/Release/_deps/utf8_range-src -I/workspace/onnxruntime/include/onnxruntime -I/workspace/onnxruntime/include/onnxruntime/core/session -I/workspace/build/inte
rmediates/x86/Release/_deps/pytorch_cpuinfo-src/include -I/workspace/build/intermediates/x86/Release/_deps/google_nsync-src/public -I/workspace/build/intermediates/x86/Release -I/workspace/onnxruntime/onnxruntime
-I/workspace/build/intermediates/x86/Release/_deps/abseil_cpp-src -I/workspace/build/intermediates/x86/Release/_deps/safeint-src -I/workspace/build/intermediates/x86/Release/_deps/gsl-src/include -I/workspace/bu
ild/intermediates/x86/Release/_deps/onnx-src -I/workspace/build/intermediates/x86/Release/_deps/protobuf-src/src -I/workspace/build/intermediates/x86/Release/_deps/flatbuffers-src/include -I/workspace/build/inter
mediates/x86/Release/_deps/mp11-src/include -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables -fstack-protector-strong -no-canonical-prefixes -mstackrealign -D_FORTIFY_SOURCE=2 -Wformat -Werror=for
mat-security -ffunction-sections -fdata-sections -fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables -DCPUINFO_SUPPORTED -O3 -DNDEBUG -O3 -std=gnu++17 -fPIC -fno-rtti -Wall -Wextra -Wno-unused-para
meter -Wno-deprecated-copy -Wno-tautological-pointer-compare -Wno-ambiguous-reversed-operator -Wno-deprecated-anon-enum-enum-conversion -Wno-undefined-var-template -Wno-deprecated-builtins -Wshorten-64-to-32 -Wno
-unknown-pragmas -Werror -MD -MT CMakeFiles/onnxruntime_providers_nnapi.dir/workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc.o -MF CMakeFiles/onnxruntime_
providers_nnapi.dir/workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc.o.d -o CMakeFiles/onnxruntime_providers_nnapi.dir/workspace/onnxruntime/onnxruntime/c
ore/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc.o -c /workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc
1. <eof> parser at end of file
2. Code generation
3. Running pass 'Function Pass Manager' on module '/workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc'.
4. Running pass 'X86 DAG->DAG Instruction Selection' on function '@_ZNK11onnxruntime5nnapi18LeakyReluOpBuilder21AddToModelBuilderImplERNS0_12ModelBuilderERKNS_8NodeUnitE'
#0 0x000056402a437445 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/workspace/~/android-sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/linux-x86_64/bin/clang+++0x4a37445)
#1 0x000056402a4364b0 llvm::sys::RunSignalHandlers() (/workspace/~/android-sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/linux-x86_64/bin/clang+++0x4a364b0)
#2 0x000056402a401bae (/workspace/~/android-sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/linux-x86_64/bin/clang+++0x4a01bae)
#3 0x000056402a401d16 (/workspace/~/android-sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/linux-x86_64/bin/clang+++0x4a01d16)
#4 0x00007f7e55cd5420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
#5 0x000056402c40c0da llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults*, llvm::CodeGenOpt::Level) (/workspace/~/android-sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/linux-x86_64/bin/clang+++0x6a0
c0da)
#6 0x00007ffce044a390
...
The crash can occur at a different point on each run, but the behaviour is always similar: the NDK compiler dumps a stack trace.
Since build_custom_android_package.py invokes a docker run command, I suspect that this is due to insufficient JVM heap memory within the Docker container during building.
To reproduce
Simply clone onnxruntime and run the custom build command as per documentation:
python3 tools/android_custom_build/build_custom_android_package.py \
--onnxruntime_branch_or_tag v<ORT version> \
--include_ops_by_config /path/to/ops.config \
--build_settings /path/to/build_settings.json \
/path/to/working/dir
The issue occurs regardless of the ORT version.
Urgency
No response
Platform
Android
OS Version
Ubuntu 22.04
ONNX Runtime Installation
Built from Source
Compiler Version (if 'Built from Source')
Custom Android build using Docker
Package Name (if 'Released Package')
None
ONNX Runtime Version or Commit ID
v1.17.0 (but the issue happens with any version)
ONNX Runtime API
Java/Kotlin
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
I find that this can be solved if the JVM max memory size is set in the Docker run flags, e.g.
docker run ... -e JAVA_OPTS="-Xms16G -Xmx16G" ...
I would suggest adding these flags to build_custom_android_package.py under docker_container_build_cmd. In my case, -Xms16G -Xmx16G works, but reducing it to 8G causes a crash. I would be happy to submit a PR if this solution is acceptable.
UPDATE: After some investigation, I can confirm that https://github.com/microsoft/onnxruntime/issues/19584 is specific to one machine: that machine is faulty and often overheats, and it throws random errors at very high temperatures (~100 °C). We ran the build script on another machine and it built successfully. I can also confirm that the JVM fix above is irrelevant to the issue.
We do sometimes hit an OOM issue if the Docker container has insufficient memory (I observe that the build process might need 16 GB of memory or more). This can be solved by passing a memory limit option to docker run, e.g. -m 32g, which means https://github.com/microsoft/onnxruntime/pull/19630 should be a helpful fix.
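For anyone who wants to try the memory option, here is a minimal sketch of how a docker run command could be assembled in Python with an optional memory limit. It is not the code in build_custom_android_package.py or in the PR above; the function name build_docker_run_cmd, the image name, and the container command are all hypothetical placeholders. The only real Docker option used for the limit is --memory (the long form of -m).

```python
import shlex
from typing import List, Optional


def build_docker_run_cmd(
    image: str,
    container_cmd: List[str],
    memory_limit: Optional[str] = None,
) -> List[str]:
    """Assemble a `docker run` argument list (hypothetical helper).

    memory_limit is passed through to `--memory` unchanged, e.g. "32g".
    """
    cmd = ["docker", "run", "--rm"]
    if memory_limit is not None:
        # --memory (short form: -m) sets the container memory limit.
        cmd += ["--memory", memory_limit]
    cmd.append(image)
    cmd += container_cmd
    return cmd


if __name__ == "__main__":
    example = build_docker_run_cmd(
        image="example-build-image",  # hypothetical image name
        container_cmd=["/bin/bash", "build.sh"],  # hypothetical build command
        memory_limit="32g",
    )
    # Print the command as a single shell-quoted string for inspection.
    print(shlex.join(example))
```

Leaving memory_limit as None omits the option entirely, so the container keeps whatever default the Docker installation applies.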