[Mobile] Custom build for Android often crashes due to insufficient JVM heap size

Open gudgud96 opened this issue 1 year ago • 1 comments

Describe the issue

When I run the custom build command for Android, the build frequently crashes at random points in the process.

One example (stack trace truncated as it is too long):

...
[353/832] Building CXX object CMakeFiles/onnxruntime_providers_nnapi.dir/workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc.o                               
FAILED: CMakeFiles/onnxruntime_providers_nnapi.dir/workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc.o                                                     
"/workspace/~/android-sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/linux-x86_64/bin/clang++" --target=i686-none-linux-android21 --sysroot="/workspace/~/android-sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/lin
ux-x86_64/sysroot" -DCPUINFO_SUPPORTED_PLATFORM=1 -DDISABLE_FLOAT8_TYPES -DDISABLE_ML_OPS -DEIGEN_MPL2_ONLY -DEIGEN_USE_THREADS -DJSON_NOEXCEPTION -DMLAS_NO_EXCEPTION -DNSYNC_ATOMIC_CPP11 -DONNX_ML -DONNX_NAMESPA
CE=onnx -DONNX_NO_EXCEPTIONS -DONNX_USE_LITE_PROTO -DORT_EXTENDED_MINIMAL_BUILD -DORT_MINIMAL_BUILD -DORT_NO_EXCEPTIONS -DORT_NO_RTTI -DPLATFORM_POSIX -DREDUCED_OPS_BUILD -DUSE_NNAPI=1 -D__ONNX_NO_DOC_STRINGS -I/
workspace/build/intermediates/x86/Release/_deps/utf8_range-src -I/workspace/onnxruntime/include/onnxruntime -I/workspace/onnxruntime/include/onnxruntime/core/session -I/workspace/build/intermediates/x86/Release/_
deps/pytorch_cpuinfo-src/include -I/workspace/build/intermediates/x86/Release/_deps/google_nsync-src/public -I/workspace/build/intermediates/x86/Release -I/workspace/onnxruntime/onnxruntime -I/workspace/build/int
ermediates/x86/Release/_deps/abseil_cpp-src -I/workspace/build/intermediates/x86/Release/_deps/safeint-src -I/workspace/build/intermediates/x86/Release/_deps/gsl-src/include -I/workspace/build/intermediates/x86/R
elease/_deps/onnx-src -I/workspace/build/intermediates/x86/Release/_deps/protobuf-src/src -I/workspace/build/intermediates/x86/Release/_deps/flatbuffers-src/include -I/workspace/build/intermediates/x86/Release/_d
eps/mp11-src/include -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables -fstack-protector-strong -no-canonical-prefixes -mstackrealign -D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security   -ffuncti
on-sections -fdata-sections -fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables -DCPUINFO_SUPPORTED -O3 -DNDEBUG  -O3 -std=gnu++17 -fPIC -fno-rtti -Wall -Wextra -Wno-unused-parameter -Wno-deprecate
d-copy -Wno-tautological-pointer-compare -Wno-ambiguous-reversed-operator -Wno-deprecated-anon-enum-enum-conversion -Wno-undefined-var-template -Wno-deprecated-builtins -Wshorten-64-to-32 -Wno-unknown-pragmas -We
rror -MD -MT CMakeFiles/onnxruntime_providers_nnapi.dir/workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc.o -MF CMakeFiles/onnxruntime_providers_nnapi.dir/
workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc.o.d -o CMakeFiles/onnxruntime_providers_nnapi.dir/workspace/onnxruntime/onnxruntime/core/providers/nnapi/
nnapi_builtin/builders/impl/leakyrelu_op_builder.cc.o -c /workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc                                                
PLEASE submit a bug report to https://github.com/android-ndk/ndk/issues and include the crash backtrace, preprocessed source, and associated run script.                                                            
Stack dump:                                                                                                                                                                                                         
0.      Program arguments: /workspace/~/android-sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/linux-x86_64/bin/clang++ --target=i686-none-linux-android21 --sysroot=/workspace/~/android-sdk/ndk/26.1.10909125/tool
chains/llvm/prebuilt/linux-x86_64/sysroot -DCPUINFO_SUPPORTED_PLATFORM=1 -DDISABLE_FLOAT8_TYPES -DDISABLE_ML_OPS -DEIGEN_MPL2_ONLY -DEIGEN_USE_THREADS -DJSON_NOEXCEPTION -DMLAS_NO_EXCEPTION -DNSYNC_ATOMIC_CPP11 -
DONNX_ML -DONNX_NAMESPACE=onnx -DONNX_NO_EXCEPTIONS -DONNX_USE_LITE_PROTO -DORT_EXTENDED_MINIMAL_BUILD -DORT_MINIMAL_BUILD -DORT_NO_EXCEPTIONS -DORT_NO_RTTI -DPLATFORM_POSIX -DREDUCED_OPS_BUILD -DUSE_NNAPI=1 -D__
ONNX_NO_DOC_STRINGS -I/workspace/build/intermediates/x86/Release/_deps/utf8_range-src -I/workspace/onnxruntime/include/onnxruntime -I/workspace/onnxruntime/include/onnxruntime/core/session -I/workspace/build/inte
rmediates/x86/Release/_deps/pytorch_cpuinfo-src/include -I/workspace/build/intermediates/x86/Release/_deps/google_nsync-src/public -I/workspace/build/intermediates/x86/Release -I/workspace/onnxruntime/onnxruntime
 -I/workspace/build/intermediates/x86/Release/_deps/abseil_cpp-src -I/workspace/build/intermediates/x86/Release/_deps/safeint-src -I/workspace/build/intermediates/x86/Release/_deps/gsl-src/include -I/workspace/bu
ild/intermediates/x86/Release/_deps/onnx-src -I/workspace/build/intermediates/x86/Release/_deps/protobuf-src/src -I/workspace/build/intermediates/x86/Release/_deps/flatbuffers-src/include -I/workspace/build/inter
mediates/x86/Release/_deps/mp11-src/include -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables -fstack-protector-strong -no-canonical-prefixes -mstackrealign -D_FORTIFY_SOURCE=2 -Wformat -Werror=for
mat-security -ffunction-sections -fdata-sections -fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables -DCPUINFO_SUPPORTED -O3 -DNDEBUG -O3 -std=gnu++17 -fPIC -fno-rtti -Wall -Wextra -Wno-unused-para
meter -Wno-deprecated-copy -Wno-tautological-pointer-compare -Wno-ambiguous-reversed-operator -Wno-deprecated-anon-enum-enum-conversion -Wno-undefined-var-template -Wno-deprecated-builtins -Wshorten-64-to-32 -Wno
-unknown-pragmas -Werror -MD -MT CMakeFiles/onnxruntime_providers_nnapi.dir/workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc.o -MF CMakeFiles/onnxruntime_
providers_nnapi.dir/workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc.o.d -o CMakeFiles/onnxruntime_providers_nnapi.dir/workspace/onnxruntime/onnxruntime/c
ore/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc.o -c /workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc                            
1.      <eof> parser at end of file                                                                                                                                                                                 
2.      Code generation                                                                                                                                                                                             
3.      Running pass 'Function Pass Manager' on module '/workspace/onnxruntime/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/impl/leakyrelu_op_builder.cc'.                                               
4.      Running pass 'X86 DAG->DAG Instruction Selection' on function '@_ZNK11onnxruntime5nnapi18LeakyReluOpBuilder21AddToModelBuilderImplERNS0_12ModelBuilderERKNS_8NodeUnitE'                                     
#0 0x000056402a437445 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/workspace/~/android-sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/linux-x86_64/bin/clang+++0x4a37445)                                  
#1 0x000056402a4364b0 llvm::sys::RunSignalHandlers() (/workspace/~/android-sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/linux-x86_64/bin/clang+++0x4a364b0)                                                       
#2 0x000056402a401bae (/workspace/~/android-sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/linux-x86_64/bin/clang+++0x4a01bae)                                                                                      
#3 0x000056402a401d16 (/workspace/~/android-sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/linux-x86_64/bin/clang+++0x4a01d16)                                                                                      
#4 0x00007f7e55cd5420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
#5 0x000056402c40c0da llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults*, llvm::CodeGenOpt::Level) (/workspace/~/android-sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/linux-x86_64/bin/clang+++0x6a0
c0da)
#6 0x00007ffce044a390 
...

The crash can occur at a different place on each run, but the behaviour is always similar: the NDK compiler crashes and dumps a stack trace.

Since build_custom_android_package.py invokes a docker run command, I suspect that this is due to insufficient JVM heap memory within the Docker container during building.

To reproduce

Simply clone onnxruntime and run the custom build command as per documentation:

python3 tools/android_custom_build/build_custom_android_package.py \
   --onnxruntime_branch_or_tag v<ORT version> \
   --include_ops_by_config /path/to/ops.config \
   --build_settings /path/to/build_settings.json \
   /path/to/working/dir

The issue occurs regardless of the ORT version.

Urgency

No response

Platform

Android

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Built from Source

Compiler Version (if 'Built from Source')

Custom Android build using Docker

Package Name (if 'Released Package')

None

ONNX Runtime Version or Commit ID

v1.17.0 (but the issue happens with any version)

ONNX Runtime API

Java/Kotlin

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

gudgud96 avatar Feb 21 '24 02:02 gudgud96

I find that this can be solved if the JVM max memory size is set in the Docker run flags, e.g.

docker run ... -e JAVA_OPTS="-Xms16G -Xmx16G" ...

I would suggest adding these flags to build_custom_android_package.py under docker_container_build_cmd. In my case, -Xms16G -Xmx16G works, but reducing it to 8G makes the build crash again. I would be happy to submit a PR if this solution is okay.
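A minimal sketch of the change suggested above, assuming docker_container_build_cmd is a plain list of `docker run` arguments. The helper name with_jvm_heap and the example list are hypothetical stand-ins, not the script's actual contents. Note that the later update in this thread reports that this JVM setting turned out to be unrelated to the crash.

```python
import shlex


def with_jvm_heap(docker_cmd, heap="16G"):
    """Return a copy of a `docker run` argument list with JAVA_OPTS set.

    Hypothetical helper: sets both the initial (-Xms) and maximum (-Xmx)
    JVM heap size to the same value via an environment variable.
    """
    if len(docker_cmd) < 2 or docker_cmd[1] != "run":
        raise ValueError("expected a 'docker run ...' argument list")
    env_flag = ["-e", f"JAVA_OPTS=-Xms{heap} -Xmx{heap}"]
    # docker run options must precede the image name, so insert after "run".
    return docker_cmd[:2] + env_flag + docker_cmd[2:]


# Hypothetical stand-in for docker_container_build_cmd; the real list differs.
example_cmd = ["docker", "run", "--rm", "example-image:latest", "/bin/bash", "build.sh"]
patched = with_jvm_heap(example_cmd)
print(shlex.join(patched))
```

Returning a new list rather than mutating the input keeps the original command available for comparison or for a retry without the extra flag.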

gudgud96 avatar Feb 21 '24 02:02 gudgud96

UPDATE: After some investigation, I can confirm that https://github.com/microsoft/onnxruntime/issues/19584 is specific to one machine. That machine is faulty and often overheats, and it throws random errors at very high temperatures (~100 °C). We ran the build script on another machine and it built successfully. I can also confirm that the JVM fix above is unrelated to the issue.

We do sometimes hit OOM errors if the Docker container has insufficient memory (I have observed that the build process may need 16 GB of memory or more). This can be solved by passing a memory flag to docker run, e.g. -m 32g, which means https://github.com/microsoft/onnxruntime/pull/19630 should be a helpful fix.
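As a rough illustration of the memory point above, the sketch below parses the MemTotal field of /proc/meminfo-style text and adds the -m option (the short form of --memory on docker run) to an argument list. The function names, the sample text, and the example command are assumptions made for illustration. This is not the content of the linked PR.

```python
def mem_total_gib(meminfo_text):
    """Parse MemTotal (reported in kB) from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found")


def with_memory_limit(docker_cmd, limit="32g"):
    """Return a copy of a `docker run` argument list with a memory limit set."""
    if len(docker_cmd) < 2 or docker_cmd[1] != "run":
        raise ValueError("expected a 'docker run ...' argument list")
    # docker run options must precede the image name, so insert after "run".
    return docker_cmd[:2] + ["-m", limit] + docker_cmd[2:]


# Sample text for illustration; on Linux, read /proc/meminfo instead.
sample_meminfo = "MemTotal:       16777216 kB\nMemFree:         1234567 kB\n"
total = mem_total_gib(sample_meminfo)
if total < 32:
    print(f"Host reports {total:.1f} GiB; a 32g limit exceeds physical memory")

# Hypothetical stand-in for the script's docker run command.
example_cmd = ["docker", "run", "--rm", "example-image:latest"]
print(" ".join(with_memory_limit(example_cmd)))
```

Checking the host's total memory first is a design choice in this sketch: a container limit larger than what the host actually has will not prevent an out-of-memory failure.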

gudgud96 avatar Feb 29 '24 06:02 gudgud96

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

github-actions[bot] avatar Mar 30 '24 15:03 github-actions[bot]