[Crash] MACE crashes in cl_a5x_cmdbuf_mgr_submit_ibs on Xiaomi Redmi 4 Pro
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Android
- NDK version(e.g., 15c): r16b
- MACE version (Use the command: git describe --long --tags): Built from https://github.com/XiaoMi/mace/commits/fef913af05fb1651ee58af8818f8afa0f6693c57
- Device: Xiaomi Redmi 4 Pro
Model deploy file (*.yml)
- Custom model
Describe the problem
- MACE crashes in ResizeNearestNeighbor
fault :
thread : 14783 >>> Inference <<<
signal : SIGABRT
code : SI_TKILL
abort message : Exiting the process com.xyz.android from function cl_a5x_cmdbuf_mgr_submit_ibs and line 919
threads :
14783 :
name : Inference
crashed : true
0 : 00044358 /system/lib/libc.so (tgkill+12)
1 : 00041f5b /system/lib/libc.so (pthread_kill+34)
2 : 0001ba71 /system/lib/libc.so (raise+12)
3 : 00018c13 /system/lib/libc.so (__libc_android_abort+36)
4 : 000167d2 /system/lib/libc.so (abort+6)
5 : 00002e99 /system/lib/liblog.so (__android_log_assert+88)
6 : 0002343f /system/vendor/lib/libgsl.so (os_exit+30)
7 : 0004cc77 /system/vendor/lib/libCB.so (cl_a5x_cmdbuf_mgr_submit_ibs+666)
8 : 00025e41 /system/vendor/lib/libCB.so (cb_release_command_queue+104)
9 : 00009279 /system/vendor/lib/libOpenCL.so (qCLDrvAPI_clReleaseCommandQueue+24)
10 : |
000c15d9 /data/app/com.xyz.android-2/lib/arm/libabc.so
clReleaseCommandQueue at libgcc2.c:?
11 : |
00044605 /data/app/com.xyz.android-2/lib/arm/libabc.so
cl::detail::Wrapper<_cl_command_queue*>::~Wrapper() at libgcc2.c:?
12 : |
000dd553 /data/app/com.xyz.android-2/lib/arm/libabc.so
mace::OpenCLAllocator::Unmap(void*, void*) const at libgcc2.c:?
13 : |
0004939f /data/app/com.xyz.android-2/lib/arm/libabc.so
mace::Buffer::UnMap(void*) const at libgcc2.c:?
14 : |
000494c7 /data/app/com.xyz.android-2/lib/arm/libabc.so
mace::Buffer::UnMap() at libgcc2.c:?
15 : |
000470b9 /data/app/com.xyz.android-2/lib/arm/libabc.so
mace::Tensor::MappingGuard::~MappingGuard() at libgcc2.c:?
16 : |
00054aa5 /data/app/com.xyz.android-2/lib/arm/libabc.so
mace::ops::opencl::image::ResizeNearestNeighborKernel<half_float::half>::Compute(mace::OpContext*, mace::Tensor const*, mace::Tensor const*, mace::Tensor*) at libgcc2.c:?
17 : |
00052ecf /data/app/com.xyz.android-2/lib/arm/libabc.so
mace::ops::ResizeNearestNeighborOp<(mace::DeviceType)2, float>::Run(mace::OpContext*) at libgcc2.c:?
18 : |
000d2c59 /data/app/com.xyz.android-2/lib/arm/libabc.so
mace::SerialNet::Run(mace::RunMetadata*) at libgcc2.c:?
Additional context
- This is a crash we see in production when running a model that uses ResizeNearestNeighborKernel (primarily on Adreno 50x GPUs).
- The crash is not 100% reproducible locally, but we see the above crash reports.
- Any pointers on what might be the root cause and a potential bugfix?
Is this crash reproducible on other devices? The stack points to a memcpy from the CPU to the GPU device, but only a size data...
@yejw5 yep, we are seeing it on other devices as well. One thing I noticed is that the code creates a copy of the command queue instead of taking it by reference in https://github.com/XiaoMi/mace/blob/master/mace/core/runtime/opencl/opencl_allocator.cc#L169 (by using auto instead of auto&). Is that expected?
@sumant85 It's OK to use auto, because the command queue has an internal reference count. Can you try the newest master branch?
cl::CommandQueue is a reference-counted pointer. Here is the source: https://github.com/KhronosGroup/OpenCL-CLHPP/blob/master/input_cl2.hpp#L1540 https://github.com/KhronosGroup/OpenCL-CLHPP/blob/master/input_cl2.hpp#L1783 https://github.com/KhronosGroup/OpenCL-CLHPP/blob/master/input_cl2.hpp#L6632
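To illustrate the auto versus auto& point, here is a minimal sketch of how a reference-counted handle behaves when copied. It does not use the real OpenCL headers: FakeQueue, Retain, Release, QueueHandle, and FakeRuntime are hypothetical stand-ins, loosely modeled on the retain-on-copy, release-on-destroy pattern of cl::detail::Wrapper. Under that assumption, a copy made with auto bumps the count and its destructor drops it again, which would be consistent with clReleaseCommandQueue appearing under OpenCLAllocator::Unmap in the stack trace above, while the underlying queue stays alive as long as the owner still holds its reference.

```cpp
#include <cassert>

// Hypothetical stand-in for an OpenCL object such as _cl_command_queue.
struct FakeQueue {
  int ref_count = 1;      // a newly created object starts with one reference
  bool released = false;  // set when the last reference is dropped
};

// Hypothetical stand-ins for clRetainCommandQueue / clReleaseCommandQueue.
inline void Retain(FakeQueue* q) { ++q->ref_count; }
inline void Release(FakeQueue* q) {
  if (--q->ref_count == 0) q->released = true;  // a real driver would free here
}

// Minimal handle: copying retains, destroying releases.
class QueueHandle {
 public:
  explicit QueueHandle(FakeQueue* q) : q_(q) {}  // adopts one existing reference
  QueueHandle(const QueueHandle& other) : q_(other.q_) {
    if (q_) Retain(q_);
  }
  QueueHandle& operator=(const QueueHandle& other) {
    if (this != &other) {
      if (q_) Release(q_);
      q_ = other.q_;
      if (q_) Retain(q_);
    }
    return *this;
  }
  ~QueueHandle() {
    if (q_) Release(q_);
  }
  FakeQueue* get() const { return q_; }

 private:
  FakeQueue* q_;
};

// Owner that hands out its queue by reference, as a runtime object might.
class FakeRuntime {
 public:
  explicit FakeRuntime(FakeQueue* q) : queue_(q) {}
  QueueHandle& command_queue() { return queue_; }

 private:
  QueueHandle queue_;
};

// Reference count observed while a copy made with plain auto is alive.
int RefCountWhileCopied(FakeRuntime& rt) {
  auto copy = rt.command_queue();  // copy constructor runs: count goes up
  return copy.get()->ref_count;    // copy's destructor releases on return
}

// Reference count observed while only a reference (auto&) is held.
int RefCountWhileReferenced(FakeRuntime& rt) {
  auto& ref = rt.command_queue();  // no copy, no retain, no release
  return ref.get()->ref_count;
}
```

In this sketch the copy costs one retain and one release per call but never drops the last reference, so the copy alone would not explain a crash. That matches the answer above that plain auto is safe here.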
We have encountered a similar crash inside libOpenCL, and it finally turned out to be caused by memory corruption from another module. To check whether there is a memory-related issue, you can enable ASAN by turning this on (https://github.com/XiaoMi/mace/blob/master/tools/bazel.rc#L96). You can also use it to check external native code.