[Crash] MACE crashes in cl_a5x_cmdbuf_mgr_submit_ibs on Xiaomi Redmi 4 Pro

Open sumant85 opened this issue 5 years ago • 4 comments

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Android
  • NDK version(e.g., 15c): r16b
  • MACE version (Use the command: git describe --long --tags): Built from https://github.com/XiaoMi/mace/commits/fef913af05fb1651ee58af8818f8afa0f6693c57
  • Device: Xiaomi Redmi 4 Pro

Model deploy file (*.yml)

  • Custom model

Describe the problem

  • MACE crashes in ResizeNearestNeighbor
fault :
    thread : 14783 >>> Inference <<<
    signal : SIGABRT
    code : SI_TKILL
    abort message : Exiting the process com.xyz.android from function cl_a5x_cmdbuf_mgr_submit_ibs and line 919
threads :
    14783 :
        name : Inference
        crashed : true
        0 : 00044358 /system/lib/libc.so (tgkill+12)
        1 : 00041f5b /system/lib/libc.so (pthread_kill+34)
        2 : 0001ba71 /system/lib/libc.so (raise+12)
        3 : 00018c13 /system/lib/libc.so (__libc_android_abort+36)
        4 : 000167d2 /system/lib/libc.so (abort+6)
        5 : 00002e99 /system/lib/liblog.so (__android_log_assert+88)
        6 : 0002343f /system/vendor/lib/libgsl.so (os_exit+30)
        7 : 0004cc77 /system/vendor/lib/libCB.so (cl_a5x_cmdbuf_mgr_submit_ibs+666)
        8 : 00025e41 /system/vendor/lib/libCB.so (cb_release_command_queue+104)
        9 : 00009279 /system/vendor/lib/libOpenCL.so (qCLDrvAPI_clReleaseCommandQueue+24)
        10 : |
             000c15d9 /data/app/com.xyz.android-2/lib/arm/libabc.so
             clReleaseCommandQueue at libgcc2.c:?
        11 : |
             00044605 /data/app/com.xyz.android-2/lib/arm/libabc.so
             cl::detail::Wrapper<_cl_command_queue*>::~Wrapper() at libgcc2.c:?
        12 : |
             000dd553 /data/app/com.xyz.android-2/lib/arm/libabc.so
             mace::OpenCLAllocator::Unmap(void*, void*) const at libgcc2.c:?
        13 : |
             0004939f /data/app/com.xyz.android-2/lib/arm/libabc.so
             mace::Buffer::UnMap(void*) const at libgcc2.c:?
        14 : |
             000494c7 /data/app/com.xyz.android-2/lib/arm/libabc.so
             mace::Buffer::UnMap() at libgcc2.c:?
        15 : |
             000470b9 /data/app/com.xyz.android-2/lib/arm/libabc.so
             mace::Tensor::MappingGuard::~MappingGuard() at libgcc2.c:?
        16 : |
             00054aa5 /data/app/com.xyz.android-2/lib/arm/libabc.so
             mace::ops::opencl::image::ResizeNearestNeighborKernel<half_float::half>::Compute(mace::OpContext*, mace::Tensor const*, mace::Tensor const*, mace::Tensor*) at libgcc2.c:?
        17 : |
             00052ecf /data/app/com.xyz.android-2/lib/arm/libabc.so
             mace::ops::ResizeNearestNeighborOp<(mace::DeviceType)2, float>::Run(mace::OpContext*) at libgcc2.c:?
        18 : |
             000d2c59 /data/app/com.xyz.android-2/lib/arm/libabc.so
             mace::SerialNet::Run(mace::RunMetadata*) at libgcc2.c:?

Additional context

  • This is a crash we see in production when running a model using ResizeNearestNeighborKernel (primarily on Adreno 50x GPUs).
  • The crash is not 100% reproducible locally, but we keep receiving crash reports like the one above.
  • Any pointers on what might be the root cause and potential bugfix?

sumant85 avatar Jun 18 '19 19:06 sumant85

Is this crash reproducible on other devices? The stack points to a memcpy from CPU to GPU device, but only a size data...

yejw5 avatar Jun 24 '19 03:06 yejw5

@yejw5 yep, we are seeing it on other devices as well. One thing I noticed is that the code creates a copy of the command queue instead of taking it by reference (it uses auto instead of auto&) in https://github.com/XiaoMi/mace/blob/master/mace/core/runtime/opencl/opencl_allocator.cc#L169. Is that expected?

sumant85 avatar Jun 24 '19 20:06 sumant85

@sumant85 It's OK to use auto, because the command queue is reference-counted internally. Can you try the newest master branch?

yejw5 avatar Jun 25 '19 06:06 yejw5

cl::CommandQueue is a reference-counted pointer. Here is the source: https://github.com/KhronosGroup/OpenCL-CLHPP/blob/master/input_cl2.hpp#L1540 https://github.com/KhronosGroup/OpenCL-CLHPP/blob/master/input_cl2.hpp#L1783 https://github.com/KhronosGroup/OpenCL-CLHPP/blob/master/input_cl2.hpp#L6632
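
To make the auto vs. auto& point concrete, here is a minimal sketch of how a reference-counted handle behaves when it is copied. It does not use OpenCL: FakeQueue, FakeRetain, FakeRelease, QueueHandle and FakeRuntime are all hypothetical stand-ins, only loosely modelled on cl::detail::Wrapper and on a runtime that hands out its queue by reference. The idea is that a by-value copy retains the underlying object and releases it again when the copy goes out of scope, so the object itself is only destroyed when the last owner lets go.

```cpp
#include <cassert>

// Hypothetical stand-in for a driver object such as cl_command_queue.
struct FakeQueue {
  int refcount = 1;        // a freshly created object holds one reference
  bool destroyed = false;  // set once the last reference is released
};

// Hypothetical stand-ins for the retain/release entry points.
inline void FakeRetain(FakeQueue* q) { ++q->refcount; }
inline void FakeRelease(FakeQueue* q) {
  assert(q->refcount > 0);
  if (--q->refcount == 0) q->destroyed = true;
}

// Minimal reference-counted handle (assumption: simplified, not the real
// cl::detail::Wrapper). Copying retains, destroying releases.
class QueueHandle {
 public:
  explicit QueueHandle(FakeQueue* q) : q_(q) {}  // adopts the initial reference
  QueueHandle(const QueueHandle& other) : q_(other.q_) {
    if (q_) FakeRetain(q_);
  }
  QueueHandle& operator=(const QueueHandle& other) {
    if (this != &other) {
      if (q_) FakeRelease(q_);
      q_ = other.q_;
      if (q_) FakeRetain(q_);
    }
    return *this;
  }
  ~QueueHandle() {
    if (q_) FakeRelease(q_);
  }
  FakeQueue* get() const { return q_; }

 private:
  FakeQueue* q_;
};

// Hypothetical runtime that owns the long-lived queue and exposes it by
// reference.
class FakeRuntime {
 public:
  explicit FakeRuntime(FakeQueue* q) : queue_(q) {}
  QueueHandle& command_queue() { return queue_; }

 private:
  QueueHandle queue_;
};

// `auto` deduces QueueHandle, so this makes a copy and retains the queue.
// The copy's destructor releases it again when the function returns.
int RefcountWithCopy(FakeRuntime& rt) {
  auto queue = rt.command_queue();
  return queue.get()->refcount;
}

// `auto&` binds to the runtime's own handle; the count is not touched.
int RefcountWithReference(FakeRuntime& rt) {
  auto& queue = rt.command_queue();
  return queue.get()->refcount;
}
```

In this sketch the copy only costs one extra retain/release pair, and the queue stays alive as long as the runtime still holds its own handle. Whether the driver does additional work inside the release call is not modelled here.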

We have encountered a similar crash inside libOpenCL, and it turned out to be caused by memory corruption from another module. To check whether there is a memory-related issue, you can enable ASAN by turning this on (https://github.com/XiaoMi/mace/blob/master/tools/bazel.rc#L96). You can also use it to check external native code.

llhe avatar Jun 25 '19 07:06 llhe