mace
mace copied to clipboard
LOG_FATAL aborts instead of throwing exception
Before you open an issue, please make sure you have tried the following steps:
- Make sure your environment is the same with (https://mace.readthedocs.io/en/latest/installation/env_requirement.html).
- Have you ever read the document for your usage?
- Check if your issue appears in HOW-TO-DEBUG or FAQ.
- The form below must be filled.
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacOS 10.13.6
- NDK version(e.g., 15c): r16b
- GCC version(if compiling for host, e.g., 5.4.0):
- MACE version (Use the command: git describe --long --tags): v0.10.0
- Python version(2.7): 2.7
- Bazel version (e.g., 0.13.0): 0.20.0-homebrew
Model deploy file (*.yml)
......
Describe the problem
- MACE LOG_FATAL currently calls
abort()
https://github.com/XiaoMi/mace/blob/master/mace/utils/logging.cc#L33. - This requires that we implement a signal handler and makes it harder to fall back to some other inference engine. Ideally, we expect that MACE would throw an exception that can be caught externally, and then the app can decide the appropriate action.
- Also, it is a bit unexpected for that MACE would call abort() and is most likely left unhandled until someone actually runs into this issue (causing app to crash in production in the worst case). Raising exceptions would be a more standard approach here IMO.
After reading through the code a bit more, it looks like throwing exception would not work since that call is made from the destructor. However, I think we should still come up with a solution which doesn't crash the app.
@sumant85 This is by design. MACE is a fundamental library instead of a user facing application level code, so the design is focused on correctness and we adopt fail-fast instead of fail-safe philosophy.
All the MACE_CHECK or LOG(FATAL) means coding error and all logic error are handled as error status. If you find some code breaks this guideline, please point out (there may be some places, one example, we abort when there is OOM before, and realized later this could happen and recoverable by function degradation so we remove the abort).
For more information about fail-fast and fail-safe, you can refer to ch-8 of Code Complete.
@llhe Thanks for providing additional context!
Although MACE isn't meant to be user-facing code, I would argue that in the mobile app setting, applications using this library would want to handle failure more gracefully than the app crashing. Given the wide range of android devices, it becomes very difficult to test whether a given model would not cause a crash across all devices. In our case, we were running into https://github.com/XiaoMi/mace/blob/master/mace/ops/opencl/helper.cc#L290 on one specific device only when using a particular model (and there may be other crashes we don't know about). The abort()
bypasses all our error-handling and crashes the app. If this reaches production, it would take about 2-3 days (best case) to roll out a fix, for a feature that might be a secondary function within the application (depending on the use)
However, it's great that MACE is open-source, so we can patch this for our use case :) But in my opinion, throwing an exception and letting the application decide how to deal with the error would be more suited for mobile development. We've seen great results with this library so far, keep up the good work!
@sumant85 Thanks for pointing out. It makes sense. These OpenCL platform related code should be handled as runtime errors instead of assert, we should fix them. It's will be great if you can point out the places or submit a PR.