AI-on-the-edge-device
Check for corrupted models
The Feature
We sporadically see reports of the system crashing after loading a model, e.g. https://github.com/jomjol/AI-on-the-edge-device/discussions/3177#discussioncomment-10350419 Usually this is due to an SD card going bad.
I think it would be wise to check the models before we use them (and so prevent a crash). The best way would be to detect a corrupted model and handle it at load time. I am not sure if this is possible with the tflite library we use. The other way would be to provide a second file per model containing its CRC32 or MD5 sum. The firmware could then check the model against it and handle a mismatch.
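A minimal sketch of the checksum idea, assuming the model file has already been read into memory and the expected CRC32 was read from a sidecar file next to it. The names crc32_update and model_matches_checksum are hypothetical and not part of the firmware; the CRC variant is the common IEEE one used by zip/zlib, computed bit by bit to avoid a lookup table.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
// Pass 0 as the initial value; pass the previous result to continue
// over a further chunk, so a large model can be checked block by block.
uint32_t crc32_update(uint32_t crc, const uint8_t* data, size_t len) {
    crc = ~crc;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return ~crc;
}

// Hypothetical helper: returns true if the model buffer matches the
// checksum stored in the (assumed) sidecar file.
bool model_matches_checksum(const uint8_t* model, size_t len, uint32_t expected) {
    if (model == nullptr) {
        return false;
    }
    return crc32_update(0, model, len) == expected;
}
```

The firmware would call model_matches_checksum before constructing the interpreter and report an error instead of continuing when it returns false. A checksum only detects that the file changed since the sidecar was written; it cannot repair the model.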
It crashes here:
this->interpreter = new tflite::MicroInterpreter(this->model, resolver, this->tensor_arena, this->kTensorArenaSize);
https://github.com/jomjol/AI-on-the-edge-device/blob/rolling/code/components/jomjol_tfliteclass/CTfLiteClass.cpp#L208
@Slider0007 @SybexX @jomjol Do you have experience with enabling exception handling?
IMO this is the only way to catch the crash which is inside the tflite library.
However, I am unable to enable exception handling in the platformio.ini file.
Whatever I do, I get "error: exception handling disabled, use '-fexceptions' to enable", even though I already replaced -fno-exceptions with -fexceptions...
@caco3: I've never used exception handling in the ESP-IDF environment, so I cannot assist with this topic. If I understand correctly, this could be tricky, because every potential exception all over the software needs a catch; otherwise processing is aborted in the error case. That would potentially be a lot of work...
Besides exception handling, at least some sort of version check could be added. Maybe this helps a bit, depending on how the file gets corrupted. The question is what the reaction should be, because the flow in the jomjol firmware cannot be aborted gracefully anyway...
https://github.com/Slider0007/AI-on-the-edge-device/blob/7f14d89bc013f6db145eac343d90b4b457ae11b3/code/components/jomjol_tfliteclass/CTfLiteClass.cpp#L325
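As an illustration of such a cheap pre-check (not necessarily what the linked code does), a model buffer can be sanity-checked before it is handed to the interpreter. The sketch below relies on the TFLite FlatBuffer layout: a 4-byte little-endian offset to the root table, followed by the file identifier "TFL3". The function name looks_like_tflite_model is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Cheap header check for a .tflite buffer. It only catches corruption in
// the first 8 bytes and truncated files; it is not a full verification.
bool looks_like_tflite_model(const uint8_t* data, size_t len) {
    if (data == nullptr || len < 8) {
        return false;  // too short to hold root offset + identifier
    }
    // Bytes 4..7 hold the FlatBuffer file identifier.
    if (std::memcmp(data + 4, "TFL3", 4) != 0) {
        return false;
    }
    // Bytes 0..3 hold the little-endian offset of the root table.
    const uint32_t root = static_cast<uint32_t>(data[0]) |
                          (static_cast<uint32_t>(data[1]) << 8) |
                          (static_cast<uint32_t>(data[2]) << 16) |
                          (static_cast<uint32_t>(data[3]) << 24);
    // The root table must start after the 8-byte header and leave room
    // for its own 4-byte vtable offset inside the buffer.
    return root >= 8 && root <= len - 4;
}
```

A check like this is fast enough to run on every boot, but because it only inspects the header, it would miss corruption in the weights or operator tables further into the file.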
Nevertheless, I raise the question: what will the user do if it's not crashing but still not working anymore because the model is corrupted anyway? This does not solve the root cause, SD cards wearing out quickly because of tons of read cycles on the same files...
this could be tricky, because every potential exception all over the software needs a catch; otherwise processing is aborted in the error case. That would potentially be a lot of work
Well, without exceptions (as is now), it goes directly into an abort!
The question is what the reaction should be
The crash happens within the tflite library. The right way would be to patch that library, but I fear it might require a lot of learning first. The other way is to add a try/catch around the call. This way we can notify the user gracefully without crashing. As of now, I don't know how to do it, so the only thing we can do is add a debug log message just before that call. The current behavior is that after a crash, we stay in DEBUG log level and delay the first round by 5 minutes. This way we will see the log message indicating the issue. See the example in https://github.com/jomjol/AI-on-the-edge-device/pull/3220.
Your proposal with the version check sounds like a good start, but it will not be able to catch all corruptions.
Nevertheless, I raise the question: what will the user do if it's not crashing but still not working anymore because the model is corrupted anyway?
We can simply show an error in the UI/MQTT, ...
This does not solve the root cause, SD cards wearing out quickly because of tons of read cycles on the same files...
Yes, that's right, but I have the feeling we have had quite a few bug reports because of corrupted models/filesystems. Because of this I investigated and saw that there is actually no validation.
https://github.com/jomjol/AI-on-the-edge-device/pull/3220