
Core dump "Could not decode datum" during training

Open YaYaB opened this issue 5 years ago • 2 comments

If Ok, please give as many details as possible to help us solve the problem more efficiently.

Configuration

  • Version of DeepDetect:
    • [X] Locally compiled on:
      • [X] Ubuntu 14.04 LTS
      • [ ] Mac OSX
      • [ ] Other:
    • [ ] Docker
    • [ ] Amazon AMI
  • Commit (shown by the server when starting): ecdfad8658e5e8f14ac481b5729ea801653321c8

Your question / the problem you're facing:

I launched a training run for an image model. Everything went well during the LMDB creation (no errors seen). However, at some point during training I got a core dump. Note that it happened during the second epoch, so all the data had already been seen and the test set had been predicted once.

Error message (if any) / steps to reproduce the problem:

Here are the logs I obtained when it core dumped:

  • [X] Server log output:
libpng warning: Ignoring bad adaptive filter type
libpng warning: Ignoring bad adaptive filter type
libpng warning: Ignoring bad adaptive filter type
libpng warning: Ignoring bad adaptive filter type
libpng warning: Ignoring bad adaptive filter type
libpng error: IDAT: CRC error
[2020-07-24 10:06:14.222] [caffe] [error] Could not decode datum 
terminate called after throwing an instance of 'CaffeErrorException'
  what():  src/caffe/data_transformer.cpp:895 / Check failed (custom): cv_cropped_image.data
[1]    5337 abort (core dumped)  ./dede --port 8081

I've searched a bit and it might be due to a corrupted image, but if that is the case I don't understand how it worked correctly during the first epoch.

YaYaB avatar Jul 24 '20 10:07 YaYaB

Hi, libpng says it: there's an issue with an image somewhere. The best way to find it is to write a script that decodes every image in the dataset.
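As a rough starting point for such a script, here is a minimal sketch that uses only the Python standard library. It does not fully decode the pixels; it only walks the PNG chunk structure and recomputes each chunk's CRC, which is the kind of failure behind `libpng error: IDAT: CRC error`. The names `check_png` and `find_bad_pngs` are made up for this example and are not part of DeepDetect. A file that passes this check can still fail a real decode (for example, the "bad adaptive filter type" warnings would not be caught here), so a follow-up pass with an actual image decoder is still worthwhile.

```python
import struct
import zlib
from pathlib import Path
from typing import Iterator, Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def check_png(data: bytes) -> Optional[str]:
    """Return None if the chunk structure and all CRCs are valid,
    otherwise a short description of the first problem found."""
    if not data.startswith(PNG_SIGNATURE):
        return "missing PNG signature"
    pos = len(PNG_SIGNATURE)
    while pos < len(data):
        if pos + 8 > len(data):
            return "truncated chunk header"
        # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        name = ctype.decode("latin-1")
        body_end = pos + 8 + length
        if body_end + 4 > len(data):
            return f"truncated {name} chunk"
        (stored_crc,) = struct.unpack(">I", data[body_end:body_end + 4])
        # The CRC covers the chunk type and chunk data, not the length field.
        actual_crc = zlib.crc32(data[pos + 4:body_end]) & 0xFFFFFFFF
        if actual_crc != stored_crc:
            return f"{name}: CRC error"
        if ctype == b"IEND":
            return None
        pos = body_end + 4
    return "missing IEND chunk"


def find_bad_pngs(root: str) -> Iterator[Tuple[Path, str]]:
    """Yield (path, problem) for every .png file under root that fails check_png."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".png":
            problem = check_png(path.read_bytes())
            if problem is not None:
                yield path, problem
```

Looping over `find_bad_pngs("/path/to/dataset")` (a placeholder path) and printing each pair gives a list of candidate files to inspect or remove before rebuilding the LMDB.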

To debug if it's an object detector being trained, you can also try setting this check_size variable to true: https://github.com/jolibrain/deepdetect/blob/master/src/backends/caffe/caffeinputconns.cc#L871

If the two tests above do not show anything wrong, you can try deactivating all the pragmas in this layer, starting here: https://github.com/jolibrain/caffe/blob/master/src/caffe/layers/annotated_data_layer.cpp#L164

But my hunch is you have a bad PNG somewhere. As for why it only showed up in the second epoch, I can't say for sure: data augmentation is randomized and datums are prefetched by three threads.

beniz avatar Jul 27 '20 06:07 beniz

Yeah, I may have some weird PNGs. I tried decoding all of them and it seemed okay, but I'll try again to see.

YaYaB avatar Jul 28 '20 09:07 YaYaB