deepdetect
Caffe services are not using any mean.binaryproto files
Configuration
- Version of DeepDetect:
- [x] Docker
- Commit (shown by the server when starting): 754dedc7d23f6cf75dde15775f73f4aa95641e76, but based on code review the issue is also present in the most recent commit on the master branch.
Your question / the problem you're facing:
Caffe models, when loaded during service creation, do not pick up associated mean.binaryproto files in the same folder.
Even though this line appears to check for the existence of the mean.binaryproto file, the check for _has_mean_file seems to fail here. I verified that by looking back at commit 6a35a4a2e8afc83561dbbc85468177d9778a5217, where that line was originally:
if (_data_mean.count() == 0 && fileops::file_exists(meanfullname))
I patched the current version to use that file check instead of _has_mean_file, and it now loads the mean.binaryproto file correctly.
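For clarity, here is a minimal standalone sketch of the difference between the two checks. It is not DeepDetect's actual code; file_exists below is just a stand-in for fileops::file_exists, and has_mean_file stands in for _has_mean_file:

#include <sys/stat.h>
#include <iostream>
#include <string>

// Stand-in for fileops::file_exists(); probes the filesystem directly.
static bool file_exists(const std::string &path)
{
  struct stat sb;
  return stat(path.c_str(), &sb) == 0;
}

int main()
{
  const std::string repo = "/opt/models/hybridCNN/";
  const std::string meanfullname = repo + "mean.binaryproto";

  // Stand-in for _has_mean_file; in the buggy path it is never set to true.
  bool has_mean_file = false;

  // Flag-based check (current master behavior): never loads the file.
  if (has_mean_file)
    std::cout << "flag check: would load " << meanfullname << std::endl;

  // Direct check, as in commit 6a35a4a2 and the patch: loads the file.
  if (file_exists(meanfullname))
    std::cout << "file check: loading " << meanfullname << std::endl;

  return 0;
}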
I verified that the original master code was not loading the mean file, and that the patched version described above was, by running the dede binary through strace (e.g. strace -o trace -ff ./dede -host 0.0.0.0 -port 8080) and grepping the log files for any reference to the mean.binaryproto file. Without the patch, it never referenced the file, and so never opened it:
$ grep "mean" trace.*
$
With the patch, it does:
$ grep -i "mean" trace*
trace.4701:stat("/opt/models/hybridCNN//mean.binaryproto", {st_mode=S_IFREG|0644, st_size=786446, ...}) = 0
trace.4701:open("/opt/models/hybridCNN//mean.binaryproto", O_RDONLY) = 15
It's a large diff, but you can see the code changes here; check file src/caffeinputconns.h, line 137. I'm not sure why the file check in src/caffemodel.cc doesn't work correctly, though.
Error message (if any) / steps to reproduce the problem:
- [x] list of API calls: Create a service:
curl -X PUT "http://localhost:8080/services/hybridcnn0" -d '{
"mllib":"caffe",
"description":"hybridCNN",
"type":"supervised",
"parameters":{
"input":{
"connector":"image",
"width":227,
"height":227
},
"mllib":{
"nclasses":1183
}
},
"model":{
"repository":"/opt/models/hybridCNN/"
}
}'
Predict a test image on the service:
curl -X POST "http://localhost:8080/predict" -d '{
"service":"hybridcnn0",
"parameters":{
"output":{
"best":2
}
},
"data":["sample_000001.jpg"]
}'
- [ ] Server log output:
I can provide server log output, but it doesn't print anything relevant to this bug. I only verified this by code review and by checking the strace output for all the threads.
Thanks for the thorough report. This should be fixed now. Incredibly long-standing bug.
Awesome! Thanks for the quick turn-around. An additional comment:
It appears that if the model folder contains a mean file that isn't named mean.binaryproto, a predict call returns this:
{
"status": {
"code": 500,
"msg": "InternalError",
"dd_code": 1007,
"dd_msg": "src/caffe/util/io.cpp:63 / Check failed (custom): (fd) != (-1)"
}
}
My binaryproto file was named hybridCNN_mean.binaryproto in that case. When I renamed it to mean.binaryproto, it returned predictions with no error, and I verified by the same steps as above via strace that it in fact loaded the mean file:
trace.188:open("/opt/models/hybridCNN//mean.binaryproto", O_RDONLY) = 15
As a point of reference, when the mean file was not named correctly, strace showed the following:
trace.52:open("/opt/models/hybridCNN//mean.binaryproto", O_RDONLY) = -1 ENOENT (No such file or directory)
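For reference, a minimal standalone sketch of why only that exact name works; the path composition here is assumed from the strace output, not taken from DeepDetect's sources:

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <string>

int main()
{
  // The repository path already ends in '/', hence the "//" in the traces.
  const std::string repo = "/opt/models/hybridCNN/";
  // Only this fixed name is ever composed and probed; a file named
  // hybridCNN_mean.binaryproto is simply never looked at.
  const std::string meanfullname = repo + "/mean.binaryproto";

  int fd = open(meanfullname.c_str(), O_RDONLY);
  if (fd == -1)
    perror(meanfullname.c_str()); // ENOENT when the mean file is misnamed
  else
    close(fd);
  return 0;
}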
Thanks again!
Reopening this issue as it seems like there is still a problem when built with GPU.
When built with GPU support, I see the following:
$ curl -X PUT "http://localhost:${SVCPORT}/services/hybridcnn0" -d '{
"mllib":"caffe",
"description":"hybridCNN",
"type":"supervised",
"parameters":{
"input":{
"connector":"image",
"width":227,
"height":227
},
"mllib":{
"nclasses":1183
}
},
"model":{
"repository":"/opt/models/hybridCNN/"
}
}'|jq
{
"status": {
"code": 201,
"msg": "Created"
}
}
$ curl -X POST "http://localhost:${SVCPORT}/predict" -d '{
"service":"hybridcnn0",
"parameters":{
"output":{
"best":1
}
},
"data":["/opt/dede/cat.jpg"]
}'|jq
{
"status": {
"code": 500,
"msg": "InternalError",
"dd_code": 1007,
"dd_msg": "src/caffe/util/io.cpp:63 / Check failed (custom): (fd) != (-1)"
}
}
When built without GPU support, it works without a problem. Both setups use the exact same model folders and files, and both are on commit e99ee48f94678214b0a84a063059437056558032.
I again used strace to try to see what's going on; with GPU, it's not even trying to open the mean.binaryproto file.
gpulogs$ grep -i "hybrid" *
trace.25:recvmsg(14, {msg_name(0)=NULL, msg_iov(1)=[{"PUT /services/hybridcnn0 HTTP/1."..., 1024}], msg_controllen=0, msg_flags=0}, 0) = 449
trace.25:open("/opt/models/hybridCNN/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 15
trace.25:stat("/opt/models/hybridCNN//hybridCNN_iter_700000.caffemodel", {st_mode=S_IFREG|0664, st_size=246861616, ...}) = 0
trace.25:open("/opt/models/hybridCNN//corresp.txt", O_RDONLY) = 15
trace.25:open("/opt/models/hybridCNN//deploy.prototxt", O_RDONLY) = 15
trace.25:read(15, "name: \"hybridnet\"\nlayer {\n name"..., 8192) = 4478
trace.25:write(1, "INFO - 19:10:18 - hybrid -> data"..., 33) = 33
trace.25:write(1, "INFO - 19:10:18 - hybrid -> labe"..., 34) = 34
trace.25:write(1, "INFO - 19:10:19 - hybrid does no"..., 61) = 61
trace.25:open("/opt/models/hybridCNN//hybridCNN_iter_700000.caffemodel", O_RDONLY) = 15
trace.25:read(15, "\n\thybridnet\22]\nN\n\4data\22\4data\202\1\27tr"..., 8192) = 8192
trace.25:open("/opt/models/hybridCNN//model.json", O_WRONLY|O_CREAT|O_APPEND, 0666) = 15
trace.26:open("/opt/models/hybridCNN//deploy.prototxt", O_RDONLY) = 15
trace.26:read(15, "name: \"hybridnet\"\nlayer {\n name"..., 8192) = 4478
trace.26:write(1, "INFO - 19:10:25 - hybrid -> data"..., 33) = 33
trace.26:write(1, "INFO - 19:10:25 - hybrid -> labe"..., 34) = 34
trace.26:write(1, "INFO - 19:10:26 - hybrid does no"..., 61) = 61
trace.26:open("/opt/models/hybridCNN//hybridCNN_iter_700000.caffemodel", O_RDONLY) = 15
trace.26:read(15, "\n\thybridnet\22]\nN\n\4data\22\4data\202\1\27tr"..., 8192) = 8192
Whereas on CPU it does try, and succeeds:
cpulogs$ grep -i "hybrid" *
trace.35:recvmsg(14, {msg_name(0)=NULL, msg_iov(1)=[{"PUT /services/hybridcnn0 HTTP/1."..., 1024}], msg_controllen=0, msg_flags=0}, 0) = 449
trace.36:open("/opt/models/hybridCNN//deploy.prototxt", O_RDONLY) = 15
trace.36:read(15, "name: \"hybridnet\"\nlayer {\n name"..., 8192) = 4478
trace.36:write(1, "INFO - 19:06:48 - hybrid -> data"..., 33) = 33
trace.36:write(1, "INFO - 19:06:48 - hybrid -> labe"..., 34) = 34
trace.36:write(1, "INFO - 19:06:49 - hybrid does no"..., 61) = 61
trace.36:open("/opt/models/hybridCNN//hybridCNN_iter_700000.caffemodel", O_RDONLY) = 15
trace.36:read(15, "\n\thybridnet\22]\nN\n\4data\22\4data\202\1\27tr"..., 8192) = 8192
trace.36:open("/opt/models/hybridCNN//mean.binaryproto", O_RDONLY) = 15
trace.38:open("/opt/models/hybridCNN/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 15
trace.38:stat("/opt/models/hybridCNN//hybridCNN_iter_700000.caffemodel", {st_mode=S_IFREG|0664, st_size=246861616, ...}) = 0
trace.38:open("/opt/models/hybridCNN//corresp.txt", O_RDONLY) = 15
trace.38:open("/opt/models/hybridCNN//deploy.prototxt", O_RDONLY) = 15
trace.38:read(15, "name: \"hybridnet\"\nlayer {\n name"..., 8192) = 4478
trace.38:write(1, "INFO - 19:06:40 - hybrid -> data"..., 33) = 33
trace.38:write(1, "INFO - 19:06:40 - hybrid -> labe"..., 34) = 34
trace.38:write(1, "INFO - 19:06:41 - hybrid does no"..., 61) = 61
trace.38:open("/opt/models/hybridCNN//hybridCNN_iter_700000.caffemodel", O_RDONLY) = 15
trace.38:read(15, "\n\thybridnet\22]\nN\n\4data\22\4data\202\1\27tr"..., 8192) = 8192
trace.38:open("/opt/models/hybridCNN//model.json", O_WRONLY|O_CREAT|O_APPEND, 0666) = 15
I can also provide the full strace for each if need be.
Make sure you don't have something else in the way, because on predict calls the reading of the mean.binaryproto file occurs within the connector, before anything Caffe-related happens. In other words, at the moment I don't see how this can be related or even correlated to GPU...
I don't see how there could be anything else in the way - I'm basically using the exact same setup, except one is built with GPU support and the other isn't. They both use the same model folders and everything. They're each distinct docker images, so it's easy for me to replicate or test, or send you the images directly.
I'm also not sure how this could be related to the GPU, but as I said above, I'm using the same build environment, settings, and models for each, except that GPU is enabled for TF only (so Caffe is CPU-only even in the GPU docker image).
If it helps, I can attach my build process here just to be certain.