deepdetect
Caffe services are not using any mean.binaryproto files
Configuration
- Version of DeepDetect:
- [x] Docker
- Commit (shown by the server when starting): 754dedc7d23f6cf75dde15775f73f4aa95641e76, but based on code review the issue is also present in the most recent commit on the master branch.
Your question / the problem you're facing:
Caffe models, when loaded during service creation, do not pick up associated mean.binaryproto files in the same folder.
Even though this line appears to check for the existence of the mean.binaryproto file, the check for _has_mean_file seems to fail here. I verified that by looking back at commit 6a35a4a2e8afc83561dbbc85468177d9778a5217, where that line was originally:
if (_data_mean.count() == 0 && fileops::file_exists(meanfullname))
I patched the current version to use that file check instead of _has_mean_file, and it now loads the mean.binaryproto file correctly.
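For clarity, here is a minimal standalone sketch of the difference between the two checks. It is not DeepDetect's actual code; file_exists below is just a stand-in for fileops::file_exists, and has_mean_file stands in for _has_mean_file:

#include <sys/stat.h>
#include <iostream>
#include <string>

// Stand-in for fileops::file_exists(); probes the filesystem directly.
static bool file_exists(const std::string &path)
{
  struct stat sb;
  return stat(path.c_str(), &sb) == 0;
}

int main()
{
  const std::string repo = "/opt/models/hybridCNN/";
  const std::string meanfullname = repo + "mean.binaryproto";

  // Stand-in for _has_mean_file; in the buggy path it is never set to true.
  bool has_mean_file = false;

  // Flag-based check (current master behavior): never loads the file.
  if (has_mean_file)
    std::cout << "flag check: would load " << meanfullname << std::endl;

  // Direct check, as in commit 6a35a4a2 and the patch: loads the file.
  if (file_exists(meanfullname))
    std::cout << "file check: loading " << meanfullname << std::endl;

  return 0;
}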
I verified that the original master code was not loading the mean file, and that the patched version described above was, by running the dede binary through strace (e.g. strace -o trace -ff ./dede -host 0.0.0.0 -port 8080) and grepping the log files for any reference to the mean.binaryproto file. Without the patch, it never referenced the file, and so never opened it:
$ grep "mean" trace.*
$
With the patch, it does:
$ grep -i "mean" trace*
trace.4701:stat("/opt/models/hybridCNN//mean.binaryproto", {st_mode=S_IFREG|0644, st_size=786446, ...}) = 0
trace.4701:open("/opt/models/hybridCNN//mean.binaryproto", O_RDONLY) = 15
It's a large diff, but you can see the code changes here; check file src/caffeinputconns.h, line 137. I'm not sure why the file check in src/caffemodel.cc doesn't work correctly, though.
Error message (if any) / steps to reproduce the problem:
- [x] list of API calls: Create a service:
curl -X PUT "http://localhost:8080/services/hybridcnn0" -d '{
"mllib":"caffe",
"description":"hybridCNN",
"type":"supervised",
"parameters":{
"input":{
"connector":"image",
"width":227,
"height":227
},
"mllib":{
"nclasses":1183
}
},
"model":{
"repository":"/opt/models/hybridCNN/"
}
}'
Predict a test image on the service:
curl -X POST "http://localhost:8080/predict" -d '{
"service":"hybridcnn0",
"parameters":{
"output":{
"best":2
}
},
"data":["sample_000001.jpg"]
}'
- [ ] Server log output:
I can provide server log output, but it doesn't print anything relevant to this bug. I only verified this by code review and by checking the strace output for all the threads.
Thanks for the thorough report. This should be fixed now. Incredibly long-standing bug.
Awesome! Thanks for the quick turn-around. An additional comment:
It appears that if the model folder contains a mean file that isn't named mean.binaryproto, a predict call returns this:
{
"status": {
"code": 500,
"msg": "InternalError",
"dd_code": 1007,
"dd_msg": "src/caffe/util/io.cpp:63 / Check failed (custom): (fd) != (-1)"
}
}
My binaryproto file was named hybridCNN_mean.binaryproto in that case. When I renamed it to mean.binaryproto, it returned predictions with no error, and I verified by the same steps as above via strace that it in fact loaded the mean file:
trace.188:open("/opt/models/hybridCNN//mean.binaryproto", O_RDONLY) = 15
As a point of reference, when the mean file was not named correctly, strace showed the following:
trace.52:open("/opt/models/hybridCNN//mean.binaryproto", O_RDONLY) = -1 ENOENT (No such file or directory)
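For reference, a minimal standalone sketch of why only that exact name works; the path composition here is assumed from the strace output, not taken from DeepDetect's sources:

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <string>

int main()
{
  // The repository path already ends in '/', hence the "//" in the traces.
  const std::string repo = "/opt/models/hybridCNN/";
  // Only this fixed name is ever composed and probed; a file named
  // hybridCNN_mean.binaryproto is simply never looked at.
  const std::string meanfullname = repo + "/mean.binaryproto";

  int fd = open(meanfullname.c_str(), O_RDONLY);
  if (fd == -1)
    perror(meanfullname.c_str()); // ENOENT when the mean file is misnamed
  else
    close(fd);
  return 0;
}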
Thanks again!
Reopening this issue as it seems like there is still a problem when built with GPU.
When built with GPU support, I see the following:
$ curl -X PUT "http://localhost:${SVCPORT}/services/hybridcnn0" -d '{
"mllib":"caffe",
"description":"hybridCNN",
"type":"supervised",
"parameters":{
"input":{
"connector":"image",
"width":227,
"height":227
},
"mllib":{
"nclasses":1183
}
},
"model":{
"repository":"/opt/models/hybridCNN/"
}
}'|jq
{
"status": {
"code": 201,
"msg": "Created"
}
}
$ curl -X POST "http://localhost:${SVCPORT}/predict" -d '{
"service":"hybridcnn0",
"parameters":{
"output":{
"best":1
}
},
"data":["/opt/dede/cat.jpg"]
}'|jq
{
"status": {
"code": 500,
"msg": "InternalError",
"dd_code": 1007,
"dd_msg": "src/caffe/util/io.cpp:63 / Check failed (custom): (fd) != (-1)"
}
}
When built without GPU support, it works without a problem. Both setups use the exact same model folders and files, and both are on commit e99ee48f94678214b0a84a063059437056558032.
I again used strace to try to see what's going on; with GPU, it's not even trying to open the mean.binaryproto file.
gpulogs$ grep -i "hybrid" *
trace.25:recvmsg(14, {msg_name(0)=NULL, msg_iov(1)=[{"PUT /services/hybridcnn0 HTTP/1."..., 1024}], msg_controllen=0, msg_flags=0}, 0) = 449
trace.25:open("/opt/models/hybridCNN/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 15
trace.25:stat("/opt/models/hybridCNN//hybridCNN_iter_700000.caffemodel", {st_mode=S_IFREG|0664, st_size=246861616, ...}) = 0
trace.25:open("/opt/models/hybridCNN//corresp.txt", O_RDONLY) = 15
trace.25:open("/opt/models/hybridCNN//deploy.prototxt", O_RDONLY) = 15
trace.25:read(15, "name: \"hybridnet\"\nlayer {\n name"..., 8192) = 4478
trace.25:write(1, "INFO - 19:10:18 - hybrid -> data"..., 33) = 33
trace.25:write(1, "INFO - 19:10:18 - hybrid -> labe"..., 34) = 34
trace.25:write(1, "INFO - 19:10:19 - hybrid does no"..., 61) = 61
trace.25:open("/opt/models/hybridCNN//hybridCNN_iter_700000.caffemodel", O_RDONLY) = 15
trace.25:read(15, "\n\thybridnet\22]\nN\n\4data\22\4data\202\1\27tr"..., 8192) = 8192
trace.25:open("/opt/models/hybridCNN//model.json", O_WRONLY|O_CREAT|O_APPEND, 0666) = 15
trace.26:open("/opt/models/hybridCNN//deploy.prototxt", O_RDONLY) = 15
trace.26:read(15, "name: \"hybridnet\"\nlayer {\n name"..., 8192) = 4478
trace.26:write(1, "INFO - 19:10:25 - hybrid -> data"..., 33) = 33
trace.26:write(1, "INFO - 19:10:25 - hybrid -> labe"..., 34) = 34
trace.26:write(1, "INFO - 19:10:26 - hybrid does no"..., 61) = 61
trace.26:open("/opt/models/hybridCNN//hybridCNN_iter_700000.caffemodel", O_RDONLY) = 15
trace.26:read(15, "\n\thybridnet\22]\nN\n\4data\22\4data\202\1\27tr"..., 8192) = 8192
Whereas on CPU it does try, and succeeds:
cpulogs$ grep -i "hybrid" *
trace.35:recvmsg(14, {msg_name(0)=NULL, msg_iov(1)=[{"PUT /services/hybridcnn0 HTTP/1."..., 1024}], msg_controllen=0, msg_flags=0}, 0) = 449
trace.36:open("/opt/models/hybridCNN//deploy.prototxt", O_RDONLY) = 15
trace.36:read(15, "name: \"hybridnet\"\nlayer {\n name"..., 8192) = 4478
trace.36:write(1, "INFO - 19:06:48 - hybrid -> data"..., 33) = 33
trace.36:write(1, "INFO - 19:06:48 - hybrid -> labe"..., 34) = 34
trace.36:write(1, "INFO - 19:06:49 - hybrid does no"..., 61) = 61
trace.36:open("/opt/models/hybridCNN//hybridCNN_iter_700000.caffemodel", O_RDONLY) = 15
trace.36:read(15, "\n\thybridnet\22]\nN\n\4data\22\4data\202\1\27tr"..., 8192) = 8192
trace.36:open("/opt/models/hybridCNN//mean.binaryproto", O_RDONLY) = 15
trace.38:open("/opt/models/hybridCNN/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 15
trace.38:stat("/opt/models/hybridCNN//hybridCNN_iter_700000.caffemodel", {st_mode=S_IFREG|0664, st_size=246861616, ...}) = 0
trace.38:open("/opt/models/hybridCNN//corresp.txt", O_RDONLY) = 15
trace.38:open("/opt/models/hybridCNN//deploy.prototxt", O_RDONLY) = 15
trace.38:read(15, "name: \"hybridnet\"\nlayer {\n name"..., 8192) = 4478
trace.38:write(1, "INFO - 19:06:40 - hybrid -> data"..., 33) = 33
trace.38:write(1, "INFO - 19:06:40 - hybrid -> labe"..., 34) = 34
trace.38:write(1, "INFO - 19:06:41 - hybrid does no"..., 61) = 61
trace.38:open("/opt/models/hybridCNN//hybridCNN_iter_700000.caffemodel", O_RDONLY) = 15
trace.38:read(15, "\n\thybridnet\22]\nN\n\4data\22\4data\202\1\27tr"..., 8192) = 8192
trace.38:open("/opt/models/hybridCNN//model.json", O_WRONLY|O_CREAT|O_APPEND, 0666) = 15
I can also provide the full strace for each if need be.
Make sure you don't have something else in the way, because on predict calls the reading of the mean.binaryproto file occurs within the connector, before anything Caffe-related happens. In other words, at the moment I don't see how this can be related or even correlated to GPU...
I don't see how there could be anything else in the way - I'm basically using the exact same setup, except one is built with GPU support and the other isn't. They both use the same model folders and everything. They're each distinct docker images, so it's easy for me to replicate or test, or send you the images directly.
I'm also not sure how this could be related to the GPU, but as I said above, I'm using the same build environment, settings, and models for each, except that GPU is enabled for TF only (so Caffe is CPU-only even in the GPU docker image).
If it helps, I can attach my build process here just to be certain.