SVM connector training fails

Open YaYaB opened this issue 5 years ago • 2 comments

Configuration

  • Version of DeepDetect:
    • [X] Locally compiled on:
      • [X] Ubuntu 14.04 LTS
      • [ ] Mac OSX
      • [ ] Other:
    • [ ] Docker
    • [ ] Amazon AMI
  • Commit (shown by the server when starting): 073e9a1f5cf5a565ee91c5a9b46a6b8b3afc19f3

Your question / the problem you're facing:

This issue is related to #761, and more precisely to its fix, #762. The fix resolved the inference issue, but it is now impossible for me to train a model using an SVM connector. I have attached some random data in case you want to replicate the problem: just download and extract it. In the next section, PATH_MODEL denotes the path where the model is stored.
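For anyone without access to the attachment, comparable random data can be produced with a short script. This is only a sketch: it assumes the SVM connector reads the usual libsvm text format (`<label> <index>:<value> ...` with 1-based, increasing indices), and the function names, feature count and density are hypothetical, not taken from the attached archive.

```python
import random


def make_svm_lines(n_rows, n_features, n_classes=2, density=0.3, seed=0):
    """Return n_rows lines of random data in libsvm text format."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n_rows):
        label = rng.randrange(n_classes)
        # Keep a sparse subset of 1-based feature indices, in increasing order.
        feats = [i for i in range(1, n_features + 1) if rng.random() < density]
        if not feats:
            # Avoid emitting a row with no features at all.
            feats = [rng.randint(1, n_features)]
        pairs = " ".join(f"{i}:{rng.random():.4f}" for i in feats)
        lines.append(f"{label} {pairs}")
    return lines


def write_svm_file(path, lines):
    """Write the generated lines to path, one sample per line."""
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

Calling `write_svm_file("train.svm", make_svm_lines(200, 10, seed=1))` and the same with a different seed for `test.svm` would give two files that can be placed under PATH_MODEL/data.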

Error message (if any) / steps to reproduce the problem:

Let us first create the service we will use to train:

  • [X] list of API calls:
    curl -X PUT "http://localhost:8082/services/svm_test" -d '{
                "sname": "svm_test",
                "description": "classification model",
                "mllib": "caffe",
                "type": "supervised",
                "parameters": {
                        "input": {
                                "connector": "svm"
                        },
                        "mllib": {
                                "gpu": true,
                                "gpuid": 1,
                                "template": "mlp",
                                "nclasses": 2,
                                "ntargets": null,
                                "layers": [128,64,32],
                                "activation": "relu",
                                "dropout": 0.5,
                                "regression": false,
                                "finetuning": false,
                                "db": true
                        },
                        "output":{}
                },
                "model": {
                        "repository": "PATH_MODEL/bug_svm_prediction",
                        "templates": "../templates/caffe",
                        "weights": null
                }
        }'
  • [X] Server log output:
DeepDetect [ commit 073e9a1f5cf5a565ee91c5a9b46a6b8b3afc19f3 ]
[2020-07-29 00:15:30.030] [api] [info] Running DeepDetect HTTP server on localhost:8082
[2020-07-29 00:16:02.615] [svm_test] [info] Using GPU 1
ETC.
ETC.
[2020-07-29 00:16:03.168] [svm_test] [info] instantiating model template mlp
[2020-07-29 00:16:03.168] [svm_test] [info] source=../templates/caffe/mlp/
[2020-07-29 00:16:03.168] [svm_test] [info] dest=PATH_MODEL/mlp.prototxt
[2020-07-29 00:16:03.170] [api] [info] 127.0.0.1 "PUT /services/svm_test" 201 556

Now we can try launching a training job with an older version of DD (caaeb7866a80a581bef3353218c5394db1af2363). With that version, the training starts as expected.

  • [X] list of API calls:
curl -X POST "http://127.0.0.1:8082/train" -d '{
                "service": "svm_test",
                "async": false,
                "data": [
                        "PATH_MODEL/data/train.svm",
                        "PATH_MODEL/data/test.svm"
                ],
                "parameters":{
                        "input": {
                                "db": true
                        },
                        "mllib": {
                                "gpu": true,
                                "resume": false,
                                "ignore_label": null,
                                "solver": {
                                        "iterations": 1000,
                                        "snapshot": 500,
                                        "snapshot_prefix": null,
                                        "solver_type": "ADAM",
                                        "test_interval": 100,
                                        "test_initialization": false,
                                        "lr_policy": "step",
                                        "base_lr": 0.001,
                                        "gamma": 0.1,
                                        "stepsize": 100,
                                        "momentum": 0.9,
                                        "weight_decay": 0.00001,
                                        "power": null,
                                        "iter_size": 1
                                },
                                "net": {
                                        "batch_size": 1,
                                        "test_batch_size": 1
                                }
                        },
                        "output": {
                                "best": 2,
                                "measure": ["accp", "mcll", "f1", "mcc"]
                        }
                }
        }'
  • [X] Server log output:
[2020-07-29 00:25:17.238] [svm_test] [info] Net total flops=10560 / total params=10560
[2020-07-29 00:25:17.238] [svm_test] [info] detected network type is classification
[2020-07-29 00:25:17.238] [caffe] [info] Opened lmdb PATH_MODEL/bug_svm_prediction/test.lmdb
[2020-07-29 00:25:17.244] [api] [info] 127.0.0.1 "POST /train" 201 1297

However, with the new version (commit 073e9a1f5cf5a565ee91c5a9b46a6b8b3afc19f3), the training fails almost immediately.

  • [X] list of API calls:
curl -X POST "http://127.0.0.1:8082/train" -d '{
                "service": "svm_test",
                "async": false,
                "data": [
                        "PATH_MODEL/data/train.svm",
                        "PATH_MODEL/data/test.svm"
                ],
                "parameters":{
                        "input": {
                                "db": true
                        },
                        "mllib": {
                                "gpu": true,
                                "resume": false,
                                "ignore_label": null,
                                "solver": {
                                        "iterations": 1000,
                                        "snapshot": 500,
                                        "snapshot_prefix": null,
                                        "solver_type": "ADAM",
                                        "test_interval": 100,
                                        "test_initialization": false,
                                        "lr_policy": "step",
                                        "base_lr": 0.001,
                                        "gamma": 0.1,
                                        "stepsize": 100,
                                        "momentum": 0.9,
                                        "weight_decay": 0.00001,
                                        "power": null,
                                        "iter_size": 1
                                },
                                "net": {
                                        "batch_size": 1,
                                        "test_batch_size": 1
                                }
                        },
                        "output": {
                                "best": 2,
                                "measure": ["accp", "mcll", "f1", "mcc"]
                        }
                }
        }'
  • [X] Server log output:
{"status":{"code":500,"msg":"InternalError","dd_code":1007,"dd_msg":"./include/caffe/util/db_lmdb.hpp:15 / Check failed (custom): (mdb_status) == (0)"}}

[2020-07-29 00:20:37.494] [svm_test] [info] detected network type is classification
[2020-07-29 00:20:37.505] [svm_test] [info] Iteration 0, lr = 0.001, smoothed_loss=0.523027
[2020-07-29 00:20:37.562] [caffe] [info] Ignoring source layer prob
[2020-07-29 00:20:37.562] [svm_test] [error] Error while filling up network for testing
[2020-07-29 00:20:37.565] [svm_test] [error] training call failed
[2020-07-29 00:20:37.565] [api] [error] 127.0.0.1 "POST /train" 500 814
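When scripting the reproduction, the JSON error body above can be inspected programmatically. The sketch below only uses the fields visible in that response (`code`, `dd_code`, `dd_msg`). The helper name and the `lmdb_failure` heuristic (looking for `db_lmdb` in the message) are assumptions for illustration, not part of the DeepDetect API.

```python
import json


def summarize_dd_error(body):
    """Extract the status fields from a DeepDetect JSON error response."""
    status = json.loads(body).get("status", {})
    dd_msg = status.get("dd_msg", "")
    return {
        "http_code": status.get("code"),
        "dd_code": status.get("dd_code"),
        "dd_msg": dd_msg,
        # Heuristic (assumption): the failing check lives in Caffe's LMDB wrapper.
        "lmdb_failure": "db_lmdb" in dd_msg,
    }


# Response body copied from the failing /train call above.
response = (
    '{"status":{"code":500,"msg":"InternalError","dd_code":1007,'
    '"dd_msg":"./include/caffe/util/db_lmdb.hpp:15 / '
    'Check failed (custom): (mdb_status) == (0)"}}'
)
summary = summarize_dd_error(response)
print(summary["http_code"], summary["dd_code"], summary["lmdb_failure"])
```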

bug_svm_training.zip

YaYaB avatar Jul 28 '20 22:07 YaYaB

Ah, see #765. I had tested in-memory training only. To add svm + db training to unit tests I had to add test_split support to SVM + db training first, so this is what #765 does now. I've left a test for in-memory svm training.
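As an illustration of the test_split path mentioned above, a train request for svm + db training might be built as follows. This is a sketch under assumptions: it supposes `test_split` is passed under `parameters.input` as with other DeepDetect input connectors, and the helper name, the 0.1 split value and the reduced solver settings are hypothetical.

```python
import json


def build_train_request(service, train_file, test_split=0.1):
    """Build a /train payload that holds out part of one SVM file for testing."""
    return {
        "service": service,
        "async": False,
        # A single data file: the test set is split off by the connector.
        "data": [train_file],
        "parameters": {
            # Assumption: test_split sits next to db in the input parameters.
            "input": {"db": True, "test_split": test_split},
            "mllib": {
                "gpu": True,
                "solver": {"iterations": 1000, "test_interval": 100},
                "net": {"batch_size": 1, "test_batch_size": 1},
            },
            "output": {"measure": ["accp", "mcll", "f1", "mcc"]},
        },
    }


payload = build_train_request("svm_test", "PATH_MODEL/data/train.svm")
print(json.dumps(payload, indent=2))
```

The printed JSON can be used as the `-d` body of the `curl -X POST .../train` call shown earlier.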

beniz avatar Jul 29 '20 11:07 beniz

It seems to work on my side. Thanks!

YaYaB avatar Jul 29 '20 23:07 YaYaB