Configuration

Version of DeepDetect:
- [ ] Locally compiled on:
  - [ ] Ubuntu 18.04 LTS
  - [ ] Other:
- [ ] Docker CPU
- [x] Docker GPU
- [ ] Amazon AMI
Commit (shown by the server when starting): master @ be79e543a5f7c73949e1d5fbe97a4d2890548c3c

Your question / the problem you're facing:

I'm trying to run a Jupyter notebook with the following code:

from dd_widgets import Classification, CSV, Text, Segmentation, Detection, OCR, TSNE_CSV

ocr = OCR(
    'word_mnist',
    training_repo='/opt/platform/examples/word_mnist/train.txt',
    testing_repo='/opt/platform/examples/word_mnist/test.txt',
    host='deepdetect',
    port=8080,
    img_height=80,
    img_width=128,
    model_repo='/opt/platform/models/training/examples/words_mnist',
    nclasses=100,
    template='crnn',
    iterations=10000,
    test_interval=1000,
    snapshot_interval=1000,
    batch_size=128,
    test_batch_size=32,
    noise_prob=0.001,
    distort_prob=0.001,
    gpuid=1,
    base_lr=0.0001,
    solver_type='ADAM',
    mirror=False,
    rotate=False,
    resume=False
)
ocr

Error message (if any) / steps to reproduce the problem:

Outputs after running trainer:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/dd_widgets/widgets.py in fun_wrapper(*args, **kwargs)
     45             self.output.clear_output()
     46             with self.output:
---> 47                 res = fun(*args, **kwargs)
     48                 try:
     49                     print(json.dumps(res, indent=2))

/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in run(self, *_)
    138 
    139     def run(self, *_) -> JSONType:
--> 140         self._create()
    141         return self.train(resume=False)
    142 

/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in _create(self, *_)
     90         host = self.host.value
     91         port = self.port.value
---> 92         body = self._create_service_body()
     93 
     94         sname_dict = dict(

/opt/conda/lib/python3.8/site-packages/dd_widgets/widgets.py in _create_service_body(self)
    128                     {
    129                         "input": {
--> 130                             **self._create_parameters_input(),
    131                             **self._append_create_parameters_input,
    132                         },

/opt/conda/lib/python3.8/site-packages/dd_widgets/mixins.py in _create_parameters_input(self)
    268             "height": height,
    269             "bw": self.bw.value,
--> 270             "histogram_equalization": self.histogram_equalization.value,
    271             "db": True,
    272         }

AttributeError: 'OCR' object has no attribute 'histogram_equalization'

Adding to /opt/conda/lib/python3.8/site-packages/dd_widgets/ocr.py:

histogram_equalization: bool = False,
rgb: bool = False,

Results in:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/dd_widgets/widgets.py in fun_wrapper(*args, **kwargs)
     45             self.output.clear_output()
     46             with self.output:
---> 47                 res = fun(*args, **kwargs)
     48                 try:
     49                     print(json.dumps(res, indent=2))

/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in run(self, *_)
    138 
    139     def run(self, *_) -> JSONType:
--> 140         self._create()
    141         return self.train(resume=False)
    142 

/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in _create(self, *_)
    124                 )
    125             )
--> 126             raise RuntimeError(
    127                 "Error code {code}: {msg}".format(
    128                     code=c.json()["status"]["dd_code"],

RuntimeError: Error code 1007: src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)

Aug 08 '22 08:08 iamdroppy

Hi, this is a dd_widgets issue that has been fixed on the latest master, but the packages have not been updated yet; sorry for the inconvenience. How did you install dd_widgets? Until we update the packages, you can work around the issue by installing dd_widgets from the latest master.

Aug 08 '22 09:08 Bycob

The second error message still remains, though. I've updated the container directly.

Apologies for posting in the wrong repo; I was unsure. I'd guess the second error belongs to this one?

If you happen to have any clue what's going on, I'll be immensely thankful. I've been trying to deploy all morning, but the error messages aren't really helpful...

Kind regards, Lucca Ferri

Aug 08 '22 10:08 iamdroppy

The second message is a DD error indicating that something is wrong with the GPU. You can run nvidia-smi to check for problems, make sure you run the DD GPU Docker image with nvidia-docker installed, and check that there is enough memory available on your GPU.
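A minimal sketch of that check in Python, assuming nvidia-smi is on the PATH of whichever environment you run it in (host or container). The helper names parse_gpu_csv and list_gpus are illustrative and not part of DeepDetect or dd_widgets:

```python
import subprocess
from typing import Dict, List

# Standard nvidia-smi query flags: one CSV row per GPU, memory values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text: str) -> List[Dict[str, object]]:
    """Turn the CSV rows printed by the query above into dictionaries."""
    gpus = []
    for line in text.strip().splitlines():
        # Split from both ends so a comma inside the GPU name cannot break parsing.
        index, rest = line.split(",", 1)
        name, used, total = rest.rsplit(",", 2)
        gpus.append(
            {
                "index": int(index),
                "name": name.strip(),
                "used_mib": int(used),
                "free_mib": int(total) - int(used),
            }
        )
    return gpus


def list_gpus() -> List[Dict[str, object]]:
    """Run nvidia-smi and return one entry per visible GPU (raises if it fails)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

Calling list_gpus() from inside the deepdetect container and again on the host should show whether the container sees the same devices and how much memory is left on each.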

Aug 08 '22 11:08 Bycob

@Bycob sorry to take up your time, but I'm really lost on one thing:

nvidia-smi returns OK, and nvidia-docker is installed as well.

My question is: is an RTX 3080 enough to support this for testing and small datasets?

Kind regards, and once again thanks for the support. I can't thank you enough for pointing me in the right direction!

Aug 08 '22 16:08 iamdroppy

An RTX 3080 should be perfectly fine. I will try to reproduce and get back to you. Any log or system information would be helpful, especially the deepdetect server logs.

Aug 09 '22 16:08 Bycob

GPU

My current issues with DeepDetect (note that I will update this post as I keep trying):

dd_widgets

I'm updating dd_widgets with:

$ docker exec -it jupyter_dd /bin/bash
$ ~: apt update -y; apt install vim git -y; \
       git clone https://github.com/jolibrain/dd_widgets; \
       rm -rf /opt/conda/lib/python3.8/site-packages/dd_widgets/; \
       mv dd_widgets/dd_widgets/ /opt/conda/lib/python3.8/site-packages/

deepdetect

DeepDetect logs the following:

deepdetect_1      | [2022-08-09 17:19:40.732] [api] [error] service not found: "word_mnist"
deepdetect_1      | [2022-08-09 17:19:40.732] [api] [error] HTTP/1.1 "GET //services/word_mnist" <n/a> 404 0ms
deepdetect_1      | [2022-08-09 17:19:40.738] [word_mnist] [info] Using GPU 1
deepdetect_1      | [2022-08-09 17:19:40.738] [word_mnist] [error] service creation call failed: Dynamic exception type: CaffeErrorException
deepdetect_1      | std::exception::what: src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)
deepdetect_1      | 
deepdetect_1      | [2022-08-09 17:19:40.738] [api] [error] HTTP/1.1 "PUT //services/word_mnist" <n/a> 500 0ms
deepdetect_1      | [2022-08-09 17:19:41.335] [api] [info] HTTP/1.1 "GET /info" <n/a> 200 0ms

Meanwhile, this stack trace is shown in the UI:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/dd_widgets/widgets.py in fun_wrapper(*args, **kwargs)
     45             self.output.clear_output()
     46             with self.output:
---> 47                 res = fun(*args, **kwargs)
     48                 try:
     49                     print(json.dumps(res, indent=2))

/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in run(self, *_)
    138 
    139     def run(self, *_) -> JSONType:
--> 140         self._create()
    141         return self.train(resume=False)
    142 

/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in _create(self, *_)
    124                 )
    125             )
--> 126             raise RuntimeError(
    127                 "Error code {code}: {msg}".format(
    128                     code=c.json()["status"]["dd_code"],

RuntimeError: Error code 1007: src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)

CPU

Currently having issues with the preloaded models:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/dd_widgets/widgets.py in fun_wrapper(*args, **kwargs)
     45             self.output.clear_output()
     46             with self.output:
---> 47                 res = fun(*args, **kwargs)
     48                 try:
     49                     print(json.dumps(res, indent=2))

/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in run(self, *_)
    138 
    139     def run(self, *_) -> JSONType:
--> 140         self._create()
    141         return self.train(resume=False)
    142 

/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in _create(self, *_)
    124                 )
    125             )
--> 126             raise RuntimeError(
    127                 "Error code {code}: {msg}".format(
    128                     code=c.json()["status"]["dd_code"],

RuntimeError: Error code 1006: Service Bad Request Error: using template while model prototxt and network weights exist, remove 'template' from 'mllib' or remove prototxt files instead instead ?
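The 1006 message above points at network files that already exist in the model repository. A minimal sketch for listing them before deciding what to remove, assuming Caffe's usual .prototxt and .caffemodel extensions. find_network_files is a hypothetical helper, not a DeepDetect or dd_widgets function:

```python
from pathlib import Path
from typing import List

# Assumption: network definitions end in .prototxt and weights in .caffemodel.
NETWORK_SUFFIXES = {".prototxt", ".caffemodel"}


def find_network_files(model_repo: str) -> List[Path]:
    """Return the network definition and weight files found directly in model_repo."""
    repo = Path(model_repo)
    return sorted(
        path
        for path in repo.iterdir()
        if path.is_file() and path.suffix in NETWORK_SUFFIXES
    )
```

For the repository in this thread, the call would be find_network_files('/opt/platform/models/training/examples/words_mnist'). The function only lists files and deletes nothing.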

But even after fixing that, it reports that it's training, yet metrics.json keeps returning the same values:

(base) root@9dcc26872ff5:/opt/platform/models/training/examples/words_mnist# cat metrics.json 
{"status":{"code":200,"msg":"OK"},"head":{"method":"/train","job":1,"status":"running","time":65.0},"body":{"sname":"word_mnist","mltype":"ctc","measure_hist":{"train_loss_hist":[70.4117202758789,71.0136489868164,50.909339904785159],"elapsed_time_ms_hist":[16774.0,32144.0,49401.0],"learning_rate_hist":[0.00009999999747378752,0.00009999999747378752,0.00009999999747378752]},"description":"word_mnist","measure_sampling":{},"measure":{"test_names":{},"iteration":3.0,"elapsed_time_ms":49401.0,"remain_time_str":"1d:23h:33m:15s","train_loss":50.909339904785159,"flops":2041736704,"iter_time":17123.0,"iteration_duration_ms":17123.0,"remain_time":171195.765625,"learning_rate":0.00009999999747378752},"model":{"repository":"/opt/platform/models/training/examples/words_mnist"}}}

Beautified:

{
  "status": {
    "code": 200,
    "msg": "OK"
  },
  "head": {
    "method": "/train",
    "job": 1,
    "status": "running",
    "time": 65
  },
  "body": {
    "sname": "word_mnist",
    "mltype": "ctc",
    "measure_hist": {
      "train_loss_hist": [
        70.4117202758789,
        71.0136489868164,
        50.909339904785156
      ],
      "elapsed_time_ms_hist": [
        16774,
        32144,
        49401
      ],
      "learning_rate_hist": [
        0.00009999999747378752,
        0.00009999999747378752,
        0.00009999999747378752
      ]
    },
    "description": "word_mnist",
    "measure_sampling": {},
    "measure": {
      "test_names": {},
      "iteration": 3,
      "elapsed_time_ms": 49401,
      "remain_time_str": "1d:23h:33m:15s",
      "train_loss": 50.909339904785156,
      "flops": 2041736704,
      "iter_time": 17123,
      "iteration_duration_ms": 17123,
      "remain_time": 171195.765625,
      "learning_rate": 0.00009999999747378752
    },
    "model": {
      "repository": "/opt/platform/models/training/examples/words_mnist"
    }
  }
}

No updates happen there. It may be normal; I'm still learning the software and hope to get a good grasp of it soon.

Thanks again for your time! Kind Regards, Lucca Ferri

Aug 09 '22 17:08 iamdroppy

CPU training is really slow, that's why you may not see the metrics.json change often. From what you sent, it looks like the model is training normally.
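To see how slow, here is a minimal sketch that reads the fields visible in the metrics.json posted above. summarize_metrics is an illustrative helper, and it assumes the file keeps the same layout as in that post:

```python
import json
from typing import Dict


def summarize_metrics(raw: str) -> Dict[str, object]:
    """Extract the job status, iteration count, loss and time per iteration."""
    doc = json.loads(raw)
    measure = doc["body"]["measure"]
    return {
        "status": doc["head"]["status"],
        "iteration": int(measure["iteration"]),
        "train_loss": measure["train_loss"],
        # iteration_duration_ms is reported in milliseconds.
        "seconds_per_iteration": measure["iteration_duration_ms"] / 1000.0,
    }
```

Applied to the posted file, this gives iteration 3 at roughly 17 seconds per iteration, so the file cannot change more often than that, and 10000 iterations at that pace is in line with the reported remaining time of almost two days.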

For GPU training, do you have multiple GPUs on your machine? If not, try changing gpuid to 0.
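A minimal sketch for checking a gpuid against what the driver reports, assuming you paste in the output of `nvidia-smi -L` (one "GPU <index>: ..." line per device). gpu_indices and check_gpuid are illustrative names, not part of DeepDetect:

```python
import re
from typing import List


def gpu_indices(listing: str) -> List[int]:
    """Extract the device indices from `nvidia-smi -L` style output."""
    return [
        int(match.group(1))
        for match in re.finditer(r"^GPU (\d+):", listing, flags=re.MULTILINE)
    ]


def check_gpuid(gpuid: int, listing: str) -> bool:
    """Return True when gpuid matches one of the listed devices."""
    return gpuid in gpu_indices(listing)
```

On a machine with a single GPU the listing contains only index 0, so check_gpuid(1, listing) is False and check_gpuid(0, listing) is True.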

Aug 10 '22 08:08 Bycob

For some reason, the same happens on gpuid 0...

My nvidia-smi:

Wed Aug 10 11:54:50 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   42C    P5    29W / 320W |    599MiB / 10240MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     12411      G   /usr/lib/xorg/Xorg                332MiB |
|    0   N/A  N/A     12581      G   /usr/bin/gnome-shell               61MiB |
|    0   N/A  N/A    134001      G   ...0/usr/lib/firefox/firefox      204MiB |
+-----------------------------------------------------------------------------+

Docker nvidia-docker:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   43C    P8    23W / 320W |    601MiB / 10240MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Aug 10 '22 14:08 iamdroppy

Exactly the same error? Do you have the DD logs?

Aug 10 '22 16:08 Bycob

Hello again @Bycob, yes, same error. Logs:

deepdetect_1      | [2022-08-10 18:04:19.858] [api] [error] service not found: "word_mnist"
deepdetect_1      | [2022-08-10 18:04:19.858] [api] [error] HTTP/1.1 "GET //services/word_mnist" <n/a> 404 0ms
deepdetect_1      | [2022-08-10 18:04:19.863] [word_mnist] [info] Using GPU 0
deepdetect_1      | [2022-08-10 18:04:19.863] [word_mnist] [error] service creation call failed: Dynamic exception type: CaffeErrorException
deepdetect_1      | std::exception::what: src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)
deepdetect_1      | 
deepdetect_1      | [2022-08-10 18:04:19.863] [api] [error] HTTP/1.1 "PUT //services/word_mnist" <n/a> 500 0ms

UI:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/dd_widgets/widgets.py in fun_wrapper(*args, **kwargs)
     45             self.output.clear_output()
     46             with self.output:
---> 47                 res = fun(*args, **kwargs)
     48                 try:
     49                     print(json.dumps(res, indent=2))

/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in run(self, *_)
    138 
    139     def run(self, *_) -> JSONType:
--> 140         self._create()
    141         return self.train(resume=False)
    142 

/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in _create(self, *_)
    124                 )
    125             )
--> 126             raise RuntimeError(
    127                 "Error code {code}: {msg}".format(
    128                     code=c.json()["status"]["dd_code"],

RuntimeError: Error code 1007: src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)

When I try to publish the archived service:

Error while publishing service

InternalError: src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)

Strange thing is, it appears as an Archived Job! Progress!

Edit: deepserver/info:

{
  "dd_msg": null,
  "status": null,
  "head": {
    "method": "/info",
    "build-type": "dev",
    "version": "v0.21.0-dirty",
    "branch": "heads/v0.21.0",
    "commit": "385122d4eace490ab95fa7a7b9ed92121af1414e",
    "compile_flags": "USE_CAFFE2=OFF USE_TF=OFF USE_NCNN=OFF USE_TORCH=OFF USE_HDF5=ON USE_CAFFE=ON USE_TENSORRT=OFF USE_TENSORRT_OSS=OFF USE_DLIB=OFF USE_CUDA_CV=OFF USE_SIMSEARCH=ON USE_ANNOY=OFF USE_FAISS=ON USE_COMMAND_LINE=ON USE_JSON_API=ON USE_HTTP_SERVER=OFF USE_CUDA_CV=OFF",
    "deps_version": "OPENCV_VERSION=4.2.0 CUDA_VERSION=11.1 CUDNN_VERSION=8.0.5 TENSORRT_VERSION=",
    "services": []
  },
  "body": null
}
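The compile_flags and deps_version strings in the response above are space-separated KEY=VALUE pairs, so they can be split into a dictionary for easier inspection. A minimal sketch; parse_flags is an illustrative helper and not part of the DeepDetect API:

```python
from typing import Dict


def parse_flags(flags: str) -> Dict[str, str]:
    """Split a 'KEY=VALUE KEY=VALUE' string into a dictionary.

    A key that appears twice keeps its last value, and an empty value
    (such as 'TENSORRT_VERSION=') becomes an empty string.
    """
    parsed = {}
    for token in flags.split():
        key, _, value = token.partition("=")
        parsed[key] = value
    return parsed
```

With the strings from this response, parse_flags(compile_flags) reports USE_CAFFE as ON and USE_TORCH as OFF, and parse_flags(deps_version) reports CUDA_VERSION as 11.1.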

Edit: when creating a service, it says "No gpu found for deepdetect server.", but a GPU is detected on the right side of the panel (along with its temperature, etc.).

Aug 10 '22 18:08 iamdroppy

Update: testing now with Ubuntu 20.04 LTS; I will report back with the results.

Aug 12 '22 16:08 iamdroppy

Update: it's all working now. Tomorrow I'll edit this post to explain how I got it working.

Aug 15 '22 05:08 iamdroppy

As promised (apologies for the delay).

/code/gpu

.env:

DD_PLATFORM=./../..
DD_SERVER_TAG=latest
DD_SERVER_IMAGE=gpu_torch
DD_PLATFORM_UI_TAG=latest
DD_JUPYTER_TAG=latest
DD_FILEBROWSER_TAG=latest

docker-compose.yml:

version: '2.3'
services:

  #
  # Platform Data
  #
  # Get data from dockerhub to run various services
  #

  platform_data:
    image: jolibrain/platform_data:latest
    user: ${CURRENT_UID}
    volumes:
      - ${DD_PLATFORM}:/platform


  #
  # Deepdetect
  #

  deepdetect:
    image: jolibrain/deepdetect_${DD_SERVER_IMAGE}:${DD_SERVER_TAG}
    runtime: nvidia
    restart: always
    volumes:
      - ${DD_PLATFORM}:/opt/platform

  #
  # Platform UI
  #
  # modify port 80 to change facade port
  #

  platform_ui:
    image: jolibrain/platform_ui:${DD_PLATFORM_UI_TAG}
    restart: always
    ports:
      - '${DD_PORT:-1912}:80'
    links:
      - jupyter:jupyter
      - deepdetect:deepdetect
      - gpustat_server
      - filebrowser
      - dozzle
    volumes:
      - ./config/nginx/nginx.conf:/etc/nginx/nginx.conf
      - ${DD_PLATFORM}:/opt/platform
      - ./config/platform_ui/config.json:/usr/share/nginx/html/config.json
      - ./.env:/usr/share/nginx/html/version

  #
  # Jupyter notebooks
  #

  jupyter:
    image: jolibrain/jupyter_dd_notebook:${DD_JUPYTER_TAG}
    runtime: nvidia
    user: root
    environment:
      - JUPYTER_LAB_ENABLE=yes
      - NB_UID=${MUID}
    volumes:
      - ${DD_PLATFORM}:/opt/platform
      - ${DD_PLATFORM}/notebooks:/home/jovyan/work

  #
  # gpustat-server
  #
  gpustat_server:
    image:  jolibrain/gpustat_server
    runtime: nvidia

  #
  # filebrowser
  #
  filebrowser:
    image: jolibrain/filebrowser:${DD_FILEBROWSER_TAG}
    restart: always
    user: ${CURRENT_UID}
    volumes:
      - ${DD_PLATFORM}/data:/srv/data

  #
  # real-time log viewer for docker containers
  #
  dozzle:
    image: amir20/dozzle
    restart: always
    environment:
      - DOZZLE_BASE=/docker-logs
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

Update to dd_widgets:

docker exec -it jupyter_dd_notebook /bin/bash

apt update -y; apt install vim git -y; \
    git clone https://github.com/jolibrain/dd_widgets; \
    rm -rf /opt/conda/lib/python3.8/site-packages/dd_widgets/; \
    mv dd_widgets/dd_widgets/ /opt/conda/lib/python3.8/site-packages/

---

My setup uses GPUID = 0 and Engine set to DEFAULT.

Aug 17 '22 12:08 iamdroppy