Specific database

Open vanessasidrim opened this issue 2 years ago • 17 comments

Is it possible to run SP-NAS on a custom dataset (that is, something other than MS COCO, Pascal VOC, ...)?

vanessasidrim avatar Jun 02 '22 11:06 vanessasidrim


Custom datasets cannot be used directly.

There are two options:

  1. Implement your dataset class and register it into Vega.
  2. Or convert your dataset to Coco format.

zhangjiajin avatar Jun 06 '22 01:06 zhangjiajin

I converted my dataset to COCO format and managed to run SP-NAS, but the training and validation results (mAP and AP) are all zero. Is it necessary to change the metrics code as well? Does the implementation use the MS COCO API to generate these metrics?

vanessasidrim avatar Jun 06 '22 11:06 vanessasidrim


You just need to change the data format.

Please attach the run logs (under <task id>/logs/) to help resolve this issue.

zhangjiajin avatar Jun 06 '22 11:06 zhangjiajin


It is possible that the number of classes does not match the pre-trained model. Adjust the number of classes, run fine-tuning only, and check whether the precision increases.

            local_base_path: /VEGA/vega/examples/nas/sp_nas/tasks

pipeline: [fine_tune]                 # <-- Only finetune. Check whether the precision increases.

        type: TrainPipeStep

        pretrained_model_file: /VEGA/vega/examples/nas/sp_nas/cache/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
            type: FasterRCNN
            convert_pretrained: True
            num_classes: <classes of your dataset>                 # <- set number of classes
                type: SerialBackbone

        type: Trainer
        epochs: 25                               # <-- fine tune 25 epochs
        # with_train: False                    # <-- disable this parameter
            type: SGD
                lr: 0.02
                momentum: 0.9
                weight_decay: !!float 1e-4
            type: WarmupScheduler
            by_epoch: False
                warmup_type: linear
                warmup_iters: 1000
                warmup_ratio: 0.001
                    type: MultiStepLR
                    by_epoch: True
                        milestones: [ 10, 20 ]
                        gamma: 0.1
            type: SumLoss
            type: coco
                anno_path: /VEGA/isolador_coco/annotations/instances_val2017.json

        type: CocoDataset
            data_root: /VEGA/isolador_coco/
            batch_size: 4
            img_prefix: "2017"
            ann_prefix: instances

zhangjiajin avatar Jun 07 '22 01:06 zhangjiajin

I ran with this configuration and got the following error: Unexpected key(s) in state_dict for conversion: roi_heads.box_predictor.cls_score.weight torch.Size([91, 1024]) --> roi_heads.box_predictor.cls_score.weight torch.Size([1, 1024])

vanessasidrim avatar Jun 08 '22 10:06 vanessasidrim


Please update the config to specify the head name:

        pretrained_model_file: /VEGA/vega/examples/nas/sp_nas/cache/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
        head: roi_heads                              # <-- specify the head name
            type: FasterRCNN
            convert_pretrained: True
            num_classes: <classes of your dataset>
                type: SerialBackbone

And update the file: vega/networks/

zhangjiajin avatar Jun 08 '22 10:06 zhangjiajin

The same error occurred after the changes:

Unexpected key(s) in state_dict for conversion: roi_heads.box_predictor.cls_score.weight torch.Size([91, 1024]) --> roi_heads.box_predictor.cls_score.weight torch.Size([1, 1024])

vanessasidrim avatar Jun 08 '22 11:06 vanessasidrim

My logs:

Before the code change:

2022-06-09 02:22:12.359 INFO ------------------------------------------------
2022-06-09 02:22:12.359 INFO   Step: fine_tune
2022-06-09 02:22:12.359 INFO ------------------------------------------------
2022-06-09 02:22:12.366 INFO init TrainPipeStep...
2022-06-09 02:22:12.366 INFO TrainPipeStep started...
2022-06-09 02:22:12.798 INFO Model was created.
2022-06-09 02:22:12.799 INFO load model weights from file, weights file=/cache/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
2022-06-09 02:22:12.969 ERROR Failed to run worker, id: 0, message: Unexpected key(s) in state_dict for convert: roi_heads.box_predictor.cls_score.weight torch.Size([91, 1024]) --> roi_heads.box_predictor.cls_score.weight torch.Size([1, 1024])
2022-06-09 02:22:15.13 INFO ------------------------------------------------

After the code change:

2022-06-09 02:28:34.877 INFO ------------------------------------------------
2022-06-09 02:28:34.879 INFO   Step: fine_tune
2022-06-09 02:28:34.879 INFO ------------------------------------------------
2022-06-09 02:28:34.885 INFO init TrainPipeStep...
2022-06-09 02:28:34.885 INFO TrainPipeStep started...
2022-06-09 02:28:35.360 INFO Model was created.
2022-06-09 02:28:35.361 INFO load model weights from file, weights file=/cache/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
2022-06-09 02:28:35.533 INFO Not Swap Keys: ['roi_heads.box_head.fc6.weight', 'roi_heads.box_head.fc6.bias', 'roi_heads.box_head.fc7.weight', 'roi_heads.box_head.fc7.bias', 'roi_heads.box_predictor.cls_score.weight', 'roi_heads.box_predictor.cls_score.bias', 'roi_heads.box_predictor.bbox_pred.weight', 'roi_heads.box_predictor.bbox_pred.bias']
loading annotations into memory...
Done (t=0.67s)
creating index...
index created!
loading annotations into memory...
Done (t=0.67s)
creating index...
index created!
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
loading annotations into memory...
Done (t=0.03s)
creating index...
index created!
2022-06-09 02:28:44.695 INFO flops: 177.56315298500002 , params:41347.156
/pytorch/aten/src/THCUNN/ void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed. 
      ^-- This issue is caused by a mismatch between the number of categories in the dataset and the number of categories in the configuration file. This is an expected log.
2022-06-09 02:28:45.618 ERROR Failed to run worker, id: 0, message: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
2022-06-09 02:28:47.628 INFO ------------------------------------------------
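The "Not Swap Keys" line in the second log lists only parameters under roi_heads, which suggests that once head: roi_heads is set, the pretrained weights for that head are left out of the load, so the 91-class and 1-class predictor shapes no longer collide. The following is a minimal sketch of that idea, not Vega's actual implementation: split_by_head is a hypothetical helper, and the keys and shapes in the sample dictionary are illustrative stand-ins for real tensors.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def split_by_head(checkpoint: Dict[str, Shape], head: str):
    """Partition checkpoint entries into (loadable, skipped) by head prefix."""
    prefix = head + "."
    loadable = {k: v for k, v in checkpoint.items() if not k.startswith(prefix)}
    skipped = {k: v for k, v in checkpoint.items() if k.startswith(prefix)}
    return loadable, skipped


# Shapes stand in for tensors so the sketch runs without PyTorch.
checkpoint = {
    "backbone.body.conv1.weight": (64, 3, 7, 7),             # illustrative backbone entry
    "roi_heads.box_head.fc6.weight": (1024, 12544),          # illustrative shape
    "roi_heads.box_predictor.cls_score.weight": (91, 1024),  # 91-class shape from the error above
}

loadable, skipped = split_by_head(checkpoint, "roi_heads")
print("load:", sorted(loadable))
print("skip:", sorted(skipped))
```

Skipping by prefix rather than by shape mismatch matches the log, where roi_heads.box_head.fc6 is listed even though its shape does not depend on the number of classes.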

The issue has been resolved.

Check whether the configuration file contains:

head: roi_heads

Also check that the file vega/networks/ was replaced correctly.

  1. Find Vega's location.
~/repo/automl$ pip3 show noah-vega
Name: noah-vega
Version: 1.8.0
Summary: AutoML Toolkit
Author: Huawei Noah's Ark Lab
License: Apache License 2.0
Location: /home/user/.local/lib/python3.7/site-packages      <--- here
Requires: click, distributed, numpy, opencv-python, pandas, pillow, psutil, PyYAML, pyzmq, scikit-learn, scipy, tensorboardX, thop
  2. Replace the following file:

zhangjiajin avatar Jun 09 '22 02:06 zhangjiajin

I managed to run it, but the results are always: current valid perfs [mAP: -1.000, AP_small: -1.000, AP_medium: -1.000, AP_large: -1.000], best valid perfs [mAP: -1.000, AP_small: -1.000, AP_medium: -1.000, AP_large: -1.000]

vanessasidrim avatar Jun 16 '22 14:06 vanessasidrim


Is this output from the NAS phase or the fine-tune phase?

zhangjiajin avatar Jun 21 '22 11:06 zhangjiajin

This is the output of the fine-tune phase.

vanessasidrim avatar Jun 21 '22 11:06 vanessasidrim


That is because the predictions did not hit any targets, so the accuracy is reported as -1. The dataset may be labeled incorrectly.

zhangjiajin avatar Jun 21 '22 11:06 zhangjiajin

Could you tell me if the segmentation values impact the calculation of these metrics?

My dataset was in VOC format, so I converted it to COCO format. The segmentation information did not exist in my data but is mandatory in COCO. Since I am only interested in detection, I inserted random values for this key in the .json file.

vanessasidrim avatar Jun 21 '22 11:06 vanessasidrim


The segmentation values do not impact the calculation of these metrics.

We found that the number of classes also needs to be set in the dataset configuration, as shown below:

        type: CocoDataset
            data_root: /VEGA/isolador_coco/
            num_classes: 1          # <--- here
            batch_size: 4
            img_prefix: "2017"
            ann_prefix: instances

We are also trying to convert the VOC format to the COCO format to see if there are other issues.
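On the segmentation question: for detection-only data, one option is to leave the segmentation list empty instead of filling it with random values, and to derive area from the bounding box, since pycocotools reads the ground-truth area field when bucketing objects into small, medium and large. The sketch below assumes a COCO-style dictionary already loaded from the .json file. make_detection_only is a hypothetical helper and the sample annotation is made up.

```python
import copy


def make_detection_only(coco: dict) -> dict:
    """Return a copy whose annotations have no segmentation and a bbox-derived area."""
    out = copy.deepcopy(coco)
    for ann in out.get("annotations", []):
        _x, _y, width, height = ann["bbox"]  # COCO boxes are [x, y, width, height]
        ann["segmentation"] = []             # unused for box metrics
        ann["area"] = width * height         # box area replaces the mask area
        ann.setdefault("iscrowd", 0)
    return out


# Made-up annotation with random segmentation values, as described above.
sample = {
    "annotations": [
        {"id": 1, "image_id": 0, "category_id": 1,
         "bbox": [10, 20, 30, 40], "segmentation": [[5, 7, 9, 11]]},
    ]
}

cleaned = make_detection_only(sample)
print(cleaned["annotations"][0])
```

The input is copied rather than modified in place, so the original annotations can still be compared against the cleaned ones.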

zhangjiajin avatar Jun 22 '22 12:06 zhangjiajin


I used the voc2coco tool to convert BCCD_Dataset to COCO format.

Then I changed the image IDs from strings to integers, in both the id and image_id fields.

    "images": [
            "file_name": "BloodImage_00000.jpg",
            "height": 480,
            "width": 640,
            "id": 0
    "annotations": [
            "area": 46400,
            "iscrowd": 0,
            "bbox": [
            "category_id": 3,
            "ignore": 0,
            "segmentation": [],
            "image_id": 0,
            "id": 1
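The ID change described above can be scripted. This is a minimal sketch, assuming the converter emitted string image IDs and that the annotations were already loaded into a dictionary. intify_ids is a hypothetical helper, and the string IDs in the sample are assumed for illustration. It numbers images from 0 and annotations from 1, as in the excerpt.

```python
def intify_ids(coco: dict) -> dict:
    """Renumber image ids from 0 and annotation ids from 1, keeping the links intact."""
    id_map = {img["id"]: new_id for new_id, img in enumerate(coco["images"])}
    images = [{**img, "id": id_map[img["id"]]} for img in coco["images"]]
    annotations = [
        {**ann, "image_id": id_map[ann["image_id"]], "id": new_id}
        for new_id, ann in enumerate(coco["annotations"], start=1)
    ]
    return {**coco, "images": images, "annotations": annotations}


# Assumed converter output: image ids are strings.
raw = {
    "images": [
        {"file_name": "BloodImage_00000.jpg", "id": "BloodImage_00000"},
        {"file_name": "BloodImage_00001.jpg", "id": "BloodImage_00001"},
    ],
    "annotations": [
        {"id": "a", "image_id": "BloodImage_00001", "category_id": 3},
        {"id": "b", "image_id": "BloodImage_00000", "category_id": 1},
    ],
}

fixed = intify_ids(raw)
print([img["id"] for img in fixed["images"]])
print([(ann["id"], ann["image_id"]) for ann in fixed["annotations"]])
```

In practice the dictionary would be read with json.load and written back with json.dump once the IDs have been replaced.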

Use the following configuration to perform fine-tuning:

pipeline: [fine_tune]

        type: TrainPipeStep

        pretrained_model_file: /cache/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
        head: roi_heads
            type: FasterRCNN
            convert_pretrained: True
            num_classes: 4
                type: SerialBackbone

        type: Trainer
        epochs: 25
        # with_train: False
            type: SGD
                lr: 0.02
                momentum: 0.9
                weight_decay: !!float 1e-4
            type: WarmupScheduler
            by_epoch: False
                warmup_type: linear
                warmup_iters: 1000
                warmup_ratio: 0.001
                    type: MultiStepLR
                    by_epoch: True
                        milestones: [ 10, 20 ]
                        gamma: 0.1
            type: SumLoss
            type: coco
                anno_path: /datasets/voc_coco/bccd_coco/annotations/instances_val2017.json

        type: CocoDataset
            data_root: /datasets/voc_coco/bccd_coco/
            batch_size: 4
            img_prefix: "2017"
            ann_prefix: instances
            num_classes: 3
            test_size: 1

Note that num_classes is 4 in model_desc but 3 in dataset, because the category IDs in the dataset's annotation file are 1, 2, and 3, and do not start from 0.
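The relationship between the two values can be written down as a small check. The rule below (dataset count = number of distinct category IDs, model count = largest ID + 1) is an assumption that reproduces the 3 and 4 used in this configuration. It is not taken from Vega's source. suggest_num_classes is a hypothetical helper and the category names are illustrative.

```python
def suggest_num_classes(categories):
    """Derive both num_classes values from a COCO 'categories' list."""
    ids = sorted({cat["id"] for cat in categories})
    return {
        "dataset": len(ids),        # number of labelled categories
        "model_desc": ids[-1] + 1,  # room for every id from 0 up to the largest
    }


# IDs start at 1, as in the converted BCCD annotations; names are illustrative.
categories = [
    {"id": 1, "name": "class_a"},
    {"id": 2, "name": "class_b"},
    {"id": 3, "name": "class_c"},
]

print(suggest_num_classes(categories))
```

If the largest category ID is not smaller than the model's num_classes, the loss kernel can fail with the `t >= 0 && t < n_classes` assertion seen earlier in this thread.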

zhangjiajin avatar Jun 23 '22 06:06 zhangjiajin


In the 14th epoch, the gradient explodes, and all of the metrics are -1.

2022-06-23 02:26:53.528 INFO worker id [0], epoch [13/25], current valid perfs [mAP: 36.574, AP50: 77.358, AP_small: 4.000, AP_medium: 24.747, AP_large: 49.130], best valid perfs [mAP: 52.552, AP50: 82.835, AP_small: 9.299, AP_medium: 37.015, AP_large: 62.955]
2022-06-23 02:26:55.360 INFO worker id [0], epoch [14/25], train step [ 0/51], loss [   0.733,    0.733], lr [   0.0132468],  time pre batch [0.992s] , total mean time per batch [0.992s]
2022-06-23 02:27:05.961 INFO worker id [0], epoch [14/25], train step [10/51], loss [   0.776,    0.732], lr [   0.0134466],  time pre batch [0.986s] , total mean time per batch [1.006s]
2022-06-23 02:27:16.724 INFO worker id [0], epoch [14/25], train step [20/51], loss [274910.344, 13100.794], lr [   0.0136464],  time pre batch [0.998s] , total mean time per batch [1.006s]
2022-06-23 02:27:27.158 INFO worker id [0], epoch [14/25], train step [30/51], loss [     nan,      nan], lr [   0.0138462],  time pre batch [0.970s] , total mean time per batch [1.006s]
2022-06-23 02:27:38.9 INFO worker id [0], epoch [14/25], train step [40/51], loss [     nan,      nan], lr [   0.0140460],  time pre batch [1.000s] , total mean time per batch [1.006s]
2022-06-23 02:27:48.928 INFO worker id [0], epoch [14/25], train step [50/51], loss [     nan,      nan], lr [   0.0142458],  time pre batch [1.006s] , total mean time per batch [1.006s]
2022-06-23 02:27:58.31 INFO worker id [0], epoch [14/25], current valid perfs [mAP: -1.000, AP_small: -1.000, AP_medium: -1.000, AP_large: -1.000], best valid perfs [mAP: 52.552, AP50: 82.835, AP_small: 9.299, AP_medium: 37.015, AP_large: 62.955]

Halve the learning rate (from 0.02 to 0.01):

            type: SGD
                lr: 0.01
                momentum: 0.9
                weight_decay: !!float 1e-4
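The lr column in the epoch-14 log is still rising, which suggests that the run was still inside the 1000-iteration linear warmup when the loss diverged. The sketch below uses the common linear-warmup rule with the warmup_iters and warmup_ratio values from the configuration. Both the formula and the starting iteration (662) are assumptions fitted to the logged values, not read from Vega's source.

```python
def linear_warmup_lr(base_lr, iteration, warmup_iters=1000, warmup_ratio=0.001):
    """Learning rate under linear warmup; equals base_lr once warmup is over."""
    if iteration >= warmup_iters:
        return base_lr
    k = (1 - iteration / warmup_iters) * (1 - warmup_ratio)
    return base_lr * (1 - k)


FIRST_ITER_EPOCH_14 = 662  # inferred by fitting the log, an assumption

for step in (0, 10, 20):
    iteration = FIRST_ITER_EPOCH_14 + step
    original = linear_warmup_lr(0.02, iteration)
    halved = linear_warmup_lr(0.01, iteration)
    print(f"step {step:2d}: lr=0.02 -> {original:.7f}, lr=0.01 -> {halved:.7f}")
```

With a base rate of 0.02 this reproduces the logged 0.0132468, 0.0134466 and 0.0136464. With 0.01, the rate at the point where the loss diverged would be about 0.0068.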

Training now succeeds:

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.537
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.797
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.615
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.147
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.368
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.653
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.375
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.577
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.621
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.157
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.485
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.725
2022-06-23 03:07:05.383 INFO worker id [0], epoch [25/25], current valid perfs [mAP: 53.677, AP50: 79.720, AP_small: 14.728, AP_medium: 36.768, AP_large: 65.289], best valid perfs [mAP: 54.138, AP50: 80.392, AP_small: 13.985, AP_medium: 37.537, AP_large: 65.649]
2022-06-23 03:07:06.133 INFO flops: 177.578512985 , params:41362.531
2022-06-23 03:07:06.133 INFO Finished the unified trainer successfully.
2022-06-23 03:07:08.335 INFO ------------------------------------------------
2022-06-23 03:07:08.335 INFO   Pipeline end.
2022-06-23 03:07:08.335 INFO 
2022-06-23 03:07:08.335 INFO   task id: 0623.023919.824
2022-06-23 03:07:08.335 INFO   output folder: /data/tasks/0623.023919.824/output
2022-06-23 03:07:08.335 INFO 
2022-06-23 03:07:08.336 INFO   running time:
2022-06-23 03:07:08.336 INFO          fine_tune:  0:27:44  [2022-06-23 02:39:21.599546 - 2022-06-23 03:07:06.334006]
2022-06-23 03:07:08.336 INFO 
2022-06-23 03:07:08.343 INFO   result:
2022-06-23 03:07:08.343 INFO     0:  {'flops': 177.578512985, 'params': 41362.531, 'mAP': 54.137894672785315, 'AP50': 80.39203084795886, 'AP_small': 13.985148514851486, 'AP_medium': 37.537002484435, 'AP_large': 65.64869730651068}
2022-06-23 03:07:08.344 INFO ------------------------------------------------

zhangjiajin avatar Jun 23 '22 06:06 zhangjiajin