
How can I save the model periodically? During training, the coco_eval AP on the val set stays at 0, and total_loss is large

Open 5RJ opened this issue 4 years ago • 13 comments

Hi author, I have a few questions I'd like to ask:

  1. I noticed the project currently saves the model only after training has completely finished. How can I save checkpoints periodically? I installed detectron2 via pip, then added a train function to DefaultTrainer in detectron2/engine/defaults.py, intending to override the train function in TrainerBase. The code is below (based on TrainerBase.train(), with one print added plus the periodic-save lines):

    def train(self, start_iter: int, max_iter: int):
        """
        Args:
            start_iter, max_iter (int): See docs above
        """
        logger = logging.getLogger(__name__)
        logger.info("Starting training from iteration {}".format(start_iter))
        import ipdb; ipdb.set_trace()
        self.iter = self.start_iter = start_iter
        self.max_iter = max_iter

        with EventStorage(start_iter) as self.storage:
            try:
                self.before_train()
                print('!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!', start_iter, max_iter)
                for self.iter in range(start_iter, max_iter):
                    self.before_step()
                    self.run_step()
                    self.after_step()
                    # added: save a checkpoint every 100 iterations
                    if self.iter % 100 == 0:
                        self.checkpointer.save("model_" + str(self.iter + 1))

                # self.iter == max_iter can be used by `after_train` to
                # tell whether the training successfully finished or failed
                # due to exceptions.
                self.iter += 1

            except Exception:
                logger.exception("Exception during training:")
                raise
            finally:
                self.after_train()

However, the print output never shows up and no model gets saved. What is the correct way to do this?

  2. Symptom: during training, the coco_eval AP on the val set stays at 0. Setup: object detection with coco/centernet_res50_coco.yaml; the dataset is prepared in COCO format and trains and tests normally on xingyizhou's CenterNet repo. My changes to cfg in centerX:

    cfg.DATASETS.TRAIN = ("table_aline_train",)
    cfg.DATASETS.TEST = ("table_aline_val",)
    cfg.DATALOADER.NUM_WORKERS = 2
    cfg.SOLVER.MAX_ITER = 30
    cfg.OUTPUT_DIR = "./output/table_aline"
    cfg.SOLVER.IMS_PER_BATCH = 8
    cfg.SOLVER.BASE_LR = 0.00125
    cfg.INPUT.MAX_SIZE_TRAIN = 1024
    cfg.INPUT.MIN_SIZE_TRAIN = 512

In addition, I registered my dataset in the main function with register_coco_instances.
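
For reference, that registration presumably looked something like the minimal sketch below; the annotation and image paths are hypothetical placeholders, only the dataset names are taken from the cfg above:

    from detectron2.data.datasets import register_coco_instances

    # Names must match cfg.DATASETS.TRAIN / cfg.DATASETS.TEST.
    # The json/image paths below are hypothetical placeholders.
    register_coco_instances("table_aline_train", {},
                            "datasets/table_aline/annotations/instances_train2017.json",
                            "datasets/table_aline/train2017")
    register_coco_instances("table_aline_val", {},
                            "datasets/table_aline/annotations/instances_val2017.json",
                            "datasets/table_aline/val2017")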

Training is launched with the author's run.sh script, on 2 GPUs.

Dataset size: train 700+ images, val 80+ images.

The specific problem: during training, the COCO evaluation on the val set always comes out like the log below (originally posted as a screenshot; its contents are reproduced in a follow-up comment).

After 2300+ iterations, total_loss dropped from 1281 to about 6.6, and many of the boxes produced at inference have scores close to 1, but their positions are far outside the image bounds (see the image info below), for example:

    {"image_id": 7, "category_id": 1, "bbox": [-120932.8515625, -51244.3125, 250420.453125, 95695.1640625], "score": 1.0},
    {"image_id": 7, "category_id": 1, "bbox": [-146367.90625, -59846.8046875, 301889.0625, 119286.0078125], "score": 1.0}
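
(Not from the original post, but as a quick sanity check one can flag every prediction whose box falls outside its image. A minimal sketch, assuming detectron2's standard coco_instances_results.json under the OUTPUT_DIR configured above and a hypothetical image_id-to-size lookup:)

    import json

    # Hypothetical image_id -> (width, height) lookup, e.g. built from the
    # "images" section of the annotations file; only image_id 7 shown here.
    image_sizes = {7: (1654, 2339)}

    with open("output/table_aline/inference/coco_instances_results.json") as f:
        preds = json.load(f)

    for p in preds:
        w, h = image_sizes[p["image_id"]]
        x, y, bw, bh = p["bbox"]  # COCO results format: [x, y, width, height]
        if x < 0 or y < 0 or x + bw > w or y + bh > h:
            print("box outside image bounds:", p)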

Debugging tried so far: comparing total_loss against training on the original CenterNet (where the loss converges to around 0.8), I suspected the bboxes loaded by the dataloader might be wrong, so I printed the dataset info. For example, in CenterNet.forward() in centerX/modeling/meta_arch/centernet.py, printing batched_inputs[0] gives:

    {'file_name': '/mnt/maskrcnn-benchmark/datasets/table_aline/train2017/d-27.png', 'height': 2339, 'width': 1654, 'image_id': 174, 'image': tensor([[[170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     ...,
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.]],

    [[170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     ...,
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.]],

    [[170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     ...,
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.]]]), 'instances': Instances(num_instances=2, image_height=723, image_width=512, fields=[gt_boxes: Boxes(tensor([[ 16.7869,  44.9777, 473.3902, 106.7382],
    [ 15.7797, 415.2047, 476.4118, 686.4136]])), gt_classes: tensor([0, 0])])}

In the annotations file, the corresponding entries are:

    {"category_id": 1, "id": 317, "image_id": 174, "iscrowd": 0, "segmentation": [[137.76953125, 1297.650390625, 1509.9000000000015, 1297.650390625, 1509.9000000000015, 2105.5, 137.76953125, 2105.5]], "area": 1108576.0, "bbox": [138.0, 1298.0, 1372.0, 808.0]}
    {"category_id": 1, "id": 316, "image_id": 174, "iscrowd": 0, "segmentation": [[146.541015625, 194.87890625, 1507.0552978515625, 194.87890625, 1507.0552978515625, 379.3728790283203, 146.541015625, 379.3728790283203]], "area": 250240.0, "bbox": [147.0, 195.0, 1360.0, 184.0]},

By calculation, height/image_height ≈ width/image_width. However, the original gt bboxes (in x1,y1,x2,y2 form: [138, 1298, 1510, 2106] and [147, 195, 1507, 379]) and the bboxes in batched_inputs do not follow that same ratio — is this normal? Surprisingly, though, when I uncommented the drawing code in the generate function in centerX/modeling/layers/centernet_gt.py and inspected many of the output images, the box positions were fine. I also noticed that every image in a batch can have a different shape, yet generate only receives the shape of the last image in the batch and uses that shape (after scaling) to produce the ground truth for all images, so that every score map in a batch has the same shape — could this be the root cause? (The original CenterNet resizes all images to one fixed size first, and only then does the downsampling, gt construction, and so on.)
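
(To make the ratio point concrete, here is the scaling arithmetic with the numbers above — a rough check, not from the original post. The leftover mismatch looks like what a random crop/flip would produce rather than a broken resize:)

    # Resize scale implied by the sample: 512 / 1654 ≈ 0.3096 (and 723 / 2339 ≈ 0.3091)
    scale = 512 / 1654

    # Original gt boxes in x1, y1, x2, y2 form, scaled into the 512 x 723 image:
    for box in ([138, 1298, 1510, 2106], [147, 195, 1507, 379]):
        print([round(v * scale, 1) for v in box])
    # -> about [42.7, 401.8, 467.4, 651.9] and [45.5, 60.4, 466.5, 117.3],
    #    the same ballpark as the gt_boxes printed above but shifted, which a
    #    random crop applied before the resize would explain.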

I'm quite stuck now and don't know how to solve this. I would be very grateful if the author, or anyone familiar with this, could point me in the right direction. Many thanks!

5RJ avatar Dec 11 '20 11:12 5RJ

The image doesn't seem to display; its contents are:

    COCOeval_opt.evaluate() finished in 0.16 seconds.
    Accumulating evaluation results...
    COCOeval_opt.accumulate() finished in 0.02 seconds.
    Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
    Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
    Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
    Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.000
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.001
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.001
    [12/11 10:13:53 d2.evaluation.coco_evaluation]: Evaluation results for bbox:
    |   AP  |  AP50 |  AP75 |  APs  |  APm  |  APl  |
    | 0.000 | 0.000 | 0.000 |  nan  |  nan  | 0.000 |
    [12/11 10:13:53 d2.evaluation.coco_evaluation]: Some metrics cannot be computed and is shown as NaN.
    [12/11 10:13:53 d2.engine.defaults]: Evaluation results for table_aline_val in csv format:
    [12/11 10:13:53 d2.evaluation.testing]: copypaste: Task: bbox
    [12/11 10:13:53 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
    [12/11 10:13:53 d2.evaluation.testing]: copypaste: 0.0000,0.0001,0.0000,nan,nan,0.0000

5RJ avatar Dec 11 '20 11:12 5RJ

Thank you for the detailed write-up! Let's work through the problems:

  1. The correct way to save models periodically is cfg.SOLVER.CHECKPOINT_PERIOD; this field controls checkpoint saving, and you only need to add it to the SOLVER section of your yaml (see the sketch after this list):

    SOLVER:
        CHECKPOINT_PERIOD: 2  # save a model every 2 epochs; adjust to your actual needs
  2. It looks like you are using your own private dataset; in that case, remember to set CENTERNET's NUM_CLASSES to your own number of classes.
  3. Possible causes of val AP being 0: 1) try lowering BASE_LR for your own dataset; if it is too large, training may fail to converge even on a simple dataset. 2) it is a bit odd that your image pixel values are all 170, but since you drew the boxes on the images yourself and they looked fine, that part should be OK. 3) centerX is implemented differently from the original CenterNet: I reuse detectron2's random crop, and each batch's shape may differ from the previous batch's, depending on the shapes of the images in the dataset.
  4. You could first try whether training works normally on the original COCO dataset, then compare what changes your own dataset introduces.
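
For item 1: in upstream detectron2, SOLVER.CHECKPOINT_PERIOD is consumed by DefaultTrainer.build_hooks(), which registers a PeriodicCheckpointer hook; note that upstream counts the period in iterations (centerX may interpret it differently, e.g. in epochs as noted above). A minimal sketch of the config route, instead of overriding TrainerBase.train():

    from detectron2.config import get_cfg
    from detectron2.engine import DefaultTrainer

    cfg = get_cfg()
    # ... merge your model/dataset yaml here, e.g. cfg.merge_from_file(...) ...
    cfg.SOLVER.CHECKPOINT_PERIOD = 1000  # upstream detectron2: every 1000 iterations

    trainer = DefaultTrainer(cfg)
    # DefaultTrainer.build_hooks() turns cfg.SOLVER.CHECKPOINT_PERIOD into a
    # hooks.PeriodicCheckpointer(self.checkpointer, ...) hook, so checkpoints
    # are written periodically without touching the training loop itself.
    trainer.resume_or_load(resume=False)
    trainer.train()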

CPFLAME avatar Dec 14 '20 02:12 CPFLAME

@CPFLAME A similar problem also shows up when training on COCO.

lbin avatar Dec 14 '20 07:12 lbin

@lbin That's really bad news; for a while I wondered whether I had gotten something wrong in my own code 0 0.

Could you share your config or your changes? When I run with the default centernet_res18_coco_0.5.yaml the results are normal.

CPFLAME avatar Dec 14 '20 09:12 CPFLAME

With centernet_res18_coco_0.5.yaml, about 2-3 runs out of 10 give an mAP of 0.0000, with nothing changed at all.

lbin avatar Dec 14 '20 09:12 lbin

Ah... this bug troubled me for a long time; at one point I thought I had solved it.

Adding COMMUNISM, or lowering BASE_LR, may make training more stable:

MODEL:
  CENTERNET:
    LOSS:
      COMMUNISM:
        ENABLE: True
        CLS_LOSS: 1.5
        WH_LOSS: 0.3
        OFF_LOSS: 0.1

CPFLAME avatar Dec 14 '20 09:12 CPFLAME

Thanks for the answers, I'll give it a try!

5RJ avatar Dec 15 '20 03:12 5RJ

After setting up the environment per the README, training directly on the COCO dataset (./run.sh) fails: at line 71 of centerX/engine/defaults.py, super(DefaultTrainer, self).__init__(model, data_loader, optimizer) is not defined in the Detectron2 source. Did the author modify the Detectron2 source? @CPFLAME Solved by pip install -U 'git+https://github.com/CPFLAME/detectron2.git'

zc-tx avatar Dec 17 '20 04:12 zc-tx

@zc-tx pip install -U 'git+https://github.com/CPFLAME/detectron2.git' is listed in https://github.com/CPFLAME/centerX/blob/master/README.md#requirements

lbin avatar Dec 17 '20 06:12 lbin

Hi, the results trained with the res50 backbone are rather poor. Are configs for other backbones available? @CPFLAME

Fly-dream12 avatar Dec 24 '20 13:12 Fly-dream12

@Fly-dream12 Currently there are only resnet and regnet; if you need something else, you can add your own network under backbone.

CPFLAME avatar Dec 25 '20 08:12 CPFLAME

@CPFLAME Does this project add a feature loss? I don't seem to see one; could you provide an example?

Fly-dream12 avatar Dec 27 '20 14:12 Fly-dream12

@5RJ Hello, I'd like to train a student model with my own two-class data. How should I use the project for that?

liujia761 avatar Aug 14 '23 06:08 liujia761