
Confused about the format of your labeled dataset CEPDOF

Open cs-heibao opened this issue 4 years ago • 15 comments

The GT bounding-box format is cx, cy, w, h, angle. Do all five values correspond to the rotated object in the image? And how exactly is the angle defined?

cs-heibao avatar Sep 12 '20 08:09 cs-heibao

I can't fully understand your question, but I guess you are asking about the definition of the angle in the ground-truth bboxes. The angle is defined as the number of degrees the bounding box is rotated clockwise. For example, in the following figure, the angle should be around 60 or -120 (degrees); angle=60 and angle=-120 are equivalent. (figure omitted: example of a rotated bounding box)

duanzhiihao avatar Sep 12 '20 08:09 duanzhiihao

@duanzhiihao hi, I mean: the angle is measured between which side (bw or bh) and which axis (x or y)? And do cx, cy, w, h belong to the rotated object or the original object?

cs-heibao avatar Sep 12 '20 14:09 cs-heibao

To be explicit, a rotated bounding box is described by cx, cy, w, h, angle. h is defined as the longer side of the bounding box; in other words, h is always greater than or equal to w.

angle is between h and the y-axis (clockwise). Equivalently, it is also between w and the x-axis (clockwise).

cx, cy, w, h belong to the rotated object.
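
For illustration, here is a minimal sketch (not the repository's own utility code) that converts a (cx, cy, w, h, angle) box into its four corner points under this convention; treat the coordinate handling as an assumption and double-check it against RAPiD's own utilities:

import numpy as np

def rbox_to_corners(cx, cy, w, h, angle_deg):
    # angle_deg: clockwise rotation in degrees, measured between h and the y-axis.
    a = np.deg2rad(angle_deg)
    # In image coordinates (y pointing down), this matrix rotates points clockwise.
    rot = np.array([[np.cos(a), -np.sin(a)],
                    [np.sin(a),  np.cos(a)]])
    # Corners of the un-rotated box centered at the origin: w along x, h along y.
    corners = np.array([[-w / 2, -h / 2],
                        [ w / 2, -h / 2],
                        [ w / 2,  h / 2],
                        [-w / 2,  h / 2]])
    return corners @ rot.T + np.array([cx, cy])

# Example: a 20x60 box centered at (100, 100), rotated 60 degrees clockwise.
print(rbox_to_corners(100, 100, 20, 60, 60))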

duanzhiihao avatar Sep 15 '20 04:09 duanzhiihao

@duanzhiihao thanks, I got it, and I also visualized the labeled images. Another question I hope you can give some guidance on: I downloaded the CEPDOF dataset and the pretrained model and tried training with train.py, but the loss seems unusual. Should I change some parameters?

Total time: 1:30:47.420144, iter: 0:00:10.916674, epoch: 2:48:08.225082
[Iteration 497] [learning rate 9.96e-05] [Total loss 86.83] [img size 512]
16_total1e-16objects_loss_xy: 0.0, loss_wh: 0.0, loss_angle: 0.0, conf: 0.12520018219947815
32_totalTrueobjects_loss_xy: 12.013134002685547, loss_wh: 0.28178492188453674, loss_angle: 2.6057467460632324, conf: 13.017786026000977
64_totalTrueobjects_loss_xy: 19.2332820892334, loss_wh: 0.3951633870601654, loss_angle: 4.4206318855285645, conf: 35.078147888183594
Max GPU memory usage: 3.3762731552124023 GigaBytes

Total time: 1:30:56.391230, iter: 0:00:10.912782, epoch: 2:48:04.676472
[Iteration 498] [learning rate 0.0001] [Total loss 84.99] [img size 512]
16_total1e-16objects_loss_xy: 0.0, loss_wh: 0.0, loss_angle: 0.0, conf: 0.037787601351737976
32_totalTrueobjects_loss_xy: 11.480892181396484, loss_wh: 0.27258607745170593, loss_angle: 1.4185596704483032, conf: 7.912797927856445
64_totalTrueobjects_loss_xy: 25.00128936767578, loss_wh: 0.3746016025543213, loss_angle: 5.575854301452637, conf: 33.23590087890625
Max GPU memory usage: 3.3762731552124023 GigaBytes

Total time: 1:31:05.636933, iter: 0:00:10.909455, epoch: 2:48:01.601010
[Iteration 499] [learning rate 0.0001] [Total loss 92.20] [img size 512]
16_totalTrueobjects_loss_xy: 1.1926631927490234, loss_wh: 0.007322967518121004, loss_angle: 0.009164094924926758, conf: 2.004310131072998
32_totalTrueobjects_loss_xy: 6.600027084350586, loss_wh: 0.037587691098451614, loss_angle: 1.4872658252716064, conf: 6.406645774841309
64_totalTrueobjects_loss_xy: 26.779525756835938, loss_wh: 0.48900607228279114, loss_angle: 6.459133625030518, conf: 40.99555969238281
Max GPU memory usage: 3.3762736320495605 GigaBytes

Total time: 1:31:14.235448, iter: 0:00:10.904851, epoch: 2:47:57.342678
[Iteration 500] [learning rate 0.0001] [Total loss 113.01] [img size 512]
16_total1e-16objects_loss_xy: 0.0, loss_wh: 0.0, loss_angle: 0.0, conf: 1.008183479309082
32_totalTrueobjects_loss_xy: 9.287293434143066, loss_wh: 0.08602897822856903, loss_angle: 1.5510506629943848, conf: 16.09341812133789
64_totalTrueobjects_loss_xy: 25.234947204589844, loss_wh: 0.4412153363227844, loss_angle: 8.447026252746582, conf: 51.121826171875
Max GPU memory usage: 3.3762731552124023 GigaBytes

cs-heibao avatar Sep 15 '20 05:09 cs-heibao

The format of the training log does seem unusual. Did you use the latest version of this repository? Or did you modify the following line? https://github.com/duanzhiihao/RAPiD/blob/0a9440f89b9bf9d17a7b66ba5acabc7cd3c9eb7f/models/rapid.py#L309

However, the numbers look fine to me. Can you try to test and visualize the trained model on some CEPDOF images and check the results?
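
If I recall the README correctly, visualizing predictions on a single image looks roughly like the snippet below; the exact parameter names should be double-checked against api.py in your copy of the repo:

from api import Detector

# Load the pretrained RAPiD model (path and argument names are assumptions
# based on the README; verify them against api.py).
detector = Detector(model_name='rapid',
                    weights_path='./weights/pL1_MWHB1024_Mar11_4000.ckpt')

# Run on a single CEPDOF frame and plot the rotated boxes.
detector.detect_one(img_path='./images/exhibition.jpg', visualize=True)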

duanzhiihao avatar Sep 15 '20 06:09 duanzhiihao

@duanzhiihao yes, testing with the pretrained model pL1_MWHB1024_Mar11_4000.ckpt works fine.

cs-heibao avatar Sep 15 '20 06:09 cs-heibao

@duanzhiihao I updated the code to match yours and got the following result, but it's the same. Also, why is the learning rate so small? What does your training loss log look like?

Total time: 0:04:52.641023, iter: 0:00:22.510848, epoch: 5:46:42.661542
[Iteration 11] [learning rate 6.76e-08] [Total loss 168.46] [img size 544]
level_17 total 1 objects: xy/gt 1.428, wh/gt 0.002, angle/gt 0.154, conf 0.384
level_34 total 1 objects: xy/gt 10.540, wh/gt 0.131, angle/gt 2.384, conf 19.865
level_68 total 1 objects: xy/gt 27.822, wh/gt 0.842, angle/gt 11.726, conf 93.668
Max GPU memory usage: 3.7739853858947754 GigaBytes

Total time: 0:05:01.486867, iter: 0:00:21.534776, epoch: 5:31:40.604880
[Iteration 12] [learning rate 7.84e-08] [Total loss 160.44] [img size 544]
level_17 total 0 objects: xy/gt 0.000, wh/gt 0.000, angle/gt 0.000, conf 0.144
level_34 total 1 objects: xy/gt 15.835, wh/gt 0.200, angle/gt 2.414, conf 9.016
level_68 total 1 objects: xy/gt 27.361, wh/gt 0.845, angle/gt 12.563, conf 92.584
Max GPU memory usage: 3.773984909057617 GigaBytes

Total time: 0:05:10.213815, iter: 0:00:20.680921, epoch: 5:18:31.630590
[Iteration 13] [learning rate 9e-08] [Total loss 121.26] [img size 544]
level_17 total 0 objects: xy/gt 0.000, wh/gt 0.000, angle/gt 0.000, conf 10.947
level_34 total 1 objects: xy/gt 16.386, wh/gt 0.183, angle/gt 1.434, conf 14.134
level_68 total 1 objects: xy/gt 16.564, wh/gt 0.470, angle/gt 3.947, conf 57.517
Max GPU memory usage: 3.773984909057617 GigaBytes

Total time: 0:05:18.908381, iter: 0:00:19.931774, epoch: 5:06:59.296779
[Iteration 14] [learning rate 1.02e-07] [Total loss 218.47] [img size 544]
level_17 total 0 objects: xy/gt 0.000, wh/gt 0.000, angle/gt 0.000, conf 0.022
level_34 total 1 objects: xy/gt 9.118, wh/gt 0.101, angle/gt 2.331, conf 29.773
level_68 total 1 objects: xy/gt 37.067, wh/gt 0.961, angle/gt 10.717, conf 128.906
Max GPU memory usage: 3.773984909057617 GigaBytes

Total time: 0:05:27.616841, iter: 0:00:19.271579, epoch: 4:56:49.172433
[Iteration 15] [learning rate 1.16e-07] [Total loss 153.43] [img size 544]
level_17 total 1 objects: xy/gt 3.641, wh/gt 0.050, angle/gt 1.801, conf 7.798
level_34 total 1 objects: xy/gt 6.590, wh/gt 0.022, angle/gt 0.680, conf 11.009
level_68 total 1 objects: xy/gt 34.764, wh/gt 0.561, angle/gt 15.035, conf 71.796
Max GPU memory usage: 3.7739853858947754 GigaBytes

Total time: 0:05:36.311363, iter: 0:00:18.683965, epoch: 4:47:46.116816
[Iteration 16] [learning rate 1.3e-07] [Total loss 135.15] [img size 544]
level_17 total 1 objects: xy/gt 1.351, wh/gt 0.036, angle/gt 0.093, conf 0.208
level_34 total 1 objects: xy/gt 9.287, wh/gt 0.108, angle/gt 1.949, conf 8.007
level_68 total 1 objects: xy/gt 26.259, wh/gt 0.666, angle/gt 8.405, conf 79.191
Max GPU memory usage: 3.7739853858947754 GigaBytes

cs-heibao avatar Sep 15 '20 07:09 cs-heibao

Unfortunately, I can't find my training log. To me, the loss numbers that you showed look reasonable. I guess you expect the loss to be around 0, but that's not the case here. For example, although x=0.5 is the minimizer of binary_cross_entropy(x, 0.5), binary_cross_entropy(0.5, 0.5) is not 0. Also, our loss uses 'sum' reduction (not 'mean'), so it will typically be relatively large. https://github.com/duanzhiihao/RAPiD/blob/0a9440f89b9bf9d17a7b66ba5acabc7cd3c9eb7f/models/rapid.py#L113
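
A quick way to convince yourself of this (a small PyTorch check, not part of the repo):

import torch
import torch.nn.functional as F

# BCE at its minimizer is ln(2) ~= 0.693, not 0.
print(F.binary_cross_entropy(torch.tensor([0.5]), torch.tensor([0.5])))

# With 'sum' reduction the loss also grows with the number of elements.
p = torch.full((1000,), 0.5)
print(F.binary_cross_entropy(p, p, reduction='sum'))  # ~693, even at the minimum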

What I mean is: let training run for a while, then test on CEPDOF images with your trained model. If the detection results look reasonable, then you are fine.

The learning rate is small at the beginning of training and increases over time. You can modify the code here if you prefer a larger learning rate: https://github.com/duanzhiihao/RAPiD/blob/0a9440f89b9bf9d17a7b66ba5acabc7cd3c9eb7f/train.py#L153 https://github.com/duanzhiihao/RAPiD/blob/0a9440f89b9bf9d17a7b66ba5acabc7cd3c9eb7f/train.py#L218
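
As a rough sketch of what the burn-in phase does (the exact exponent and length live in train.py at the links above; the numbers below are only chosen to be consistent with the log you posted, not copied from the code):

# Illustrative warm-up schedule: the learning rate ramps from ~0 up to the
# base value over the first `burn_in` iterations.
base_lr = 0.0001
burn_in = 500

def lr_at(iteration):
    if iteration < burn_in:
        return base_lr * (iteration / burn_in) ** 2  # quadratic ramp-up
    return base_lr

for it in (12, 100, 500, 2000):
    print(it, lr_at(it))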

duanzhiihao avatar Sep 15 '20 07:09 duanzhiihao

@duanzhiihao yes, actually I thought the loss should be around 0, as in my other projects. I appreciate your guidance; I'll train on my own dataset and check the model. Thanks for the great idea behind this project.

cs-heibao avatar Sep 15 '20 07:09 cs-heibao

You are welcome! Please tell me if you have other questions.

duanzhiihao avatar Sep 15 '20 08:09 duanzhiihao

@duanzhiihao another problem: during training, the following error occurs at some iterations:

Traceback (most recent call last):
  File "/*****/RAPiD-master/train.py", line 263, in <module>
    loss = model(imgs, targets, labels_cats=cats)
  File "/*****/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/*****/RAPiD-master/models/rapid.py", line 78, in forward
    boxes_S, loss_S = self.pred_S(detect_S, self.img_size, labels)
  File "/*****/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/*****/RAPiD-master/models/rapid.py", line 285, in forward
    target[b,best_n,truth_j,truth_i,0] = tx_all[b,:n][valid_mask] - tx_all[b,:n][valid_mask].floor()
RuntimeError: copy_if failed to synchronize: device-side assert triggered

cs-heibao avatar Sep 15 '20 09:09 cs-heibao

According to https://github.com/facebookresearch/maskrcnn-benchmark/issues/658#issuecomment-481923633, it's because the learning rate is too large.

duanzhiihao avatar Sep 15 '20 11:09 duanzhiihao

@duanzhiihao hi, with the CEPDOF dataset training runs fine with a batch size of 4 and the learning rate following the scheduler; the error only occurs with my own dataset, even though I prepared it in the CEPDOF format, and the learning rate is actually small at the beginning of training. By the way, how do I reduce the learning rate further? Thanks.

cs-heibao avatar Sep 15 '20 14:09 cs-heibao

@duanzhiihao I've found the problem. My own data differs from the CEPDOF dataset, and after debugging it seems the horizontal flip, vertical flip, and augUtils.rotate operations in augment_PIL can push bounding boxes outside the image. After commenting out those augmentations, training runs well.
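
Alternatively, instead of disabling augmentation, a sanity check like the following (hypothetical code, not from the repo) could drop labels whose centers fall outside the image after augmentation:

import torch

def drop_out_of_image_labels(labels, img_w, img_h):
    # labels: (N, 5) tensor of (cx, cy, w, h, angle) in pixel coordinates.
    cx, cy = labels[:, 0], labels[:, 1]
    keep = (cx >= 0) & (cx < img_w) & (cy >= 0) & (cy < img_h)
    return labels[keep]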

cs-heibao avatar Sep 16 '20 09:09 cs-heibao

@duanzhiihao

Why is the angle 60? Why not -60?

twmht avatar Oct 20 '21 12:10 twmht