Run on Colab
Hi Furkan,
Recently, I've been trying to figure out how RPNs work, and I came across your repository. It seems interesting.
But when I run your code in Colab, I get an error like this:
Traceback (most recent call last):
File "trainer.py", line 69, in
Function call stack: train_function
Have you run it on Colab and gotten the same error?
Yes, when I first wrote this, I tested it on Colab. The error you encountered is probably related to broadcasting and is likely caused by a TF version mismatch. I was using TF 2.0 when I ran the tests, so it should work on TF 2.0.
Does that mean I should downgrade TensorFlow to 2.0?
It works when I downgrade TF to 2.1.0. But do you know what causes the broadcasting problem on the new TF 2.3? I'm a newbie and your code is fairly complicated for me. :D
I don't know exactly where it is; I would have to dig in to find it, but I'm working on different projects now. If you want to solve the problem(s), you should check all the conditional / broadcasting TF operations such as tf.where, tf.logical_and, tf.logical_or, etc.
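For example, a small helper like the sketch below (assert_broadcastable is a made-up name, not something in the repo) can flag incompatible static shapes early, instead of letting a masked op fail deep inside the loss:

import tensorflow as tf

# Hypothetical debugging helper, not part of the repo: fail fast when two
# static shapes cannot broadcast together.
def assert_broadcastable(a, b, label=""):
    try:
        tf.broadcast_static_shape(a.shape, b.shape)  # raises ValueError on mismatch
    except ValueError:
        raise ValueError(f"{label}: shape {a.shape} does not broadcast with {b.shape}")

# Usage: assert_broadcastable(pos_mask, loss_for_all, label="loc_loss")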
I'm pretty sure the issue comes from https://github.com/FurkanOM/tf-rpn/blob/9261162f246b45b41b1408770d8894ee865f8b80/utils/train_utils.py#L183. There, pos_mask has shape (batch_size, total_anchors) while loss_for_all has shape (batch_size,).
Doing
loc_loss = tf.reduce_sum(pos_mask * tf.expand_dims(loss_for_all, axis=-1))
allows TensorFlow to broadcast loss_for_all and makes the loss function run. At the same time, I'm not sure this is correct, because it effectively adds each image's total loss once per positive anchor and yields quite large regressor loss values (which I'm not sure is right).
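For anyone following along, here is a minimal sketch of the shape problem with made-up numbers (batch_size=2 and total_anchors=3 are hypothetical):

import tensorflow as tf

# Made-up shapes: batch_size=2, total_anchors=3
pos_mask = tf.constant([[1., 0., 1.],
                        [0., 1., 0.]])    # shape (2, 3)
loss_for_all = tf.constant([0.5, 0.2])    # shape (2,)

# pos_mask * loss_for_all fails: (2, 3) and (2,) are not broadcast-compatible.
# Expanding loss_for_all to (2, 1) makes the multiply broadcast:
loc_loss = tf.reduce_sum(pos_mask * tf.expand_dims(loss_for_all, axis=-1))
print(loc_loss.numpy())  # 1.2 -- each image's loss is counted once per positive anchor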
I made my own implementation of what I think it should be, but I'm really not sure it works correctly:
import tensorflow as tf

def regressor_loss(t_true, t_pred):
    """The regressor loss.

    Smooth L1 (Huber) loss calculated only on positive anchors.

    Args:
        t_true: the true regressor values
        t_pred: the predicted regressor values
    Returns:
        the loss as a scalar
    """
    smooth_l1 = tf.keras.losses.Huber(reduction=tf.keras.losses.Reduction.NONE)
    batch_size = tf.shape(t_pred)[0]
    # Group each anchor's four box deltas together: (batch_size, total_anchors, 4)
    t_true = tf.reshape(t_true, [batch_size, -1, 4])
    t_pred = tf.reshape(t_pred, [batch_size, -1, 4])
    # With Reduction.NONE, Huber averages over the last axis -> (batch_size, total_anchors)
    loss = smooth_l1(t_true, t_pred)
    # Treat anchors with any non-zero target delta as positive
    valid = tf.math.reduce_any(tf.not_equal(t_true, 0.0), axis=-1)
    valid = tf.cast(valid, tf.float32)
    loss = tf.reduce_sum(loss * valid, axis=-1)  # per-image loss vector
    # Guard against division by zero when an image has no positive anchors
    total_pos_boxes = tf.math.maximum(1.0, tf.reduce_sum(valid, axis=-1))
    return tf.math.reduce_mean(tf.truediv(loss, total_pos_boxes))
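As a quick smoke test, you can call it with random tensors (the shapes below are made up purely for illustration):

# Hypothetical shapes: batch of 2 images, 8 anchors, 4 deltas per anchor
t_true = tf.random.uniform((2, 8, 4))
t_pred = tf.random.uniform((2, 8, 4))
print(regressor_loss(t_true, t_pred).numpy())  # a single scalar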
Hope this helps @ttcong194
Although I'm not sure my implementation is correct either.
@ekardnam @FurkanOM thanks for your help. I'll try it now.
If you happen to make it work, could you share your results with me, @ttcong194?
@ekardnam When I change the code to
loc_loss = tf.reduce_sum(pos_mask * tf.expand_dims(loss_for_all, axis=-1))
and train, the loss is very large at first. After a few epochs the loss stops changing; it fluctuates around 27.
(https://drive.google.com/file/d/17yi4pLA2IY3vUd0-k-kh9H84OTO8hH24/view?usp=sharing)
And this is the log when I use your regressor_loss function:
Epoch 1/50
2/627 [..............................] - ETA: 5:42 - loss: 1.4957 - rpn_reg_loss: 0.7583 - rpn_cls_loss: 0.7375WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2407s vs `on_train_batch_end` time: 0.4332s). Check your callbacks.
627/627 [==============================] - 615s 980ms/step - loss: 0.7975 - rpn_reg_loss: 0.4502 - rpn_cls_loss: 0.3474 - val_loss: 0.6932 - val_rpn_reg_loss: 0.4033 - val_rpn_cls_loss: 0.2899
Epoch 2/50
627/627 [==============================] - 617s 984ms/step - loss: 0.6696 - rpn_reg_loss: 0.3916 - rpn_cls_loss: 0.2781 - val_loss: 0.6377 - val_rpn_reg_loss: 0.3767 - val_rpn_cls_loss: 0.2610
Epoch 3/50
627/627 [==============================] - 618s 985ms/step - loss: 0.6128 - rpn_reg_loss: 0.3560 - rpn_cls_loss: 0.2568 - val_loss: 0.6169 - val_rpn_reg_loss: 0.3649 - val_rpn_cls_loss: 0.2519
Epoch 4/50
627/627 [==============================] - 617s 984ms/step - loss: 0.5720 - rpn_reg_loss: 0.3300 - rpn_cls_loss: 0.2420 - val_loss: 0.5938 - val_rpn_reg_loss: 0.3544 - val_rpn_cls_loss: 0.2393
Epoch 5/50
627/627 [==============================] - 618s 985ms/step - loss: 0.5411 - rpn_reg_loss: 0.3094 - rpn_cls_loss: 0.2317 - val_loss: 0.5825 - val_rpn_reg_loss: 0.3481 - val_rpn_cls_loss: 0.2343
Epoch 6/50
627/627 [==============================] - 618s 985ms/step - loss: 0.5118 - rpn_reg_loss: 0.2904 - rpn_cls_loss: 0.2214 - val_loss: 0.5699 - val_rpn_reg_loss: 0.3442 - val_rpn_cls_loss: 0.2257
Epoch 7/50
627/627 [==============================] - 618s 985ms/step - loss: 0.4905 - rpn_reg_loss: 0.2763 - rpn_cls_loss: 0.2142 - val_loss: 0.5711 - val_rpn_reg_loss: 0.3433 - val_rpn_cls_loss: 0.2279
Epoch 8/50
627/627 [==============================] - 617s 984ms/step - loss: 0.4692 - rpn_reg_loss: 0.2644 - rpn_cls_loss: 0.2048 - val_loss: 0.5641 - val_rpn_reg_loss: 0.3433 - val_rpn_cls_loss: 0.2208
Epoch 9/50
627/627 [==============================] - 618s 985ms/step - loss: 0.4510 - rpn_reg_loss: 0.2533 - rpn_cls_loss: 0.1976 - val_loss: 0.5567 - val_rpn_reg_loss: 0.3397 - val_rpn_cls_loss: 0.2170
Epoch 10/50
627/627 [==============================] - 618s 986ms/step - loss: 0.4352 - rpn_reg_loss: 0.2440 - rpn_cls_loss: 0.1912 - val_loss: 0.5547 - val_rpn_reg_loss: 0.3417 - val_rpn_cls_loss: 0.2130
Epoch 11/50
627/627 [==============================] - 620s 989ms/step - loss: 0.4189 - rpn_reg_loss: 0.2339 - rpn_cls_loss: 0.1850 - val_loss: 0.5509 - val_rpn_reg_loss: 0.3394 - val_rpn_cls_loss: 0.2116
Epoch 12/50
627/627 [==============================] - 621s 990ms/step - loss: 0.4055 - rpn_reg_loss: 0.2263 - rpn_cls_loss: 0.1792 - val_loss: 0.5427 - val_rpn_reg_loss: 0.3336 - val_rpn_cls_loss: 0.2091
Epoch 13/50
627/627 [==============================] - 620s 989ms/step - loss: 0.3905 - rpn_reg_loss: 0.2178 - rpn_cls_loss: 0.1726 - val_loss: 0.5486 - val_rpn_reg_loss: 0.3388 - val_rpn_cls_loss: 0.2098
Epoch 14/50
627/627 [==============================] - 620s 989ms/step - loss: 0.3799 - rpn_reg_loss: 0.2118 - rpn_cls_loss: 0.1682 - val_loss: 0.5414 - val_rpn_reg_loss: 0.3361 - val_rpn_cls_loss: 0.2053
Epoch 15/50
627/627 [==============================] - 620s 989ms/step - loss: 0.3696 - rpn_reg_loss: 0.2068 - rpn_cls_loss: 0.1627 - val_loss: 0.5466 - val_rpn_reg_loss: 0.3372 - val_rpn_cls_loss: 0.2094
Epoch 16/50
627/627 [==============================] - 620s 988ms/step - loss: 0.3574 - rpn_reg_loss: 0.1998 - rpn_cls_loss: 0.1576 - val_loss: 0.5423 - val_rpn_reg_loss: 0.3353 - val_rpn_cls_loss: 0.2069
Epoch 17/50
627/627 [==============================] - 618s 986ms/step - loss: 0.3484 - rpn_reg_loss: 0.1956 - rpn_cls_loss: 0.1528 - val_loss: 0.5459 - val_rpn_reg_loss: 0.3385 - val_rpn_cls_loss: 0.2074
Yeah I tried changing to
loc_loss = tf.reduce_sum(pos_mask * tf.expand_dims(loss_for_all, axis=-1))
as well and got that kind of result too. Using my own loss function I get similar results, although I might have some other bugs in my code, since the regressor isn't training correctly.
Have you tried predicting boxes on some images using the loss function I posted?
Try upgrading the repo code with the tf_upgrade_v2 tool provided by TensorFlow:
tf_upgrade_v2 --intree path_to_tf-rpn/ --inplace --reportfile report.txt
More info at https://www.tensorflow.org/guide/upgrade