
Run on Colab

ttcong194 opened this issue 5 years ago • 11 comments

Hi Furkan,

Recently I've been trying to find out how RPNs work, and I came across your repository. It seems interesting.

But when I run your code in colab, I get the following error:

Traceback (most recent call last):
  File "trainer.py", line 69, in
    callbacks=[checkpoint_callback])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1098, in fit
    tmp_logs = train_function(iterator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 780, in call
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2829, in call
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [8,9216] vs. [8]
  [[node gradient_tape/reg_loss/mul/BroadcastGradientArgs (defined at trainer.py:69) ]] [Op:__inference_train_function_12539]

Function call stack: train_function

Have you run it on colab and got the same error?
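For context on the error above: NumPy/TensorFlow broadcasting aligns shapes starting from the trailing dimension, and each aligned pair of dimensions must either be equal or contain a 1. A small pure-Python sketch of that rule, using the shapes from the traceback (this is an illustration of the broadcasting rule, not code from the repo):

```python
def broadcast_compatible(shape_a, shape_b):
    """Check NumPy/TF-style broadcast compatibility: align trailing
    dimensions; each aligned pair must be equal or contain a 1."""
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True

print(broadcast_compatible((8, 9216), (8,)))    # → False: 9216 vs 8, the failing pair
print(broadcast_compatible((8, 9216), (8, 1)))  # → True: compatible after an expand_dims
```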

ttcong194 avatar Nov 09 '20 11:11 ttcong194

Yes, when I first wrote it, I ran the tests on colab. The error you encountered is probably related to broadcasting, most likely caused by a tf version mismatch. I was using tf 2.0 while running the tests, so the code should work on tf 2.0.

FurkanOM avatar Nov 09 '20 12:11 FurkanOM

Does that mean I should downgrade the tensorflow version to 2.0?

ttcong194 avatar Nov 09 '20 12:11 ttcong194

It works when I downgrade tf to version 2.1.0. But do you know what causes the broadcasting problem on the new tf 2.3? I'm a newbie, and your code is fairly complicated to me. :D

ttcong194 avatar Nov 09 '20 14:11 ttcong194

I don't know where it is; I'd have to dig in to find it, but right now I'm working on different projects. If you want to solve the problem(s), you should check all conditional / broadcasting tf operations like tf.where, tf.logical_and, tf.logical_or, etc.

FurkanOM avatar Nov 09 '20 15:11 FurkanOM

I'm pretty sure the issue comes from https://github.com/FurkanOM/tf-rpn/blob/9261162f246b45b41b1408770d8894ee865f8b80/utils/train_utils.py#L183 There, pos_mask has shape (batch_size, total_anchors) while loss_for_all has shape (batch_size,).

Doing

loc_loss = tf.reduce_sum(pos_mask * tf.expand_dims(loss_for_all, axis=-1))

allows tensorflow to broadcast loss_for_all and makes the loss function run. At the same time, I'm not sure this is correct, because it effectively adds each image's total loss once per positive anchor, which yields quite large regressor loss values.
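To see concretely why the values get large with that workaround: after the expand_dims, the per-image scalar is multiplied against the whole mask, so summing counts it once per positive anchor. A toy pure-Python sketch with made-up numbers (not the repo's actual code):

```python
# Toy example: one image, 4 anchors, 3 of them positive
pos_mask = [1.0, 1.0, 1.0, 0.0]  # shape (total_anchors,)
loss_for_all = 2.5               # per-image loss scalar, from shape (batch_size,)

# What tf.reduce_sum(pos_mask * tf.expand_dims(loss_for_all, -1)) does per image:
loc_loss = sum(m * loss_for_all for m in pos_mask)
print(loc_loss)  # → 7.5, i.e. num_positive_anchors * per-image loss
```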

I made my own implementation of what I think it should be, but I'm really not sure it works correctly:

def regressor_loss(t_true, t_pred):
    """Smooth L1 (Huber) loss calculated only on positive anchors.

    Args:
        t_true: the true regressor values
        t_pred: the predicted regressor values
    Returns:
        the loss as a scalar
    """
    smooth_l1 = tf.keras.losses.Huber(reduction=tf.keras.losses.Reduction.NONE)
    batch_size = tf.shape(t_pred)[0]
    t_true = tf.reshape(t_true, [batch_size, -1, 4])
    t_pred = tf.reshape(t_pred, [batch_size, -1, 4])
    loss = smooth_l1(t_true, t_pred)  # shape (batch_size, total_anchors)
    # anchors whose target deltas are all zero are treated as non-positive
    valid = tf.math.reduce_any(tf.not_equal(t_true, 0.0), axis=-1)
    valid = tf.cast(valid, tf.float32)
    loss = tf.reduce_sum(loss * valid, axis=-1)  # per-image loss over positive anchors
    total_pos_boxes = tf.math.maximum(1.0, tf.reduce_sum(valid, axis=-1))
    return tf.math.reduce_mean(tf.truediv(loss, total_pos_boxes))
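The same math without TF, as a pure-Python reference with hypothetical toy inputs, just to make the normalization explicit: per image, the Huber loss is summed over positive anchors and divided by max(1, num_positives), then averaged over the batch. (Keras' Huber with NONE reduction averages over the last axis, hence the division by the box length.)

```python
def huber(x, delta=1.0):
    # Smooth L1 / Huber on a single residual
    ax = abs(x)
    return 0.5 * x * x if ax <= delta else delta * (ax - 0.5 * delta)

def regressor_loss_ref(t_true, t_pred):
    # t_true, t_pred: lists of images, each a list of [dx, dy, dw, dh] per anchor
    per_image = []
    for true_img, pred_img in zip(t_true, t_pred):
        loss, num_pos = 0.0, 0.0
        for true_box, pred_box in zip(true_img, pred_img):
            if any(v != 0.0 for v in true_box):  # positive anchor
                num_pos += 1.0
                # mean over the 4 box deltas, mirroring Keras Huber's last-axis mean
                loss += sum(huber(t - p) for t, p in zip(true_box, pred_box)) / len(true_box)
        per_image.append(loss / max(1.0, num_pos))
    return sum(per_image) / len(per_image)
```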

Hope this helps @ttcong194

Although I'm not sure my implementation is correct either.

lucabtz avatar Nov 09 '20 17:11 lucabtz

@ekardnam @FurkanOM thanks for your help. I try it now.

ttcong194 avatar Nov 10 '20 02:11 ttcong194

If you happen to make it work could you share your results with me @ttcong194 ?

lucabtz avatar Nov 10 '20 17:11 lucabtz

@ekardnam When I change the code to

loc_loss = tf.reduce_sum(pos_mask * tf.expand_dims(loss_for_all, axis=-1))

and run training, the loss is very big at first, and after a few consecutive epochs it stops improving; it fluctuates around a value of 27.
(https://drive.google.com/file/d/17yi4pLA2IY3vUd0-k-kh9H84OTO8hH24/view?usp=sharing)

ttcong194 avatar Nov 11 '20 03:11 ttcong194

And this is the log when I use your regressor_loss function:

Epoch 1/50
  2/627 [..............................] - ETA: 5:42 - loss: 1.4957 - rpn_reg_loss: 0.7583 - rpn_cls_loss: 0.7375WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2407s vs `on_train_batch_end` time: 0.4332s). Check your callbacks.
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2407s vs `on_train_batch_end` time: 0.4332s). Check your callbacks.
627/627 [==============================] - 615s 980ms/step - loss: 0.7975 - rpn_reg_loss: 0.4502 - rpn_cls_loss: 0.3474 - val_loss: 0.6932 - val_rpn_reg_loss: 0.4033 - val_rpn_cls_loss: 0.2899
Epoch 2/50
627/627 [==============================] - 617s 984ms/step - loss: 0.6696 - rpn_reg_loss: 0.3916 - rpn_cls_loss: 0.2781 - val_loss: 0.6377 - val_rpn_reg_loss: 0.3767 - val_rpn_cls_loss: 0.2610
Epoch 3/50
627/627 [==============================] - 618s 985ms/step - loss: 0.6128 - rpn_reg_loss: 0.3560 - rpn_cls_loss: 0.2568 - val_loss: 0.6169 - val_rpn_reg_loss: 0.3649 - val_rpn_cls_loss: 0.2519
Epoch 4/50
627/627 [==============================] - 617s 984ms/step - loss: 0.5720 - rpn_reg_loss: 0.3300 - rpn_cls_loss: 0.2420 - val_loss: 0.5938 - val_rpn_reg_loss: 0.3544 - val_rpn_cls_loss: 0.2393
Epoch 5/50
627/627 [==============================] - 618s 985ms/step - loss: 0.5411 - rpn_reg_loss: 0.3094 - rpn_cls_loss: 0.2317 - val_loss: 0.5825 - val_rpn_reg_loss: 0.3481 - val_rpn_cls_loss: 0.2343
Epoch 6/50
627/627 [==============================] - 618s 985ms/step - loss: 0.5118 - rpn_reg_loss: 0.2904 - rpn_cls_loss: 0.2214 - val_loss: 0.5699 - val_rpn_reg_loss: 0.3442 - val_rpn_cls_loss: 0.2257
Epoch 7/50
627/627 [==============================] - 618s 985ms/step - loss: 0.4905 - rpn_reg_loss: 0.2763 - rpn_cls_loss: 0.2142 - val_loss: 0.5711 - val_rpn_reg_loss: 0.3433 - val_rpn_cls_loss: 0.2279
Epoch 8/50
627/627 [==============================] - 617s 984ms/step - loss: 0.4692 - rpn_reg_loss: 0.2644 - rpn_cls_loss: 0.2048 - val_loss: 0.5641 - val_rpn_reg_loss: 0.3433 - val_rpn_cls_loss: 0.2208
Epoch 9/50
627/627 [==============================] - 618s 985ms/step - loss: 0.4510 - rpn_reg_loss: 0.2533 - rpn_cls_loss: 0.1976 - val_loss: 0.5567 - val_rpn_reg_loss: 0.3397 - val_rpn_cls_loss: 0.2170
Epoch 10/50
627/627 [==============================] - 618s 986ms/step - loss: 0.4352 - rpn_reg_loss: 0.2440 - rpn_cls_loss: 0.1912 - val_loss: 0.5547 - val_rpn_reg_loss: 0.3417 - val_rpn_cls_loss: 0.2130
Epoch 11/50
627/627 [==============================] - 620s 989ms/step - loss: 0.4189 - rpn_reg_loss: 0.2339 - rpn_cls_loss: 0.1850 - val_loss: 0.5509 - val_rpn_reg_loss: 0.3394 - val_rpn_cls_loss: 0.2116
Epoch 12/50
627/627 [==============================] - 621s 990ms/step - loss: 0.4055 - rpn_reg_loss: 0.2263 - rpn_cls_loss: 0.1792 - val_loss: 0.5427 - val_rpn_reg_loss: 0.3336 - val_rpn_cls_loss: 0.2091
Epoch 13/50
627/627 [==============================] - 620s 989ms/step - loss: 0.3905 - rpn_reg_loss: 0.2178 - rpn_cls_loss: 0.1726 - val_loss: 0.5486 - val_rpn_reg_loss: 0.3388 - val_rpn_cls_loss: 0.2098
Epoch 14/50
627/627 [==============================] - 620s 989ms/step - loss: 0.3799 - rpn_reg_loss: 0.2118 - rpn_cls_loss: 0.1682 - val_loss: 0.5414 - val_rpn_reg_loss: 0.3361 - val_rpn_cls_loss: 0.2053
Epoch 15/50
627/627 [==============================] - 620s 989ms/step - loss: 0.3696 - rpn_reg_loss: 0.2068 - rpn_cls_loss: 0.1627 - val_loss: 0.5466 - val_rpn_reg_loss: 0.3372 - val_rpn_cls_loss: 0.2094
Epoch 16/50
627/627 [==============================] - 620s 988ms/step - loss: 0.3574 - rpn_reg_loss: 0.1998 - rpn_cls_loss: 0.1576 - val_loss: 0.5423 - val_rpn_reg_loss: 0.3353 - val_rpn_cls_loss: 0.2069
Epoch 17/50
627/627 [==============================] - 618s 986ms/step - loss: 0.3484 - rpn_reg_loss: 0.1956 - rpn_cls_loss: 0.1528 - val_loss: 0.5459 - val_rpn_reg_loss: 0.3385 - val_rpn_cls_loss: 0.2074

ttcong194 avatar Nov 11 '20 07:11 ttcong194

Yeah, I tried changing to

loc_loss = tf.reduce_sum(pos_mask * tf.expand_dims(loss_for_all, axis=-1))

as well and got those kinds of results too. Using my loss function I also get similar results, although I might have other bugs in my code, since the regressor isn't training correctly.

Have you tried predicting boxes on some image using the loss function I posted?

lucabtz avatar Nov 11 '20 15:11 lucabtz

Try upgrading the repo code with the tf_upgrade_v2 tool provided by Tensorflow:

tf_upgrade_v2 --intree path_to_tf-rpn/ --inplace --reportfile report.txt

More info at https://www.tensorflow.org/guide/upgrade

Uzarel avatar May 24 '21 11:05 Uzarel