MultiNet: I do not understand the numbers in the paper (convolution and concatenation)

While reading the paper, I came across some numbers that I do not understand.

First, for classification, the original 2016 paper used a 1x1 convolution, but the 2018 paper (Figure 2) shows a 3x3 convolution. Is there a special reason for the change? In addition, the top of page 4 says 'we first apply a 1x1 convolution with 30 channels', which makes it even more confusing. Which number is correct?

Second, I would like to know why the concatenated features in the Detection Decoder in Figure 2 of the 2018 paper are shown as 39x12x1526. By my calculation, the ROI Align output has 128*8=1024 channels (I also wonder why the factor is 8 rather than 9, excluding the existing result in the middle), the Bottleneck block has 500 channels, and the Prediction has 6 channels, so the concatenated result should be 1024+500+6=1530. I would be very grateful if you could point out where I went wrong. I have been thinking about this number for a long time, but I cannot reach any other conclusion.
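
To make the mismatch concrete, here is my count written out as a small Python check. It is only a sketch of the arithmetic above: the channel numbers are my own reading of Figure 2 (an assumption, not confirmed against the authors' code), and the variable names are made up for illustration.

```python
# Channel counts as I read them from Figure 2 (assumptions, not verified
# against the authors' implementation).
roi_align_channels = 128 * 8   # 128 channels pooled at 8 positions -> 1024
bottleneck_channels = 500      # hidden features from the Bottleneck block
prediction_channels = 6        # first-stage Prediction output

# Concatenation along the channel axis simply adds the channel counts.
expected_channels = roi_align_channels + bottleneck_channels + prediction_channels

figure_channels = 1526         # value printed in Figure 2 (39x12x1526)

print("my count:   ", expected_channels)
print("figure:     ", figure_channels)
print("difference: ", expected_channels - figure_channels)
```

By this count I get 4 channels more than the figure shows, and I cannot tell which of the three terms is off.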

I look forward to your reply. Thank you.

Jul 16 '18 01:07 rachel1994

I have the same question. Have you solved this problem?

Oct 12 '21 09:10 zhoupan9109

Great job! But I still have a question. I can't understand the number either. According to the paper, the features are transformed from (156×48×128) to (39×12×1020) using ROI Align, and I am confused by this step. If you could expand on it, I would appreciate it. Thanks a lot in advance.
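
To show what confuses me, here is a quick Python check of the shapes. It is only a sketch: the two shapes are the ones quoted above, the 500 and 6 channel counts are taken from the original question (I have not verified them), and all variable names are my own.

```python
# Shapes quoted from the paper: (width, height, channels).
in_shape = (156, 48, 128)
out_shape = (39, 12, 1020)

# Spatial reduction in each direction.
width_factor = in_shape[0] / out_shape[0]    # 156 / 39
height_factor = in_shape[1] / out_shape[1]   # 48 / 12

# If 128 channels were pooled at a whole number of positions,
# this ratio would be an integer.
channel_ratio = out_shape[2] / in_shape[2]   # 1020 / 128

# Bottleneck (500) and Prediction (6) channels are taken from the
# original question; they are assumptions here.
total_channels = out_shape[2] + 500 + 6

print(width_factor, height_factor)   # spatial downsampling factors
print(channel_ratio)                 # not a whole number
print(total_channels)                # compare with 1526 in Figure 2
```

The spatial size shrinks by a factor of 4 in both directions, and 1020+500+6 does add up to the 1526 in Figure 2, but 1020 is not a multiple of 128, so I do not see how the ROI Align output gets that channel count.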

Jan 29 '22 03:01 HerrYu123