
Questions about details of the network

Open zliu950219 opened this issue 2 years ago • 4 comments

Hi,

Thank you for your great work! I have some questions about details of the network.

  1. For the CNN backbone, input images from different cameras are scaled and cropped to different sizes (e.g. front: 800x600 => scale to 256x256 => crop to 224x224; left/right: 800x600 => scale to 160x160 => crop to 128x128; focus: 800x600 => crop to 128x128). Is there a particular reason for the different operations and sizes?
  2. For the backbone, the paper says "We set C = 2048 and (H, W) = (H0/32, W0/32) in experiments." Is there a particular reason for choosing 2048 and dividing by 32?
  3. After the ResNet, a convolution is used to reduce the channels from 2048 to 256. Is there a particular reason for choosing 256?

Thank you! :)

zliu950219, Apr 04 '23 15:04

Hi,

  1. We consider the front view the most important, so we give it the largest size. For the side views, we use a smaller size to reduce the FLOPs of the network. For the focus view, we don't scale the image; we center-crop it directly so that distant traffic light status stays visible.
  2. C = 2048 and (H, W) = (H0/32, W0/32) are simply the output of Stage 4 of a standard ResNet. H0/32 x W0/32 is a suitable resolution for the transformer encoder that follows. A higher resolution would increase the O(N^2) attention computation in the transformer. A lower resolution would cause a large performance drop.
  3. We tried other choices (128, 256, and 384) and found that 256 channels gave the best performance with fewer network parameters.
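
To make point 2 concrete, here is a rough sketch of how many transformer tokens each camera view would produce at a stride of 32, using the crop sizes quoted in the question. It is not taken from the InterFuser codebase: the names `VIEW_CROPS` and `tokens_per_view` are hypothetical, and it counts only the camera views mentioned in this thread.

```python
# Hypothetical helper for illustration; not part of the InterFuser code.
STRIDE = 32  # ResNet stage-4 downsampling factor quoted in the thread

# Crop sizes per camera view as stated in the question: (height, width).
VIEW_CROPS = {
    "front": (224, 224),
    "left": (128, 128),
    "right": (128, 128),
    "focus": (128, 128),
}


def tokens_per_view(crop_hw, stride=STRIDE):
    """Return (feature_h, feature_w, token_count) for one cropped view."""
    h, w = crop_hw
    feat_h, feat_w = h // stride, w // stride
    return feat_h, feat_w, feat_h * feat_w


# One token per spatial position of the stage-4 feature map.
token_counts = {name: tokens_per_view(hw)[2] for name, hw in VIEW_CROPS.items()}
total_tokens = sum(token_counts.values())

print(token_counts)   # front: 7x7 = 49, each 128x128 view: 4x4 = 16
print(total_tokens)   # 49 + 16 + 16 + 16 = 97
```

Under these assumptions the front view contributes about half of the image tokens, which matches the stated intent of giving it the most capacity.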

deepcs233, Apr 04 '23 16:04

Hi,

Thank you for your reply! :)

  1. For the input sizes (e.g. 256, 128, or 160), did you try other sizes?
  2. Did you run any experiments with a higher or lower resolution than dividing by 32 (e.g. H/16 or H/64)?
  3. I understand now. Thank you.

Thank you very much! :)

zliu950219, Apr 05 '23 09:04

Hi, we haven't tried other input sizes or feature resolutions in our experiments. However, H/16 would bring 4x the tokens and 16x the FLOPs, which may make training or inference difficult.
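
The 4x and 16x figures can be checked with a small calculation. This is a sketch under the assumption that the 16x refers to the pairwise O(N^2) attention term; layers that are linear in the token count (projections, MLPs) would grow by about 4x instead. The function name `relative_cost` is hypothetical.

```python
def relative_cost(stride, base_stride=32):
    """Return (token_ratio, attention_ratio) relative to base_stride.

    The token count scales with 1 / stride^2 (both height and width shrink),
    and the number of pairwise attention scores scales with tokens^2.
    """
    token_ratio = (base_stride / stride) ** 2
    attention_ratio = token_ratio ** 2
    return token_ratio, attention_ratio


for stride in (16, 32, 64):
    tokens, attention = relative_cost(stride)
    print(f"H/{stride}: {tokens:g}x tokens, {attention:g}x attention cost")
# H/16: 4x tokens, 16x attention cost
# H/32: 1x tokens, 1x attention cost
# H/64: 0.25x tokens, 0.0625x attention cost
```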

deepcs233, Apr 05 '23 14:04

Thank you very much! :)

zliu950219, Apr 06 '23 08:04