InterFuser
Questions about details of the network
Hi,
Thank you for your great work! I have some questions about the details of the network.
- For the CNN backbone, input images from different cameras are scaled and cropped to different sizes (e.g., front: 800x600 => scaled to 256x256 => cropped to 224x224; left/right: 800x600 => scaled to 160x160 => cropped to 128x128; focus: 800x600 => cropped to 128x128). Are there any special reasons for the different operations and the choice of sizes?
- For the backbone, the paper says "We set C = 2048 and (H;W) = (H0/32 ; W0/32 ) in experiments." Is there any special reason for choosing 2048 and dividing by 32?
- After the ResNet, a convolution is used to reduce the channel count from 2048 to 256. Is there any special reason for choosing 256?
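For reference, the three resize/crop paths above can be sketched in plain Python; the sizes are taken from the question, and the helper below is ours (the actual repo most likely uses torchvision's `Resize`/`CenterCrop`):

```python
# Minimal sketch of the three per-camera preprocessing paths (hypothetical
# helper; sizes come from the question above).
def center_crop_box(h, w, ch, cw):
    """Top-left corner and size of a centered ch x cw crop of an h x w image."""
    top = (h - ch) // 2
    left = (w - cw) // 2
    return top, left, ch, cw

# front: 800x600 -> scaled to 256x256 -> 224x224 center crop
print(center_crop_box(256, 256, 224, 224))   # (16, 16, 224, 224)
# left/right: 800x600 -> scaled to 160x160 -> 128x128 center crop
print(center_crop_box(160, 160, 128, 128))   # (16, 16, 128, 128)
# focus: 800x600 -> 128x128 center crop straight from the frame, no scaling
print(center_crop_box(600, 800, 128, 128))   # (236, 336, 128, 128)
```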
Thank you! :)
Hi,
- For the front view, we think it's the most important view, so we give it the largest size. For the side views, we use a smaller size to reduce the FLOPs of the network. For the focus view, we don't scale it; we directly center-crop it to capture traffic-light status at a distance.
- C = 2048 and (H;W) = (H0/32 ; W0/32 ) is simply the output of Stage 4 of a standard ResNet. H/32 and W/32 is a suitable resolution for the following transformer encoder: a higher resolution would increase the O(N^2) attention cost of the transformer, while a lower resolution would cause a large performance drop.
- We tried other choices (128, 256, and 384) and found that 256 channels gave the best performance with fewer network parameters.
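As a rough sense of what the channel choice costs, here is the parameter count of a 1x1 convolution mapping the 2048-channel ResNet output down to each width mentioned above (simple arithmetic, not measurements; the helper name is ours):

```python
# Parameter count of a 1x1 conv projecting C_in channels to C_out channels
# (hypothetical helper; 2048 is the ResNet Stage-4 width from the thread).
def conv1x1_params(c_in, c_out, bias=True):
    return c_in * c_out + (c_out if bias else 0)

for c in (128, 256, 384):
    print(c, "->", conv1x1_params(2048, c), "params")
# 128 -> 262272, 256 -> 524544, 384 -> 786816
```

The projection cost grows linearly in the output width, and the downstream transformer's cost grows with it too, which is consistent with 256 being a sweet spot between 128 and 384.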
Hi,
Thank you for your reply! :)
- For the sizes (e.g., 256, 128, or 160), did you try other sizes?
- For resolutions higher or lower than dividing by 32, did you run any experiments (e.g., using H/16 or H/64)?
- I understand now. Thank you.
Thank you very much! :)
Hi, we haven't tried other input sizes or resolutions in our experiments. But H/16 would bring 4X tokens and 16X FLOPs, which may make training or inference difficult.
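The 4X-token / 16X-FLOPs figure above follows directly from the quadratic cost of self-attention; a quick check using the 224x224 front crop as the example (helper name is ours):

```python
# Token counts for an h x w crop at a given backbone stride, and the
# resulting self-attention scaling when moving from stride 32 to stride 16.
def tokens(h, w, stride):
    return (h // stride) * (w // stride)

n32 = tokens(224, 224, 32)   # 7 * 7 = 49 tokens at H/32
n16 = tokens(224, 224, 16)   # 14 * 14 = 196 tokens at H/16
print(n16 // n32)                  # 4x tokens
print((n16 * n16) // (n32 * n32))  # attention is O(N^2): 16x pairwise cost
```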
Thank you very much! :)