
About hidden layer dimension change

Open davidyang180 opened this issue 3 years ago • 1 comments

Hi! While debugging the code, I noticed that the hidden layer dimension is changed to 288 when the multi-frame training strategy is used. Why is this necessary? It does not seem to be mentioned in the paper.

davidyang180, Sep 02 '22 04:09

This was necessary to apply a spatiotemporal encoding of the input pixels. VisTR did something similar for Video Instance Segmentation and increased its hidden size to 384. The spatiotemporal encoding of height, width, and time requires the hidden size to be divisible by three. There are some additional constraints as well, for example that the hidden size must be divisible by the number of attention heads. If you need to stick with a hidden size of 256, you could try a learned temporal encoding, as done in this project.

timmeinhardt, Sep 02 '22 15:09
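
The divisibility constraints described in the reply can be checked with a small, illustrative Python sketch. It is not taken from the TrackFormer or VisTR code. The helper names `check_hidden_dim` and `sine_encoding_3d` are hypothetical, the default of 8 attention heads is an assumption, and the extra requirement that each axis gets an even number of channels is an assumption that holds for sine/cosine style encodings that pair one sine with one cosine per frequency.

```python
import math


def check_hidden_dim(hidden_dim, num_heads=8, num_axes=3):
    """Return a list of reasons why hidden_dim is unsuitable (empty if fine).

    num_heads=8 is an assumed default, not a value stated in the thread.
    """
    problems = []
    if hidden_dim % num_axes != 0:
        problems.append(f"{hidden_dim} is not divisible by {num_axes} axes")
    elif (hidden_dim // num_axes) % 2 != 0:
        # Assumption: each axis needs sin/cos pairs, so an even channel count.
        problems.append(f"{hidden_dim // num_axes} channels per axis is odd")
    if hidden_dim % num_heads != 0:
        problems.append(f"{hidden_dim} is not divisible by {num_heads} heads")
    return problems


def sine_encoding_3d(t, y, x, hidden_dim, temperature=10000.0):
    """Sinusoidal encoding of one (time, height, width) position.

    Each axis receives hidden_dim // 3 channels, filled with sin/cos pairs.
    The channel ordering here is illustrative only.
    """
    problems = check_hidden_dim(hidden_dim)
    if problems:
        raise ValueError("; ".join(problems))
    per_axis = hidden_dim // 3
    encoding = []
    for pos in (t, y, x):
        for i in range(0, per_axis, 2):
            freq = temperature ** (i / per_axis)
            encoding.append(math.sin(pos / freq))
            encoding.append(math.cos(pos / freq))
    return encoding


if __name__ == "__main__":
    for dim in (256, 288, 384):
        print(dim, check_hidden_dim(dim) or "ok")
```

Under these assumptions, 288 and 384 pass both checks, while 256 fails only the divisibility-by-three check, which is consistent with the suggestion to use a learned temporal encoding if the hidden size must stay at 256.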