About hidden layer dimension change
Hi! While debugging the code, I noticed that the hidden layer dimension is changed to 288 when using the multi-frame training strategy. Why is this necessary? It does not seem to be mentioned in the paper.
This was necessary to apply a spatiotemporal encoding to the input pixels. VisTR did something similar for Video Instance Segmentation and increased its hidden size to 384. The spatiotemporal encoding of height, width and time splits the hidden channels evenly across the three axes, so the hidden size must be divisible by three. There are additional constraints as well, for example that the hidden size must be divisible by the number of attention heads. If you need to stick with a hidden size of 256, you could try applying a learned temporal encoding, as done in this project.
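To make the divisibility constraints concrete, here is a minimal sketch in plain Python. It is not the code from this repository: the function names, the default of 8 attention heads, the temperature of 10000 and the axis order (height, width, time) are all assumptions for illustration. It also assumes a sine/cosine encoding, which needs an even number of channels per axis.

```python
import math


def is_valid_hidden_size(hidden_dim, num_heads=8, num_axes=3):
    """Check the constraints discussed above (num_heads=8 is an assumption)."""
    if hidden_dim % num_axes != 0:   # channels split evenly over height, width, time
        return False
    if hidden_dim % num_heads != 0:  # each attention head gets an equal slice
        return False
    # Sine/cosine pairs need an even number of channels per axis.
    return (hidden_dim // num_axes) % 2 == 0


def sine_encoding_1d(pos, num_channels, temperature=10000.0):
    """Sine/cosine encoding of one scalar position (hypothetical helper)."""
    enc = []
    for i in range(num_channels // 2):
        freq = temperature ** (2 * i / num_channels)
        enc.append(math.sin(pos / freq))
        enc.append(math.cos(pos / freq))
    return enc


def spatiotemporal_encoding(y, x, t, hidden_dim):
    """Concatenate one encoding per axis; each axis gets hidden_dim // 3 channels."""
    per_axis = hidden_dim // 3
    return (
        sine_encoding_1d(y, per_axis)
        + sine_encoding_1d(x, per_axis)
        + sine_encoding_1d(t, per_axis)
    )


# Hidden sizes between 256 and 399 that satisfy all three checks.
candidates = [d for d in range(256, 400) if is_valid_hidden_size(d)]
```

Under these assumptions, `is_valid_hidden_size(288)` and `is_valid_hidden_size(384)` hold while `is_valid_hidden_size(256)` does not, and `spatiotemporal_encoding(0, 1, 2, 256)` yields only 252 values, so it could not be added to a 256-dimensional feature vector.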