
Number of tokens differs from the paper

daniel-code opened this issue 2 years ago

The number of tokens reported in the paper is 559 (Section 4.1), but my implementation produces 539.

  • 8 x 8 x 8 with a stride of (16, 32, 32)
  • 16 x 4 x 4 with a stride of (6, 32, 32) and an offset of (4, 8, 8)
  • 4 x 12 x 12 with a stride of (16, 32, 32) and an offset of (0, 16, 16)
  • 1 x 16 x 16 with a stride of (32, 16, 16).

For an input of 32 x 224 x 224, this results in only 559 tokens

The number of tokens in my implementation:

  • 8 x 8 x 8 with a stride of (16, 32, 32) -> 98
  • 16 x 4 x 4 with a stride of (6, 32, 32) and an offset of (4, 8, 8) -> 147
  • 4 x 12 x 12 with a stride of (16, 32, 32) and an offset of (0, 16, 16) -> 98
  • 1 x 16 x 16 with a stride of (32, 16, 16) -> 196

The total number of tokens is 98 + 147 + 98 + 196 = 539.
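The counts above can be reproduced with a short sketch (an assumption here: the offset shifts the first window, so each dimension contributes floor((size - offset - kernel) / stride) + 1 positions, matching the "valid" Conv3d shape formula):

```python
import math

# Token count for one tube over a clip of shape (T, H, W) = (32, 224, 224).
# Per dimension: floor((size - offset - kernel) / stride) + 1 window positions.
def tube_tokens(video_shape, kernel, stride, offset):
    return math.prod(
        (v - o - k) // s + 1
        for v, k, s, o in zip(video_shape, kernel, stride, offset)
    )

video = (32, 224, 224)
tubes = [
    # (kernel, stride, offset)
    ((8, 8, 8),   (16, 32, 32), (0, 0, 0)),
    ((16, 4, 4),  (6, 32, 32),  (4, 8, 8)),
    ((4, 12, 12), (16, 32, 32), (0, 16, 16)),
    ((1, 16, 16), (32, 16, 16), (0, 0, 0)),
]
counts = [tube_tokens(video, k, s, o) for k, s, o in tubes]
print(counts, sum(counts))  # [98, 147, 98, 196] 539
```

This reproduces the per-tube counts listed above and the 539 total, not the paper's 559.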

daniel-code avatar Feb 26 '23 07:02 daniel-code

Hi, I think this may be a bug: the number of tubes is not counted correctly in the code. Specifically, these lines: https://github.com/daniel-code/TubeViT/blob/main/tubevit/model.py#L219C1-L220C1

should be changed to:

        # per-dimension output size: floor((size - offset - kernel) / stride) + 1
        output = np.floor(
            (
                (self.video_shape[[1, 2, 3]] - offset - kernel_size) /
                stride
            ) + 1
        ).astype(int)

This is supported by the Shape section under: https://pytorch.org/docs/stable/generated/torch.nn.Conv3d.html#conv3d, and the use of offset here: https://github.com/daniel-code/TubeViT/blob/main/tubevit/model.py#L79
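As a sanity check on that formula (assuming the offset simply shifts where the first window starts), the closed form can be compared against a brute-force enumeration of valid window positions:

```python
# Brute-force count of windows of length `kernel` starting at `offset`,
# advancing by `stride`, that fit entirely inside an axis of length `size`.
def count_windows(size, kernel, stride, offset):
    count, start = 0, offset
    while start + kernel <= size:
        count += 1
        start += stride
    return count

# The closed form floor((size - offset - kernel) / stride) + 1 must agree
# for every configuration where at least one window fits.
for size in range(1, 64):
    for kernel in range(1, size + 1):
        for stride in range(1, 8):
            for offset in range(0, size - kernel + 1):
                formula = (size - offset - kernel) // stride + 1
                assert formula == count_windows(size, kernel, stride, offset)
```

This agrees with the Shape section of the Conv3d docs (with offset playing the role of slicing the input before the convolution, as in model.py#L79).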

MichaHK avatar Mar 20 '24 15:03 MichaHK

Thank you for reporting the issue. The function has been fixed by #26.

This function is only used to calculate the positional embedding. The tokens themselves are generated by F.conv3d in SparseTubesTokenizer, so the number of tokens still does not match the paper (539 vs. 559).

After discussing with the authors: the original paper was implemented in TensorFlow, and they used conv3d with padding="same". I tried padding the input with different values but could not get the same number of tokens.
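For reference, one plausible reading of TensorFlow's padding="same" (an assumption here: offsets are ignored and each dimension contributes ceil(size / stride) positions) also fails to land on 559:

```python
import math

# Token count per tube under TF-style padding="same":
# output size per dimension is ceil(input / stride), independent of kernel size.
# Assumption: offsets are ignored in this reading.
def same_tokens(video_shape, stride):
    return math.prod(math.ceil(v / s) for v, s in zip(video_shape, stride))

video = (32, 224, 224)
strides = [(16, 32, 32), (6, 32, 32), (16, 32, 32), (32, 16, 16)]
same_counts = [same_tokens(video, s) for s in strides]
print(same_counts, sum(same_counts))  # [98, 294, 98, 196] -> 686
```

So "same" padding under this interpretation gives 686 tokens, which also does not match the paper's 559, consistent with the observation above.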


daniel-code avatar Mar 21 '24 12:03 daniel-code