CogVideo

About 3D Swin Attention

Open lemon-prog123 opened this issue 2 years ago • 1 comment

In your description of the dual-channel attention, you add the patches from attention-base and attention-plus at the end. But in the original 3D Swin attention, videos are divided into 3D patches, which cannot be added directly to 2D patches. Did you just divide the frames into 2D patches and apply the 3D Swin attention method?

lemon-prog123, Aug 07 '22 02:08

Hi, the different attention channels are computed independently and are then added together token by token rather than patch by patch. As mentioned in Sec. 3.2 of our paper, the temporal channel (attention-plus) has window size (A_x, A_y, T_s), so we use 3D Swin attention there; the spatial channel (attention-base) has window size (X, Y, 1), so we use 2D attention within each frame, in parallel across frames.
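To make the token-level addition concrete, here is a minimal NumPy sketch of the idea. It is not the CogVideo implementation: the function names (`attend`, `spatial_channel`, `temporal_channel`, `dual_channel`) are hypothetical, and it assumes single-head attention with no learned projections, non-overlapping windows without the Swin shift, no autoregressive mask, and a plain unweighted sum of the two channels. The point is only that both channels return one vector per token on the same (T, X, Y) grid, so their outputs can be added regardless of how the windows were shaped.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(tokens):
    # Simplified self-attention over the second-to-last axis.
    # tokens: (..., n, d); tokens serve as queries, keys, and values
    # (assumption: no learned projections, single head).
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores) @ tokens

def spatial_channel(x):
    # Window (X, Y, 1): every frame attends only within itself.
    T, X, Y, d = x.shape
    out = attend(x.reshape(T, X * Y, d))
    return out.reshape(T, X, Y, d)

def temporal_channel(x, ax, ay, ts):
    # Window (ax, ay, ts): non-overlapping 3D windows (assumption: no shift).
    T, X, Y, d = x.shape
    assert T % ts == 0 and X % ax == 0 and Y % ay == 0
    w = x.reshape(T // ts, ts, X // ax, ax, Y // ay, ay, d)
    w = w.transpose(0, 2, 4, 1, 3, 5, 6)      # (nT, nX, nY, ts, ax, ay, d)
    w = w.reshape(T // ts, X // ax, Y // ay, ts * ax * ay, d)
    out = attend(w)                           # attention inside each window
    out = out.reshape(T // ts, X // ax, Y // ay, ts, ax, ay, d)
    out = out.transpose(0, 3, 1, 4, 2, 5, 6)  # (nT, ts, nX, ax, nY, ay, d)
    return out.reshape(T, X, Y, d)

def dual_channel(x, ax, ay, ts):
    # Both outputs have shape (T, X, Y, d), so they add token by token.
    return spatial_channel(x) + temporal_channel(x, ax, ay, ts)

x = np.random.default_rng(0).normal(size=(4, 4, 4, 8))  # (T, X, Y, d)
y = dual_channel(x, ax=2, ay=2, ts=2)
print(y.shape)
```

In this sketch, setting the 3D window to `(X, Y, 1)` makes `temporal_channel` reduce to `spatial_channel`, which illustrates that the two channels differ only in which tokens each token is allowed to attend to.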

wenyihong, Aug 11 '22 09:08