T-Rex About Visual Prompt Encoder.

Open fuweifu-vtoo opened this issue 6 months ago • 3 comments

Dear author, I have another question for you:

In the Visual Prompt Encoder, are three deformable cross-attention layers stacked first, followed by a single self-attention layer and a single FFN?

Or are three blocks stacked, each consisting of deformable cross-attention + self-attention + FFN?
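
To make the two readings concrete, here is a minimal sketch that only spells out the layer ordering each option implies. The function names and layer labels are hypothetical, chosen for illustration; this is not the T-Rex implementation and does not say which layout the authors actually use.

```python
# Hypothetical layer labels, used only to compare the two orderings.
CROSS = "deformable_cross_attn"
SELF = "self_attn"
FFN = "ffn"


def layout_a(num_cross_attn=3):
    """Option A: N deformable cross-attention layers, then one self-attention and one FFN."""
    return [CROSS] * num_cross_attn + [SELF, FFN]


def layout_b(num_blocks=3):
    """Option B: N blocks, each made of deformable cross-attention + self-attention + FFN."""
    return [CROSS, SELF, FFN] * num_blocks


if __name__ == "__main__":
    print("Option A:", " -> ".join(layout_a()))
    print("Option B:", " -> ".join(layout_b()))
```

Under these assumptions, option A contains five sub-layers in total (one self-attention, one FFN), while option B contains nine (three of each), so the two readings differ in both depth and parameter count.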

Aug 09 '24 09:08 fuweifu-vtoo