T-Rex
About Visual Prompt Encoder.
Dear author, I have another question for you:
In the Visual Prompt Encoder, are three deformable cross-attention layers stacked first, followed by a single self-attention layer and a single FFN?
Or are three blocks stacked, each consisting of deformable cross-attention + self-attention + FFN?
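
To make the two readings concrete, here is a minimal sketch of the layer order under each one. The layer names and the functions `layout_a` / `layout_b` are hypothetical labels I made up for illustration; they are not taken from the T-Rex code, and I do not know which layout (if either) matches the actual implementation.

```python
# Hypothetical layer labels, used only to spell out the two readings of the question.
DEFORM = "deformable_cross_attention"
SELF_ATTN = "self_attention"
FFN = "ffn"


def layout_a(num_layers=3):
    """Reading A: stack the cross-attention layers, then one self-attention and one FFN."""
    return [DEFORM] * num_layers + [SELF_ATTN, FFN]


def layout_b(num_layers=3):
    """Reading B: repeat the whole (cross-attention, self-attention, FFN) block."""
    return [DEFORM, SELF_ATTN, FFN] * num_layers


if __name__ == "__main__":
    print("A:", " -> ".join(layout_a()))
    print("B:", " -> ".join(layout_b()))
```

With three layers, reading A has one self-attention and one FFN in total, while reading B has three of each, so the two differ in both depth and parameter count.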