Inquiry on the "gated cross-modality interaction"
Hi! Thanks for open-sourcing APE, it is fantastic! 👍
I am new to the field of open-vocabulary vision foundation models, and while going through your paper I ran into some questions about the "gated cross-modality interaction". I hope you can share your insights on a few points.
I understand that the interaction of image features and text features in GLIP causes expensive computation. But I couldn't follow the part about the "all-zero token", quoted below:
> Instead, an all-zero token `P_zero` serves as a special text embedding and inputs to the fusion module for all given vocabularies. In this situation, the fusion process is "static", as no language information is injected into vision features. The `P_zero` could provide explicit instructions to recognize primitive concepts and slightly tune vision feature `V_voc` and retain original language feature `P_voc`.
- How does it work? I mean, how does an all-zero token provide instructions to recognize concepts?
- In this paragraph, it seems that this token is only applied to word prompts and not used for sentence prompts? But in Figure 2, the zero token interacts with sentence prompts. Am I missing something?
- Where is the corresponding code for Pzero? Is it https://github.com/shenyunhang/APE/blob/main/ape/modeling/ape_deta/deformable_detr_segm.py#L220 ?
Sorry for the late response.
- As the all-zero token is different from all other text tokens and provides no language information, it acts as an explicit signal, so the model may become aware that it should perform the OVD and OVS tasks (see the sketch after this list).
- We only use this token for vocabulary prompts. It can also be used with sentence prompts, but in that case it has no effect.
- `deformable_detr_segm.py` is the no-fusion model; the fusion model is `deformable_detr_segm_vl.py`. The all-zero token is `self.name_prompt_fusion_feature`, and the corresponding code is here: https://github.com/shenyunhang/APE/blob/main/ape/modeling/ape_deta/deformable_detr_segm_vl.py#L158
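To make the mechanism concrete, here is a minimal, illustrative sketch of the "static" fusion idea, not APE's actual implementation (that lives in `deformable_detr_segm_vl.py` as linked above). The class name `GatedFusionSketch`, the single cross-attention layer standing in for the full bi-directional fusion module, and all dimension arguments are assumptions made only for this example:

```python
import torch
import torch.nn as nn


class GatedFusionSketch(nn.Module):
    """Illustrative sketch of the all-zero-token ("static") fusion idea.

    NOT the APE implementation: a single cross-attention layer stands in
    for the full bi-directional vision-language fusion module.
    """

    def __init__(self, d_vision: int, d_text: int, n_heads: int = 8):
        super().__init__()
        # Vision queries attend to text keys/values (the fusion direction
        # that injects language information into vision features).
        self.vis_from_text = nn.MultiheadAttention(
            d_vision, n_heads, kdim=d_text, vdim=d_text, batch_first=True
        )

    def forward(
        self,
        vision_feats: torch.Tensor,  # (B, N_vis, d_vision)
        text_feats: torch.Tensor,    # (B, N_txt, d_text)
        use_zero_token: bool = True,
    ):
        if use_zero_token:
            # P_zero: one all-zero token shared across all vocabularies.
            # Fusion becomes "static": the same signal is injected no
            # matter which words are in the vocabulary.
            p_zero = vision_feats.new_zeros(
                vision_feats.size(0), 1, text_feats.size(-1)
            )
            delta, _ = self.vis_from_text(vision_feats, p_zero, p_zero)
        else:
            # GLIP-style dynamic fusion: real text embeddings interact with
            # vision features (expensive when the vocabulary is large).
            delta, _ = self.vis_from_text(
                vision_feats, text_feats, text_feats
            )

        # Vision features are only slightly tuned (V_voc); the original
        # language features are returned unchanged (P_voc is retained).
        return vision_feats + delta, text_feats
```

The key point is the `use_zero_token` branch: since `p_zero` is identical for every vocabulary, the cross-attention output is the same adjustment regardless of the prompt content, so no per-vocabulary language information leaks into the vision features, while the fusion layers can still tune them and act as a consistent "instruction" signal.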