XProNet

Are the cross-modal feature and the cross-modal representation vector the same?

Open DanyangCheng opened this issue 1 year ago • 2 comments

In your paper you write: "we concatenate the visual and textual representations to form the cross-modal features $$r\in \mathbb{R}^{1\times D}$$", but the formula below it reads $$o_u=Concat(o_u^{i(f)},o_u^t)$$. Are they the same vector? Also, in the formula $$PM(k,i)=\frac{1}{N_{k,i}^s}\sum_{j=0}^N r_j^{k,i}$$, what does $$N_{k,i}^s$$ mean? I couldn't find these details in the source code.

My understanding is that you first extract the visual and textual representations and concatenate them to form the cross-modal feature $$r_u=Concat(o_u^{i(f)},o^t_u)$$, then group the features into $$N_l$$ sets { $$R_k;0 \le k \le N_l$$ } according to the sample label, and then apply K-Means to each $$R_k$$, splitting it into $$N^p$$ clusters. Finally, the average of the vectors within each cluster is taken as the prototype vector $$PM(k,i)$$. Is this understanding correct?

DanyangCheng avatar Nov 13 '24 12:11 DanyangCheng

Hi, thank you for your interest in our work. Both o and r are cross-modal features. We use two symbols because o_u is associated with a specific sample u, while r indexes the cross-modal features after clustering.

Sorry, $N^s_{k,i}$ is a typo; it should be $N^d_{k,i}$.
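For reference, this is the formula quoted in the question with only the superscript changed to $d$ and everything else left as quoted:

$$PM(k,i)=\frac{1}{N_{k,i}^d}\sum_{j=0}^N r_j^{k,i}$$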

You are right: the prototype initialization procedure is the same as the one you summarized.
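The steps in that summary can be sketched roughly as follows. This is an illustrative sketch only, not the code from the XProNet repository. The function names `kmeans` and `init_prototypes` are hypothetical, a plain Lloyd's K-Means stands in for whichever K-Means implementation is actually used, and it assumes one integer label per sample and at least `n_protos` samples per label.

```python
import numpy as np


def kmeans(points, n_clusters, n_iters=50, seed=0):
    """Plain Lloyd's K-Means (a stand-in, not the repository's version).

    Returns (assign, centres): the cluster index of each row and the centres.
    """
    rng = np.random.default_rng(seed)
    # Initialise the centres from distinct, randomly chosen samples.
    centres = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest centre (squared Euclidean distance).
        dists = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = points[assign == c]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                centres[c] = members.mean(axis=0)
    return assign, centres


def init_prototypes(visual, textual, labels, n_protos):
    """Return prototypes of shape (number of labels, n_protos, D)."""
    labels = np.asarray(labels)
    # r_u = Concat(o_u^{i(f)}, o_u^t): one cross-modal vector per sample.
    r = np.concatenate([visual, textual], axis=1).astype(float)
    label_ids = np.unique(labels)
    pm = np.zeros((len(label_ids), n_protos, r.shape[1]))
    for row, k in enumerate(label_ids):
        r_k = r[labels == k]  # R_k: all features that share label k
        assign, centres = kmeans(r_k, n_protos)
        for i in range(n_protos):
            cluster = r_k[assign == i]
            # PM(k, i): average of the vectors in cluster i of label k
            # (fall back to the K-Means centre if the cluster is empty).
            pm[row, i] = cluster.mean(axis=0) if len(cluster) else centres[i]
    return pm
```

Calling `init_prototypes(visual, textual, labels, n_protos)` with arrays of shape `(N, D_v)`, `(N, D_t)` and `(N,)` returns one row of `n_protos` prototypes per distinct label, each of dimension `D_v + D_t`.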

I hope this helps you resolve the problem.

Best Regards, Jun

Markin-Wang avatar Nov 15 '24 14:11 Markin-Wang

Your reply helped me a lot, and your work is great.

DanyangCheng avatar Nov 17 '24 09:11 DanyangCheng