EVF-SAM
BEIT-3-Large - Layer fusion
Hi, thanks for your great work exploring BEIT as an alternative to CLIP.
I find it very well motivated in the paper, but I struggle to reproduce the BEIT-3 results in my independent training codebase. So far I can match or surpass the CLIP results, and adding CLIP_Image in Late Concat is beneficial.
However, so far BEIT-3 underperforms CLIP, so I'm wondering if I am missing something.
For your BEIT experiments, what do you mean by Late Concat, Early(L1-L12), and Early(L1-L24)? I can't find a reference to these in the code, nor in the beit repo or the torchscale repo. If you could share a code sample, it would really help to articulate your point.
Thank you for your time
Hi, thank you for reproducing our work! Our BEIT experiments are meant to prove the effectiveness of "early fusion". "Late" means using BEIT-3 to extract separate single-modal features and concatenating them. "Early(L1-L12)" means we enable cross-modal attention only in layers 1~12 of BEIT-3. "Early(L1-L24)" means we enable cross-modal attention in all layers of BEIT-3, which is the original BEIT-3. We implement this by manually adding a masked attention map in the BEIT-3 source code. If you can provide part of your training codebase (either in an issue or by email), I can help you look for problems and fix bugs. However, due to company policy, I cannot directly upload our training codebase.
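For illustration, here is a minimal sketch of this kind of per-layer attention masking. This is not our exact modification to the BEIT-3 source; the helper name `build_fusion_mask` and the token counts are made up for the example, and the mask would be added to each layer's attention logits before the softmax.

```python
import torch

def build_fusion_mask(num_img_tokens, num_txt_tokens, allow_cross):
    """Build an additive attention mask for one transformer layer.

    When allow_cross is False, image tokens may only attend to image
    tokens and text tokens only to text tokens, so the layer behaves
    like two single-modal layers ("late" behaviour). When allow_cross
    is True, the mask is all zeros and the layer behaves like the
    original BEIT-3 layer ("early" fusion).
    """
    total = num_img_tokens + num_txt_tokens
    mask = torch.zeros(total, total)
    if not allow_cross:
        # -inf entries are removed by the softmax, blocking cross-modal attention
        mask[:num_img_tokens, num_img_tokens:] = float("-inf")
        mask[num_img_tokens:, :num_img_tokens] = float("-inf")
    return mask

# Example: "Early(L1-L12)" on a 24-layer BEIT-3-Large:
# cross-modal attention is enabled only in the first 12 layers.
num_layers = 24
masks = [
    build_fusion_mask(num_img_tokens=197, num_txt_tokens=32,
                      allow_cross=(layer_idx < 12))
    for layer_idx in range(num_layers)
]
# Inside each layer's multi-head attention the mask is applied as
# scores = scores + masks[layer_idx] before the softmax.
```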
@CoderZhangYx Thank you for the swift reply! You've cleared up a good deal of confusion for me. I'm not sure I'll be able to share code, but it's great to have that option.
For your experiments with CLIP, did you also unfreeze the model?
We freeze SAM during the CLIP experiments.