open_clip

CoCa multimodal transformer layer implementation

Open — ebsmothers opened this issue 2 years ago · 1 comment

Hi, thanks for your CoCa implementation! I have a question about the multimodal transformer: in a typical decoder layer I would expect to see self-attention, then cross-attention, then an MLP. But here a single layer seems to do self-attention, MLP, cross-attention, and then another MLP (since both resblock and cross_attn have their own MLP). Is there a specific reason for doing it this way? Thanks in advance.
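For concreteness, here is roughly what I mean as a minimal sketch (not the actual open_clip code; module names, the pre-norm residual layout, and the omission of the causal mask are all illustrative assumptions on my part):

```python
import torch.nn as nn


def mlp(d_model: int) -> nn.Sequential:
    """Standard 4x-expansion feedforward block."""
    return nn.Sequential(
        nn.Linear(d_model, 4 * d_model),
        nn.GELU(),
        nn.Linear(4 * d_model, d_model),
    )


class ClassicDecoderLayer(nn.Module):
    """What I expected: self-attention -> cross-attention -> MLP (one MLP per layer)."""

    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.mlp = mlp(d_model)
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, mem):
        y = self.ln1(x)
        x = x + self.self_attn(y, y, y, need_weights=False)[0]  # causal mask omitted for brevity
        x = x + self.cross_attn(self.ln2(x), mem, mem, need_weights=False)[0]
        x = x + self.mlp(self.ln3(x))
        return x


class ResblockThenCrossAttn(nn.Module):
    """What the code seems to do: (self-attention + MLP) resblock, then a
    cross-attention block that carries its own MLP, i.e. two MLPs per layer."""

    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.resblock_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.resblock_mlp = mlp(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.cross_mlp = mlp(d_model)
        self.ln = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, x, mem):
        y = self.ln[0](x)
        x = x + self.resblock_attn(y, y, y, need_weights=False)[0]  # causal mask omitted
        x = x + self.resblock_mlp(self.ln[1](x))
        x = x + self.cross_attn(self.ln[2](x), mem, mem, need_weights=False)[0]
        x = x + self.cross_mlp(self.ln[3](x))
        return x
```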

ebsmothers · Jul 19 '23 01:07

Hi @ebsmothers, the main reason is that this was largely inspired by https://github.com/lucidrains/CoCa-pytorch/blob/main/coca_pytorch/coca_pytorch.py, which uses a parallel feedforward instead of the classic sequential one in both the self-attention and cross-attention blocks.
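The idea of a parallel feedforward is that the attention output and the feedforward output are computed from the same normalized input and summed into the residual, rather than applied one after the other. A minimal sketch of that pattern (not the exact lucidrains or open_clip code; names and the pre-norm layout are assumptions):

```python
import torch.nn as nn


class ParallelBlock(nn.Module):
    """Attention and feedforward branches computed in parallel from the same
    normalized input and both added to the residual stream."""

    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, context=None):
        y = self.norm(x)
        # Self-attention when no context is given; cross-attention draws
        # keys/values from the encoded context instead.
        kv = y if context is None else context
        return x + self.attn(y, kv, kv, need_weights=False)[0] + self.ff(y)
```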

gpucce · Aug 07 '23 18:08