
CoCa: fix MultimodalTransformer init + Mask CLS token at end of seq

iejMac opened this issue 1 year ago • 10 comments

iejMac avatar Jun 26 '23 13:06 iejMac

@gpucce thoughts? is there some way init_parameters could've somehow been called?

iejMac avatar Jun 26 '23 13:06 iejMac

No, I think it used the default ones. I think the VisionTransformer doesn't call it either?

I mean, it calls it, but the call does nothing.

gpucce avatar Jun 26 '23 13:06 gpucce

but the text tower does call it, and what also confuses me is how text_projection works, since it's initialized with torch.empty (https://github.com/mlfoundations/open_clip/blob/fb72f4db1b17133befd6c67c9cf32a533b85a321/src/open_clip/transformer.py#L674)
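To illustrate what I mean, a tiny sketch (the width is made up, and the normal_ call just mirrors the kind of scaled init the text tower uses, it's not the actual code):

```python
import torch
from torch import nn

width, output_dim = 512, 512

# what the multimodal text_projection currently gets: an uninitialized tensor,
# its values are whatever happened to be in memory
text_projection = nn.Parameter(torch.empty(width, output_dim))

# what an explicit init would look like, roughly following the text tower's scheme
nn.init.normal_(text_projection, std=width ** -0.5)
```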

iejMac avatar Jun 26 '23 13:06 iejMac

first fix the CI, then see my comment in #550

rom1504 avatar Jun 26 '23 14:06 rom1504

@iejMac I added one more change that should make this ready for the tentative retraining

gpucce avatar Jun 28 '23 12:06 gpucce

here is the run for this PR - https://wandb.ai//iejmac/open-clip/reports/CoCa-v2-fixes-comparison--Vmlldzo0NzY0ODIy

iejMac avatar Jun 29 '23 11:06 iejMac

@rom1504 thoughts on this? These changes don't break current checkpoints, fix one issue, and actually initialize the MultimodalTransformer. I can try to start a run on Stability.

iejMac avatar Jul 19 '23 13:07 iejMac

@rwightman @rom1504 @iejMac hi, I worked on this PR; as it stands it has a few changes in the tests, adds transformers compatibility, and fixes the issues. This is the best-working initialization I have found, checking only up to 18 epochs.

If you have some time to take a look, that would be useful.

@JeniaJitsev I used a bit of laionize for this, but hopefully it will have a positive effect on mammut too.

This is the report of the first 18 epochs compared to the old coca run https://wandb.ai/gpucce/coca_tests/reports/CoCa-V2-ViT-B-32---Vmlldzo1MDc4NTkz

gpucce avatar Aug 07 '23 18:08 gpucce

@gpucce discussing here so I might possibly combine this with the #660 checks. This was days before my second child was born, so yeah, it got lost in the stack, but I did take a peek (and subsequently forgot).

The cls mask change, what does it do to existing trained CoCa models?

The weight init was commented out in the ViT because that's how OpenAI left it for the vision tower (and I thought we might try a different init some day); it relied on the default PyTorch init. But there they used randn for all the Parameter attributes.

The empty is indeed a WTF. I feel like following the text encoder init is a better default than going fully with the PyTorch defaults for the multimodal decoder though, right? Any reason why you wanted to comment it all out? We could just add the call to init_parameters() and tweak the projection init if zeros is more desirable.
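Something along these lines, as a rough sketch only (the helper name is made up, and the block attribute names assume the usual ResidualAttentionBlock layout with an nn.MultiheadAttention attn and a c_fc/c_proj MLP, so this is not the actual transformer.py code):

```python
from torch import nn

def init_multimodal_decoder(resblocks, width, layers, text_projection=None, zero_proj=True):
    # scaled-normal scheme, mirroring what the text encoder's init does
    proj_std = (width ** -0.5) * ((2 * layers) ** -0.5)
    attn_std = width ** -0.5
    fc_std = (2 * width) ** -0.5
    for block in resblocks:
        nn.init.normal_(block.attn.in_proj_weight, std=attn_std)
        nn.init.normal_(block.attn.out_proj.weight, std=proj_std)
        nn.init.normal_(block.mlp.c_fc.weight, std=fc_std)
        nn.init.normal_(block.mlp.c_proj.weight, std=proj_std)
    if text_projection is not None:
        if zero_proj:
            nn.init.zeros_(text_projection)  # the "zeros" option
        else:
            nn.init.normal_(text_projection, std=width ** -0.5)  # text-tower style
```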

rwightman avatar Oct 11 '23 19:10 rwightman

The cls mask thing appears not to affect the performance of existing models in either zero-shot classification or captioning. This is sort of expected, because as it is it only prevents a few tokens from attending to the cls token, while the other way around works fine. I can rerun the evals to confirm.
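Just to be explicit about the intended behaviour, a toy example (this is not the actual build_cls_mask code; pad_id and shapes are made up):

```python
import torch

pad_id = 0
text = torch.tensor([[5, 7, 9, pad_id, pad_id]])   # (batch, seq_len), pads at the end
seq_len = text.shape[1] + 1                        # +1 for the cls token appended at the end

# boolean "is allowed to attend" mask, rows are queries and columns are keys
keep = torch.ones(text.shape[0], seq_len, seq_len, dtype=torch.bool)
not_pad = torch.cat([text != pad_id, torch.ones_like(text[:, :1], dtype=torch.bool)], dim=1)
keep &= not_pad.unsqueeze(1)                       # nobody attends to pad positions

additive_mask = torch.zeros(keep.shape)
additive_mask.masked_fill_(~keep, float("-inf"))
# last column has no -inf: every real token may attend to cls;
# last row (cls as query) only masks out the pads.
```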

The .empty was my mistake from the final refactoring; I had an nn.Linear in the coca model and lost it while moving it into transformer.py.

About the init, I was going along with the vision tower, but indeed doing the same as for text could be better; I couldn't run more experiments. Even for the zero init, it could be that a small enough randn is slightly better. If you decide on the text init plus zeros, I can run a B-32 for some epochs as a perf test on your PR branch, so it works as a small sanity check of the PR's effect on the model.

gpucce avatar Oct 11 '23 19:10 gpucce