Yiyuan Zhang

17 comments by Yiyuan Zhang

Zero-shot evaluation requires a text encoder.
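To illustrate why a text encoder is needed, here is a minimal sketch of zero-shot classification: each class name is turned into a text embedding, and an input feature is assigned to the class whose embedding is most similar. The vectors below are made-up stand-ins for encoder outputs, and `zero_shot_predict` is a hypothetical helper name, not something from the released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(feature, class_text_embeddings):
    """Return the class name whose text embedding is closest to `feature`."""
    return max(
        class_text_embeddings,
        key=lambda name: cosine(feature, class_text_embeddings[name]),
    )


# Assumption: these would come from a text encoder applied to prompts such as
# "a photo of a dog". Here they are hand-written toy vectors.
class_text_embeddings = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}

# Assumption: a feature from the non-text (e.g. image) encoder, toy values.
image_feature = [0.2, 0.8, 0.1]

prediction = zero_shot_predict(image_feature, class_text_embeddings)
print(prediction)
```

Without the text side there is nothing to compare the feature against, which is why no class labels can be scored in a zero-shot setting.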

Exactly, we use CLIP for pretraining.

Maybe. I think the key is the proposed tokenizer.

We will release the training code soon. Please stay tuned.

Exactly. They are just randomly initialized vanilla positional embeddings.
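As a rough sketch of what "randomly initialized vanilla positional embeddings" usually means: a table with one vector per position, sampled from a small Gaussian, is added to the token embeddings. The standard deviation of 0.02, the fixed seed, and the function names are assumptions for illustration only; in a real model the table would be a trainable parameter.

```python
import random


def init_positional_embeddings(max_len, dim, std=0.02, seed=0):
    """Build a (max_len x dim) table of Gaussian-initialized position vectors.

    Assumption: std=0.02 is a common choice, not a value taken from the source.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(max_len)]


def add_positions(tokens, pos_table):
    """Add the position vector for index i to the i-th token embedding."""
    if len(tokens) > len(pos_table):
        raise ValueError("sequence is longer than the positional table")
    return [
        [t + p for t, p in zip(token, pos)]
        for token, pos in zip(tokens, pos_table)
    ]


pos_table = init_positional_embeddings(max_len=8, dim=4)

# Three all-zero token embeddings make the added position vectors visible.
tokens = [[0.0] * 4 for _ in range(3)]
with_positions = add_positions(tokens, pos_table)
```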

These paired embeddings share the same weights to label the corresponding text-paired datasets.

Because they're two matrices of text and multimodal features. Their dot products are transposes of each other, so the summations over the columns and over the rows are different, especially dealing with a...
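A small numeric sketch can make this concrete. It assumes the context is a softmax-based contrastive loss with matched pairs on the diagonal, which is an assumption about the setting and not a copy of the MiCo code; the feature values and function names are made up. The text-to-multimodal and multimodal-to-text similarity matrices are transposes of each other, but the softmax normalizes over rows in one case and over columns of the same matrix in the other, so the two losses generally differ.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def similarity(a, b):
    """S[i][j] = dot product of a[i] and b[j]."""
    return [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def contrastive_loss(sim):
    """Mean cross-entropy with the diagonal as target, normalized per row."""
    n = len(sim)
    return -sum(math.log(softmax(row)[i]) for i, row in enumerate(sim)) / n


# Toy features (assumption): three text vectors and three multimodal vectors.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
multi = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.4]]

sim_t2m = similarity(text, multi)   # rows index text samples
sim_m2t = similarity(multi, text)   # rows index multimodal samples

loss_t2m = contrastive_loss(sim_t2m)  # sums over each row of sim_t2m
loss_m2t = contrastive_loss(sim_m2t)  # sums over each column of sim_t2m

print(sim_m2t == transpose(sim_t2m))
print(loss_t2m, loss_m2t)
```

The two matrices hold the same numbers, yet each softmax denominator sums a different set of entries, which is why both directions are typically computed separately.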

The match process is provided in our released code: https://github.com/invictus717/MiCo/blob/89c91c9dac68125a18a1a966bd80f9e74e584e80/model/mico.py#L44