Yiyuan Zhang
add --user
Sorry, I did not try this.
Zero-shot evaluation requires a text encoder.
Exactly, we use CLIP for pretraining.
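To illustrate why a text encoder is needed for zero-shot evaluation: class names are encoded as prompt embeddings and an input is assigned to the most similar one. A minimal NumPy sketch, assuming the features have already been produced by the (frozen) CLIP text encoder and the pretrained multimodal encoder; all names and the random features here are illustrative, not from the MiCo code:

```python
import numpy as np

# Hypothetical precomputed features: in practice these come from the
# CLIP text encoder (one embedding per class prompt) and the encoder
# being evaluated (one embedding per input sample).
rng = np.random.default_rng(0)
class_text_feat = rng.normal(size=(3, 16))  # 3 class prompts, dim 16
sample_feat = rng.normal(size=(16,))        # one input sample


def normalize(x, axis=-1):
    # L2-normalize so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


# Zero-shot prediction: pick the class whose prompt embedding is
# most similar to the sample embedding.
sims = normalize(class_text_feat) @ normalize(sample_feat)
pred = int(np.argmax(sims))
```

This is why dropping the text encoder breaks zero-shot evaluation: without the prompt embeddings there is nothing to compare the sample features against.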
Maybe, I think the key is the proposed tokenizer.
We will release the training code soon. Please stay tuned.
Exactly. They are just randomly initialized vanilla positional embeddings.
These paired embeddings share the same weights to label the corresponding text-paired datasets.
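The weight sharing described above can be sketched in a few lines: one randomly initialized positional embedding is added to both streams of a paired sample, so matched positions carry the same label. A minimal NumPy sketch with illustrative names and sizes (not the actual MiCo implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 5, 8

# A single randomly initialized positional embedding, shared (same
# weights) across the two paired modalities.
pos_embed = rng.normal(size=(seq_len, dim))

# Hypothetical token features for one text-paired sample
text_tokens = rng.normal(size=(seq_len, dim))
video_tokens = rng.normal(size=(seq_len, dim))

# The identical embedding is added to both streams, so position i in
# one modality is labeled the same way as position i in the other.
text_in = text_tokens + pos_embed
video_in = video_tokens + pos_embed
```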
Because they're two matrices of text and multimodal features, their dot products are transposes of each other. So the summations over columns and rows are different, especially when dealing with a...
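The point above can be made concrete: the text-to-multimodal similarity matrix and its transpose contain the same entries, but the softmax (and hence the cross-entropy) normalizes along different axes, so the two directional losses generally differ and are averaged in a symmetric CLIP-style objective. A minimal NumPy sketch with illustrative names, not the code from mico.py:

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def ce_diag(logits):
    # Cross-entropy where row i's target is column i (the matched pair)
    probs = softmax(logits, axis=-1)
    return -np.log(np.diag(probs)).mean()


rng = np.random.default_rng(0)
text_feat = rng.normal(size=(4, 8))   # 4 text features
multi_feat = rng.normal(size=(4, 8))  # 4 paired multimodal features

logits = text_feat @ multi_feat.T     # text-to-multimodal similarities
# logits.T is the multimodal-to-text matrix: same entries, transposed.

loss_t2m = ce_diag(logits)    # softmax over rows of logits
loss_m2t = ce_diag(logits.T)  # softmax over columns of logits
# The two losses differ because the normalization axis differs;
# the symmetric contrastive loss averages them.
loss = (loss_t2m + loss_m2t) / 2
```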
The matching process is provided in our released code: https://github.com/invictus717/MiCo/blob/89c91c9dac68125a18a1a966bd80f9e74e584e80/model/mico.py#L44